Dataset download
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
UCI Wine Quality Data set
Attribute information
- 1 - fixed acidity
- 2 - volatile acidity
- 3 - citric acid
- 4 - residual sugar
- 5 - chlorides
- 6 - free sulfur dioxide
- 7 - total sulfur dioxide
- 8 - density
- 9 - pH
- 10 - sulphates
- 11 - alcohol
- 12 - quality (score between 0 and 10)
Code implementation
Check the CSV data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# The UCI wine quality CSV uses ';' as the delimiter.
wine_data = pd.read_csv('winequality-white.csv', delimiter=';', dtype=float)
wine_data.head(10)
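Before slicing the data it can help to look at how the raw quality scores are distributed, since the threshold of 7 used below makes this an imbalanced problem. A minimal sketch, reusing the wine_data DataFrame loaded above:
# Distribution of the raw quality scores (0-10 scale).
print(wine_data['quality'].value_counts().sort_index())
# Fraction of wines that will be labeled 1 (quality >= 7).
print((wine_data['quality'] >= 7).mean())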
Slice the data and binarize the quality variable.
x_data = wine_data.iloc[:, 0:-1]
y_data = wine_data.iloc[:, -1]
# If the quality score is less than 7, label it 0; if 7 or greater, label it 1.
y_data = np.array([1 if i >= 7 else 0 for i in y_data])
x_data.head(5)
# Split into train and test sets.
train_x, test_x, train_y, test_y = train_test_split(x_data, y_data, test_size=0.3, random_state=42)
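Because the quality >= 7 wines are the minority class, a stratified split keeps the class ratio equal in the train and test sets. A hedged alternative sketch of the same split (stratify is an addition, not part of the original code):
# Same split, but preserving the 0/1 class ratio in both sets.
train_x, test_x, train_y, test_y = train_test_split(
    x_data, y_data, test_size=0.3, random_state=42, stratify=y_data)
# np.bincount(train_y) and np.bincount(test_y) can be used to confirm the ratios.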
KNN
Model building
from sklearn.neighbors import KNeighborsClassifier
# k=2 nearest neighbors with Euclidean distance.
clf = KNeighborsClassifier(n_neighbors=2, metric='euclidean')
clf.fit(train_x,train_y)
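KNN is distance-based, so features on very different scales (e.g. total sulfur dioxide vs. density) can dominate the Euclidean distance. A minimal sketch that adds standardization in front of the same classifier (a suggested variant, not from the original post):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean / unit variance before computing distances.
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=2, metric='euclidean'))
scaled_knn.fit(train_x, train_y)
print("Scaled test accuracy:", scaled_knn.score(test_x, test_y))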
Confusion Matrix
from sklearn.metrics import confusion_matrix
y_pred_test = clf.predict(test_x)
confusion = confusion_matrix(test_y, y_pred_test)
print("confusion_matrix\n{}".format(confusion))
Performance
y_pred_train = clf.predict(train_x)
y_pred_test = clf.predict(test_x)
y_pred_test2 = clf.predict_proba(test_x)
print("Train Data:", accuracy_score(train_y, y_pred_train))
print("Test Data" , accuracy_score(test_y, y_pred_test))
from sklearn.metrics import classification_report
y_true, y_pred = test_y, clf.predict(test_x)
print(classification_report(y_true, y_pred))
ROC CURVE
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(test_y, y_pred_test2[:, 1], pos_label=1)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
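Note that with n_neighbors=2 the predicted probabilities can only take the values 0, 0.5, and 1, so the ROC curve has very few distinct thresholds; larger k values give a smoother curve. The AUC can also be computed in one call, a small sketch:
from sklearn.metrics import roc_auc_score

# Same value as auc(fpr, tpr) above.
print("AUC:", roc_auc_score(test_y, y_pred_test2[:, 1]))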
Finding the optimal K
training_accuracy = []
test_accuracy = []
# Try n_neighbors from 1 to 19.
neighbors_settings = range(1, 20)
for n_neighbors in neighbors_settings:
    # Build the model.
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(train_x, train_y)
    # Record training-set accuracy.
    training_accuracy.append(clf.score(train_x, train_y))
    # Record generalization (test-set) accuracy.
    test_accuracy.append(clf.score(test_x, test_y))
plt.plot(neighbors_settings, training_accuracy, label="train accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.show()
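Picking k by looking at test-set accuracy reuses the test data for model selection; cross-validation on the training set avoids that. A minimal sketch using GridSearchCV as an alternative to the loop above (not from the original post):
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': list(range(1, 20))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(train_x, train_y)
print("Best k:", grid.best_params_['n_neighbors'])
print("CV accuracy:", grid.best_score_)
print("Test accuracy:", grid.score(test_x, test_y))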