Dataset download

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/


UCI Wine Quality Data set

Attribute information

  • 1 - fixed acidity
  • 2 - volatile acidity
  • 3 - citric acid
  • 4 - residual sugar
  • 5 - chlorides
  • 6 - free sulfur dioxide
  • 7 - total sulfur dioxide
  • 8 - density
  • 9 - pH
  • 10 - sulphates
  • 11 - alcohol
  • 12 - quality (score between 0 and 10)
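The white-wine file used in the code below can also be read straight from the UCI directory above; a minimal sketch (the file name winequality-white.csv is taken from the loading code further down):

import pandas as pd

# Read the white-wine CSV directly from the UCI directory; ';' is the field separator.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
wine_data = pd.read_csv(url, delimiter=';')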

Code implementation


Check the CSV data.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# The CSV uses ';' as the field separator, not ','.
wine_data = pd.read_csv('winequality-white.csv', delimiter=';', dtype=float)
wine_data.head(10)



Split the data and recode the quality variable.


x_data = wine_data.iloc[:, 0:-1]
y_data = wine_data.iloc[:, -1]

# Recode the quality score: 0 if the score is below 7, 1 if it is 7 or higher.
y_data = np.array([1 if i >= 7 else 0 for i in y_data])

x_data.head(5)
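Only wines scoring 7 or above become class 1, so the labels are imbalanced; a quick count (a minimal sketch) helps put the accuracy figures later on in context.

import numpy as np

# How many samples fall into each class after recoding.
classes, counts = np.unique(y_data, return_counts=True)
print(dict(zip(classes, counts)))
print("positive ratio:", counts[1] / counts.sum())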


# Split into train and test sets (70% / 30%).
train_x, test_x, train_y, test_y = train_test_split(x_data, y_data, test_size=0.3, random_state=42)
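KNN with a Euclidean metric is sensitive to feature scale (alcohol and total sulfur dioxide live on very different ranges), so standardizing is often worth trying. The original pipeline skips this step; the sketch below is an optional variant, with train_x_scaled / test_x_scaled as new names.

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
train_x_scaled = scaler.fit_transform(train_x)
test_x_scaled = scaler.transform(test_x)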


KNN


Model building


from sklearn.neighbors import KNeighborsClassifier

# KNN classifier with k=2 and Euclidean distance.
clf = KNeighborsClassifier(n_neighbors=2, metric='euclidean')
clf.fit(train_x, train_y)



Confusion Matrix


from sklearn.metrics import confusion_matrix

# Test-set predictions (also reused in the Performance section below).
y_pred_test = clf.predict(test_x)
confusion = confusion_matrix(test_y, y_pred_test)
print("confusion_matrix\n{}".format(confusion))



Performance


# Predictions and class probabilities.
y_pred_train = clf.predict(train_x)
y_pred_test = clf.predict(test_x)
y_pred_test2 = clf.predict_proba(test_x)

print("Train Data:", accuracy_score(train_y, y_pred_train))
print("Test Data:", accuracy_score(test_y, y_pred_test))



from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the test set.
y_true, y_pred = test_y, clf.predict(test_x)
print(classification_report(y_true, y_pred))


ROC Curve


from sklearn.metrics import roc_curve, auc

# ROC curve based on the predicted probability of the positive class.
fpr, tpr, thresholds = roc_curve(test_y, y_pred_test2[:, 1], pos_label=1)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
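As a cross-check, roc_auc_score computes the same area directly from the probabilities without building the curve by hand:

from sklearn.metrics import roc_auc_score

# Should match the auc(fpr, tpr) value shown in the plot legend.
print("ROC AUC:", roc_auc_score(test_y, y_pred_test2[:, 1]))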



Finding the optimal K


training_accuracy = []
test_accuracy = []

# Try n_neighbors from 1 to 19.
neighbors_settings = range(1, 20)

for n_neighbors in neighbors_settings:
    # Build the model.
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(train_x, train_y)
    # Record training-set accuracy.
    training_accuracy.append(clf.score(train_x, train_y))
    # Record generalization (test-set) accuracy.
    test_accuracy.append(clf.score(test_x, test_y))

plt.plot(neighbors_settings, training_accuracy, label="train accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.show()
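The sweep above picks K by looking at test accuracy, which leaks information from the test set. A cross-validated search over the same range (a sketch using GridSearchCV, not part of the original post) keeps the test set untouched until the final evaluation:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validated search over K on the training data only.
param_grid = {'n_neighbors': list(range(1, 20))}
grid = GridSearchCV(KNeighborsClassifier(metric='euclidean'), param_grid, cv=5, scoring='accuracy')
grid.fit(train_x, train_y)
print("best K:", grid.best_params_['n_neighbors'])
print("best CV accuracy:", grid.best_score_)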










