Multiple Linear Regression on the Boston House Prices dataset.
Attribute information
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV median value of owner-occupied homes in $1000's
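import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version

# Load the dataset and combine the features and the target into one DataFrame
boston = load_boston()
dfX = pd.DataFrame(boston.data, columns=boston.feature_names)
dfy = pd.DataFrame(boston.target, columns=["MEDV"])
df_house = pd.concat([dfX, dfy], axis=1)
df_house.tail(10)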
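On a current scikit-learn install the loader above is not available, so here is a rough sketch of one way to rebuild the same DataFrame from the raw file. It assumes the layout described in scikit-learn's own deprecation notice: the file at http://lib.stat.cmu.edu/datasets/boston stores each record on two rows (11 values, then 3). The function names `reshape_boston` and `load_boston_from_url` are made up for this example, and the download step is not run here.

```python
import numpy as np
import pandas as pd

FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

def reshape_boston(raw):
    """Turn the two-rows-per-record array into one DataFrame (hypothetical helper)."""
    raw = np.asarray(raw, dtype=float)
    # Even rows hold the first 11 features; odd rows hold B, LSTAT and MEDV.
    data = np.hstack([raw[::2, :], raw[1::2, :2]])
    target = raw[1::2, 2]
    df = pd.DataFrame(data, columns=FEATURE_NAMES)
    df["MEDV"] = target
    return df

def load_boston_from_url(url="http://lib.stat.cmu.edu/datasets/boston"):
    """Download and parse the raw file (assumes 22 header lines, as in the notice)."""
    raw_df = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
    return reshape_boston(raw_df.values)
```

With network access, `df_house = load_boston_from_url()` should then stand in for the DataFrame built above.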
Summarizing the data with DataFrame's describe().
df_house.describe()
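To see what describe() returns without loading the full dataset, here is a tiny sketch on a made-up three-row DataFrame. The values are invented for illustration and are not taken from the Boston data.

```python
import pandas as pd

# Toy data (invented values), only to show the shape of the summary
toy = pd.DataFrame({"RM": [5.0, 6.0, 7.0], "MEDV": [20.0, 25.0, 30.0]})

# describe() returns one row per statistic and one column per numeric feature
summary = toy.describe()
print(summary)

# Individual statistics can be read with .loc[row, column]
print(summary.loc["mean", "RM"])
```

The rows are count, mean, std, min, the 25%/50%/75% quantiles, and max, so it is a quick way to spot odd ranges or scales before modeling.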
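Pairplot with Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Plot pairwise relationships for a few selected columns
columns = ["ZN", "NOX", "RM", "MEDV"]
sns.pairplot(df_house[columns])
plt.show()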
Building the training and test sets and training the model
import sklearn.model_selection
from sklearn.linear_model import LinearRegression

# Split into X and y.
x_data = df_house.iloc[:, 0:-1]  # every column except MEDV
y_data = df_house.iloc[:, -1]    # MEDV
x_data.tail(5)
# Split into train and test data.
# Method 1: sequential slice (no shuffling)
# ratio = 0.7
# num_data = int(len(df_house) * ratio)
# train_x, test_x = x_data[:num_data], x_data[num_data:]  # slicing
# train_y, test_y = y_data[:num_data], y_data[num_data:]
# Method 2: random split (train data = 67%, test data = 33%)
train_x, test_x, train_y, test_y = sklearn.model_selection.train_test_split(x_data, y_data, test_size=0.33)
# Create the LinearRegression model
m_reg = LinearRegression(fit_intercept=True)  # fit_intercept: whether to fit the intercept beta0
# Train the model
m_reg.fit(train_x, train_y)
# Predict on the test data.
y_pred = m_reg.predict(test_x)
print(m_reg.score(test_x, test_y))  # R^2 on the test set
# Output (varies between runs because the split is random)
# 0.6704720247887545
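The number printed by score() is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below checks that by hand on synthetic data (the data and the names `rng`, `model`, `r2_manual` are assumptions for this example, not part of the post). It also passes `random_state` so that the split is the same on every run.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (invented for illustration)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

# random_state fixes the shuffle, so the split is reproducible
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.33, random_state=0
)

model = LinearRegression().fit(train_x, train_y)
y_pred = model.predict(test_x)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((test_y - y_pred) ** 2)
ss_tot = np.sum((test_y - test_y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(test_x, test_y))
```

Both printed values should agree up to floating-point error. Without `random_state`, the score changes from run to run, which is why a single number like the one above is hard to reproduce exactly.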
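OLS approach (statsmodels)
import statsmodels.api as sm

# Note: sm.OLS does not add an intercept on its own, so this fit has no constant term.
# This also rebinds m_reg, replacing the LinearRegression model above.
m_reg = sm.OLS(train_y, train_x).fit()
# Print out the statistics
m_reg.summary()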
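Computing the Mean Squared Error.
import sklearn.metrics

# m_reg is now the statsmodels OLS fit, so these are the OLS predictions
Y_pred = m_reg.predict(test_x)
mse = sklearn.metrics.mean_squared_error(test_y, Y_pred)
print(mse)
# Output
# 25.67338712126145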
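For reference, the MSE is just the mean of the squared differences between the actual and predicted values. A small sketch with NumPy on made-up numbers (the helper name `mean_squared_error_manual` and the values are assumptions for illustration):

```python
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    """Mean of squared errors, computed directly (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy values: errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
y_true = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]

mse = mean_squared_error_manual(y_true, y_hat)
rmse = np.sqrt(mse)  # same units as the target
print(mse, rmse)
```

Because MEDV is in $1000's, the square root is easier to read: an MSE of about 25.67 corresponds to an RMSE of about 5.07, a typical error of roughly $5,000.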
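Scatter plot of predicted vs. actual values
import numpy as np

# y_pred holds the LinearRegression predictions computed earlier
plt.scatter(test_y, y_pred)
# Red reference line y = x: points on it are perfect predictions
line = np.linspace(min(test_y), max(test_y), 1000)
plt.plot(line, line, color = 'r')
plt.xlabel('Test_value')
plt.ylabel('Pred_value')
plt.show()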