Multiple Linear Regression with the Boston House Prices Dataset


Attribute information

  • CRIM: per capita crime rate by town
  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS: proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX: nitric oxides concentration (parts per 10 million)
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built prior to 1940
  • DIS: weighted distances to five Boston employment centres
  • RAD: index of accessibility to radial highways
  • TAX: full-value property-tax rate per $10,000
  • PTRATIO: pupil-teacher ratio by town
  • B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  • LSTAT: % lower status of the population
  • MEDV: median value of owner-occupied homes in $1000's (the regression target)

Imports

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
import sklearn
import sklearn.metrics          # needed for sklearn.metrics.mean_squared_error
import sklearn.model_selection  # needed for sklearn.model_selection.train_test_split
from sklearn.linear_model import LinearRegression

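Loading the dataset.

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2.
from sklearn.datasets import load_boston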

boston = load_boston()

dfX = pd.DataFrame(boston.data, columns=boston.feature_names)

dfy = pd.DataFrame(boston.target, columns=["MEDV"])

df_house = pd.concat([dfX, dfy], axis=1)

df_house.tail(10)
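On scikit-learn 1.2 or later, load_boston is no longer available. The sketch below follows the workaround suggested in scikit-learn's deprecation notice, which reads the raw file from the original CMU StatLib source, where each record is spread over two rows. The helper name build_house_frame is hypothetical (it is not part of any library), the download is left commented out because it needs network access, and the demo input is made-up numbers that only mimic the layout.

```python
import numpy as np
import pandas as pd

# Column names as listed in the attribute information above.
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def build_house_frame(raw_df):
    """Hypothetical helper: rebuild df_house from the two-rows-per-record raw table."""
    values = raw_df.values
    # Even rows hold the first 11 features; odd rows hold the last 2 features and MEDV.
    data = np.hstack([values[::2, :], values[1::2, :2]])
    target = values[1::2, 2]
    df = pd.DataFrame(data, columns=FEATURE_NAMES)
    df["MEDV"] = target
    return df


# Assumed usage, as given in scikit-learn's deprecation notice (requires network access):
# raw_df = pd.read_csv("http://lib.stat.cmu.edu/datasets/boston",
#                      sep=r"\s+", skiprows=22, header=None)
# df_house = build_house_frame(raw_df)

# Synthetic stand-in with the same layout (made-up numbers, not real data).
fake_raw = pd.DataFrame(np.arange(44, dtype=float).reshape(4, 11))
fake_raw.iloc[1::2, 3:] = np.nan  # odd rows only carry three values
demo = build_house_frame(fake_raw)
print(demo.shape)  # (2, 14)
```

The resulting frame has the same 13 feature columns plus MEDV as df_house above, so the rest of the post can be followed unchanged.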



Summarizing the data with DataFrame.describe().


df_house.describe()



Pair plot with Seaborn


columns = ["ZN", "NOX", "RM", "MEDV"]

sns.pairplot(df_house[columns])

plt.show()




Building the training and test sets and fitting the model


# Split the data into X (features) and y (target, MEDV).

x_data = df_house.iloc[:,0:-1]

y_data = df_house.iloc[:,-1]

x_data.tail(5)


# Split the data into training and test sets.


# Method 1: sequential split by slicing (no shuffling)

# ratio = 0.7

# num_data = int(len(df_house)* ratio)

# train_X, test_X = x_data[:num_data], x_data[num_data:] #slicing

# train_y, test_y = y_data[:num_data], y_data[num_data:]


# Method 2: random split (train data = 67%, test data = 33%)

train_x, test_x, train_y, test_y = sklearn.model_selection.train_test_split(x_data, y_data, test_size = 0.33)
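Because train_test_split shuffles the rows randomly, the scores reported in this post change from run to run. A minimal sketch of making the split repeatable with random_state is shown here. The value 42 is an arbitrary choice, and the arrays are synthetic stand-ins with the same number of rows (506) and features (13) as the Boston data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston feature matrix (506 x 13).
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
y = rng.normal(size=506)

# random_state=42 is arbitrary; any fixed integer gives a repeatable split.
split_a = train_test_split(X, y, test_size=0.33, random_state=42)
split_b = train_test_split(X, y, test_size=0.33, random_state=42)

train_x, test_x, train_y, test_y = split_a
print(train_x.shape, test_x.shape)  # (339, 13) (167, 13)

# Both calls return exactly the same partition.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```

With test_size = 0.33 and 506 rows, the test set gets 167 rows and the training set the remaining 339.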


# Create the LinearRegression model

m_reg = LinearRegression(fit_intercept = True) # fit_intercept: whether to fit the intercept term (beta0)

# Train the model

m_reg.fit(train_x, train_y) 


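# Predict on the test data.

y_pred = m_reg.predict(test_x)

# score() returns the coefficient of determination (R^2) on the test set.
print(m_reg.score(test_x, test_y))

# Output (varies between runs because the split is random)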

# 0.6704720247887545
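The value printed by score() is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The following sketch recomputes it by hand and compares it with score(). It uses synthetic data (the coefficients and noise level are made up for illustration), not the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (made-up coefficients and noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50.
model = LinearRegression().fit(X[:150], y[:150])
X_test, y_test = X[150:], y[150:]
y_hat = model.predict(X_test)

ss_res = np.sum((y_test - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = model.score(X_test, y_test)

print(np.isclose(r2_manual, r2_sklearn))  # True
```

Read this way, the score of about 0.67 above means the model explains roughly two thirds of the variance of MEDV in that particular test split.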


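The OLS approach (statsmodels)


# Note: sm.OLS does not add an intercept on its own; pass sm.add_constant(train_x)
# to include one. This line also rebinds m_reg, replacing the scikit-learn model.
m_reg = sm.OLS(train_y, train_x).fit()

# Print the regression statistics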

m_reg.summary()

 


Computing the mean squared error.


# m_reg is now the statsmodels OLS result (fitted without an intercept).
Y_pred = m_reg.predict(test_x)

mse = sklearn.metrics.mean_squared_error(test_y, Y_pred)

print(mse)

# Output (varies between runs because the split is random)

# 25.67338712126145


Scatter plot of predicted vs. actual values


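# y_pred holds the scikit-learn LinearRegression predictions computed earlier
# (not Y_pred from the OLS model); points on the red line are perfect predictions.
plt.scatter(test_y, y_pred)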

line = np.linspace(min(test_y), max(test_y), 1000)

plt.plot(line, line, color = 'r')

plt.xlabel('Test_value')

plt.ylabel('Pred_value')

plt.show()


