Multiple Linear Regression on the Boston House Prices dataset with Python

2018. 6. 4. 16:39

Multiple linear regression using the Boston House Prices dataset.

Attribute information

CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
PTRATIO pupil-teacher ratio by town
B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT % lower status of the population
MEDV median value of owner-occupied homes in $1000's

Imports

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib as mpl

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

import statsmodels.api as sm

import scipy.stats as stats

import sklearn.model_selection  # provides train_test_split

import sklearn.metrics  # provides mean_squared_error

Loading the dataset.

# Note: load_boston was removed in scikit-learn 1.2; this code needs an older release.

from sklearn.datasets import load_boston

boston = load_boston()

dfX = pd.DataFrame(boston.data, columns=boston.feature_names)

dfy = pd.DataFrame(boston.target, columns=["MEDV"])

df_house = pd.concat([dfX, dfy], axis=1)

df_house.tail(10)
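
The same DataFrame construction can be tried on stand-in data, for example on a scikit-learn release where load_boston is unavailable. The sketch below is only an illustration: the random values, the three-column subset, and the name df_demo are assumptions, not the real Boston data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: random numbers, NOT the real Boston values.
rng = np.random.default_rng(0)
feature_names = ["CRIM", "RM", "LSTAT"]          # illustrative subset of the columns
data = rng.normal(size=(8, len(feature_names)))  # plays the role of boston.data
target = rng.normal(size=8)                      # plays the role of boston.target

dfX = pd.DataFrame(data, columns=feature_names)
dfy = pd.DataFrame(target, columns=["MEDV"])

# axis=1 concatenates column-wise, so the target sits next to the features.
df_demo = pd.concat([dfX, dfy], axis=1)
print(df_demo.shape)
print(df_demo.tail(3))
```

Because both frames share the default integer index, the column-wise concat lines the rows up one to one.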

Summarizing the data with DataFrame.describe().

df_house.describe()
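
As a small illustration of what describe() returns for numeric columns, the sketch below runs it on a tiny hand-made frame. The values and the name df_demo are made up for the example.

```python
import pandas as pd

# Hypothetical toy values, chosen only to make the summary easy to check by hand.
df_demo = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "MEDV": [15.0, 20.0, 30.0, 45.0],
})

summary = df_demo.describe()

# One row per statistic, one column per numeric variable.
print(summary.index.tolist())
print(summary.loc["mean", "RM"])
```

The result is itself a DataFrame, so individual statistics can be read with .loc as shown.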

Pairplot with Seaborn

columns = ["ZN", "NOX", "RM", "MEDV"]

sns.pairplot(df_house[columns])

plt.show()
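
A pairplot shows pairwise relationships visually; a correlation matrix is one way to put numbers next to it. The sketch below uses synthetic data in which MEDV is constructed to rise with RM. That relationship and all values are assumptions for illustration, not measurements from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: MEDV is built as a noisy linear function of RM.
rng = np.random.default_rng(0)
rm = rng.normal(6.0, 0.7, size=200)
medv = 9.0 * rm - 30.0 + rng.normal(0.0, 3.0, size=200)
df_demo = pd.DataFrame({"RM": rm, "MEDV": medv})

# Pearson correlation between every pair of columns.
corr = df_demo.corr()
print(corr)
```

On the real data, the same call would be df_house[columns].corr().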

Building the train and test sets and creating the model

# Split the data into X and y.

x_data = df_house.iloc[:,0:-1]

y_data = df_house.iloc[:,-1]

x_data.tail(5)

# Split into train and test data.

# Method 1

# ratio = 0.7

# num_data = int(len(df_house)* ratio)

# train_X, test_X = x_data[:num_data], x_data[num_data:] #slicing

# train_y, test_y = y_data[:num_data], y_data[num_data:]

# Method 2 (train data = 67%, test data = 33%)

train_x, test_x, train_y, test_y = sklearn.model_selection.train_test_split(x_data, y_data, test_size = 0.33)
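
train_test_split shuffles the rows before splitting, so each run produces a different split unless a seed is fixed. The sketch below uses placeholder arrays with the same number of rows as the Boston data (506); the seed value 42 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 506 rows, matching the size of the Boston data.
X = np.arange(506 * 2).reshape(506, 2)
y = np.arange(506)

# random_state fixes the shuffle, making the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)
print(len(X_tr), len(X_te))

# Repeating the call with the same seed yields the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.33, random_state=42)
print(np.array_equal(X_te, X_te2))
```

Without random_state, scores computed on the test set will vary from run to run.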

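# Create the LinearRegression model

m_reg = LinearRegression(fit_intercept = True) # fit_intercept -> whether to fit the intercept beta0

# Train the model

m_reg.fit(train_x, train_y)

# Predict on the test data.

y_pred = m_reg.predict(test_x)

# score() returns the coefficient of determination R^2 on the given data.

print(m_reg.score(test_x, test_y))

# Output (varies from run to run because the split is random)

# 0.6704720247887545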
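
For a regressor, score() reports R^2 = 1 - SS_res / SS_tot. The sketch below checks this by hand on synthetic data; the coefficients 3, 2, -1 and the noise level are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.5, size=100)

model = LinearRegression(fit_intercept=True).fit(X, y)
y_hat = model.predict(X)

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```

A value near 1 means the model explains most of the variance; a value near 0 means it does little better than predicting the mean.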

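OLS approach (statsmodels)

# Note: sm.OLS does not add an intercept by itself; use sm.add_constant(train_x) to include one.

# This line also rebinds m_reg from the scikit-learn model to the statsmodels result.

m_reg = sm.OLS(train_y, train_x).fit()

# Print out the statistics

m_reg.summary()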

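Computing the Mean Squared Error.

# m_reg now refers to the statsmodels OLS fit from the previous step.

Y_pred = m_reg.predict(test_x)

mse = sklearn.metrics.mean_squared_error(test_y, Y_pred)

print(mse)

# Output (varies from run to run because the split is random)

# 25.67338712126145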
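
The mean squared error is the average of the squared differences between actual and predicted values. The sketch below computes it by hand on four made-up numbers and compares the result with sklearn.metrics.mean_squared_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values; squared errors are 0.25, 0.25, 0.0 and 1.0.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

# The square root (RMSE) is in the same units as the target.
rmse = np.sqrt(mse_sklearn)

print(mse_manual, mse_sklearn, rmse)
```

Since MEDV is measured in thousands of dollars, the RMSE of a model on this dataset reads directly as a typical error in thousands of dollars.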

Scatter plot of predicted versus actual values

# y_pred holds the scikit-learn LinearRegression predictions computed earlier.

plt.scatter(test_y, y_pred)

line = np.linspace(min(test_y), max(test_y), 1000)

plt.plot(line, line, color = 'r')

plt.xlabel('Test_value')

plt.ylabel('Pred_value')

plt.show()
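
A residual plot is a common companion to the predicted-versus-actual scatter plot, since systematic patterns around zero are easier to spot. The sketch below uses synthetic values in place of test_y and y_pred, and selects the non-interactive Agg backend only so that it runs without a display; both are assumptions for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for test_y and y_pred.
rng = np.random.default_rng(0)
actual = rng.uniform(5.0, 50.0, size=50)
predicted = actual + rng.normal(0.0, 3.0, size=50)

residuals = actual - predicted

fig, ax = plt.subplots()
ax.scatter(predicted, residuals)
ax.axhline(0.0, color="r")  # residuals of a well-behaved fit scatter around this line
ax.set_xlabel("Pred_value")
ax.set_ylabel("Residual (Test_value - Pred_value)")
# In an interactive session, call plt.show() here.
```

A funnel shape or a curve in this plot would suggest that a plain linear model is missing some structure in the data.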
