Forecasting House Sales Prices in Ames, Iowa: A Predictive Model based on Features and Proximity to Main Infrastructures¶

Putranegara Riauwindu, putrangr@bu.edu

Abstract:

This project presents the development of a predictive model for forecasting house sales prices in Ames, Iowa. The model utilizes essential house features, such as the number of bedrooms, bathrooms, and basement, in combination with the proximity to main infrastructure, such as roads and railways.

To construct and evaluate the model, we employ the comprehensive Ames Housing dataset compiled by Dean De Cock (accessible through this link : https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data). The forecasting mechanism leverages the Multiple Linear Regression algorithms, which considers the selected features' collective impact on the predicted response. For enhanced accuracy, the Stepwise Regression technique is utilized for feature selection, identifying the most significant contributors. Please note that the Ames Housing dataset comes with two separate file of train and test set. We will use train set to build the model and find the model with best performance and make prediction of house SalePrice in Ames using of the best model on the test set.

To ensure dataset reliability and integrity without compromising the dataset originality by minimizing bias, missing values within the dataset will only be imputed if the missing value does not exceed 10% out of all available value within that particular columns. Imputation will employ Multiple Linear Regression for continous variables and k Nearest Neighbors for categorical/ordinal variables. We will use the combination of matplotlib, seaborn, and plotly as a means of visualizing the analysis and model (if necessary).

The final output is a Multiple Linear Regression Model capable of predicting house sales prices in Ames based on specified features. This model offers valuable insights for prospective buyers and sellers in the Ames, Iowa real estate market.

Keywords: Predictive Model, House Sales Prices, Ames, Iowa, House Features, Proximity, Infrastructure, Multiple Linear Regression, Stepwise Regression, Imputation, KNN.

1. Importing Relevant Packages and Configuring Initial Setting for Jupyter Notebook¶

In [1]:
# Packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
import missingno as msno
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics
import plotly.express as px
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, explained_variance_score
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
import time
from sklearn.impute import SimpleImputer
from mlxtend.feature_selection import SequentialFeatureSelector

# Output Display Setting
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
warnings.filterwarnings("ignore", category=UserWarning)

Setting Plotly Display Configuration

In [2]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

2. Importing and Handling Missing Value in Training Dataset¶

2.1 Importing Dataset into Python Environment¶

In [3]:
data = pd.read_csv('train.csv')
data.shape
Out[3]:
(1460, 81)
In [4]:
data.head()
Out[4]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000

The training dataset contains 1460 observations with 80 total feature columns (excluding the response column which is the 'SalePrice'.

We will now proceed to check whether there are missing values in the training set.

2.2 Checking for Missing Value¶

In [5]:
# Calculating missing value for all of the columns
missing_values = data.isna().sum()/data.shape[0]

# Filter columns with more than 0 missing values
columns_with_missing = missing_values[missing_values > 0].sort_values(ascending=False)

# Display the filtered result
print(columns_with_missing)
PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
FireplaceQu     0.472603
LotFrontage     0.177397
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
GarageCond      0.055479
BsmtExposure    0.026027
BsmtFinType2    0.026027
BsmtFinType1    0.025342
BsmtCond        0.025342
BsmtQual        0.025342
MasVnrArea      0.005479
MasVnrType      0.005479
Electrical      0.000685
dtype: float64
In [6]:
# Visualizing the training dataset missing values
pastel_colors = sns.color_palette("pastel")
msno.bar(data, color=pastel_colors)
plt.show()

The plot and calculation above showed that there are severals columns identified to having multiple missing values which are:

  1. PoolQC
  2. MiscFeature
  3. Alley
  4. Fence
  5. FireplaceQu
  6. LotFrontage
  7. GarageYrBlt
  8. GarageCond: Garage condition
  9. GarageType: Garage location
  10. GarageFinish: Interior finish of the garage
  11. GarageQual: Garage quality
  12. BsmtFinType2: Rating of basement finished area (if multiple types)
  13. BsmtExposure: Refers to walkout or garden level walls
  14. BsmtQual: Evaluates the height of the basement
  15. BsmtCond: Evaluates the general condition of the basement
  16. BsmtFinType1: Rating of basement finished area
  17. MasVnrArea: Masonry veneer area in square feet
  18. MasVnrType: Masonry veneer type
  19. Electrical: electrical system

After reviewing the dataset description, it appears that only the columns "LotFrontage" and "Electrical" contain genuine missing values. The other columns were previously identified to have apparent missing values due to the presence of "NA" values, but these "NA" values actually indicate the unavailability of a specific feature in those particular observations. Therefore, these "NA" values are not true missing values.

In summary, the dataset has genuine missing values in "LotFrontage" and "Electrical", while the "NA" values in other columns do not represent missing data but rather indicate the absence of certain features.

Due to the size of the missing values, the observations that contains missing value in column "LotFrontage" will be removed while the one in "Electrical" column will be imputed using KNN model in order to minimize bias. The missing value will be imputed first before we explore the dataset in the Exploratory Data Analysis (EDA) phase due to its only being one observation.

2.3 Handling Missing Value¶

2.3.1 Adjusting the NA values in the column that imply the absence of that particular feature

In [7]:
# Columns to exclude from imputation
exclude_columns = ['LotFrontage', 'Electrical', 'MasVnrArea', 'GarageYrBlt']
zero_columns = ['MasVnrArea', 'GarageYrBlt']

# Fill missing values with "Not Available" for non-excluded columns
data[data.columns.difference(exclude_columns)] = data[data.columns.difference(exclude_columns)].fillna("Not Available")

# Fill missing values in 'MasVnrArea' and 'GarageYrBlt' with 0 as they correspond to the associated related columns
data['MasVnrArea'].fillna(0, inplace=True)
data['GarageYrBlt'].fillna(0, inplace=True)

2.3.2 Checking whether the designated column`s missing value have been properly adjusted

In [8]:
# List of columns with categorical nature
categorical_vars = data.select_dtypes(include='object').columns.tolist()

# Loop through each categorical column and print unique categories
for col in categorical_vars:
    num_categories = data[col].unique()
    print(f"Column '{col}' has unique categories: {num_categories}")
Column 'MSZoning' has unique categories: ['RL' 'RM' 'C (all)' 'FV' 'RH']
Column 'Street' has unique categories: ['Pave' 'Grvl']
Column 'Alley' has unique categories: ['Not Available' 'Grvl' 'Pave']
Column 'LotShape' has unique categories: ['Reg' 'IR1' 'IR2' 'IR3']
Column 'LandContour' has unique categories: ['Lvl' 'Bnk' 'Low' 'HLS']
Column 'Utilities' has unique categories: ['AllPub' 'NoSeWa']
Column 'LotConfig' has unique categories: ['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
Column 'LandSlope' has unique categories: ['Gtl' 'Mod' 'Sev']
Column 'Neighborhood' has unique categories: ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
Column 'Condition1' has unique categories: ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
Column 'Condition2' has unique categories: ['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
Column 'BldgType' has unique categories: ['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
Column 'HouseStyle' has unique categories: ['2Story' '1Story' '1.5Fin' '1.5Unf' 'SFoyer' 'SLvl' '2.5Unf' '2.5Fin']
Column 'RoofStyle' has unique categories: ['Gable' 'Hip' 'Gambrel' 'Mansard' 'Flat' 'Shed']
Column 'RoofMatl' has unique categories: ['CompShg' 'WdShngl' 'Metal' 'WdShake' 'Membran' 'Tar&Grv' 'Roll'
 'ClyTile']
Column 'Exterior1st' has unique categories: ['VinylSd' 'MetalSd' 'Wd Sdng' 'HdBoard' 'BrkFace' 'WdShing' 'CemntBd'
 'Plywood' 'AsbShng' 'Stucco' 'BrkComm' 'AsphShn' 'Stone' 'ImStucc'
 'CBlock']
Column 'Exterior2nd' has unique categories: ['VinylSd' 'MetalSd' 'Wd Shng' 'HdBoard' 'Plywood' 'Wd Sdng' 'CmentBd'
 'BrkFace' 'Stucco' 'AsbShng' 'Brk Cmn' 'ImStucc' 'AsphShn' 'Stone'
 'Other' 'CBlock']
Column 'MasVnrType' has unique categories: ['BrkFace' 'None' 'Stone' 'BrkCmn' 'Not Available']
Column 'ExterQual' has unique categories: ['Gd' 'TA' 'Ex' 'Fa']
Column 'ExterCond' has unique categories: ['TA' 'Gd' 'Fa' 'Po' 'Ex']
Column 'Foundation' has unique categories: ['PConc' 'CBlock' 'BrkTil' 'Wood' 'Slab' 'Stone']
Column 'BsmtQual' has unique categories: ['Gd' 'TA' 'Ex' 'Not Available' 'Fa']
Column 'BsmtCond' has unique categories: ['TA' 'Gd' 'Not Available' 'Fa' 'Po']
Column 'BsmtExposure' has unique categories: ['No' 'Gd' 'Mn' 'Av' 'Not Available']
Column 'BsmtFinType1' has unique categories: ['GLQ' 'ALQ' 'Unf' 'Rec' 'BLQ' 'Not Available' 'LwQ']
Column 'BsmtFinType2' has unique categories: ['Unf' 'BLQ' 'Not Available' 'ALQ' 'Rec' 'LwQ' 'GLQ']
Column 'Heating' has unique categories: ['GasA' 'GasW' 'Grav' 'Wall' 'OthW' 'Floor']
Column 'HeatingQC' has unique categories: ['Ex' 'Gd' 'TA' 'Fa' 'Po']
Column 'CentralAir' has unique categories: ['Y' 'N']
Column 'Electrical' has unique categories: ['SBrkr' 'FuseF' 'FuseA' 'FuseP' 'Mix' nan]
Column 'KitchenQual' has unique categories: ['Gd' 'TA' 'Ex' 'Fa']
Column 'Functional' has unique categories: ['Typ' 'Min1' 'Maj1' 'Min2' 'Mod' 'Maj2' 'Sev']
Column 'FireplaceQu' has unique categories: ['Not Available' 'TA' 'Gd' 'Fa' 'Ex' 'Po']
Column 'GarageType' has unique categories: ['Attchd' 'Detchd' 'BuiltIn' 'CarPort' 'Not Available' 'Basment' '2Types']
Column 'GarageFinish' has unique categories: ['RFn' 'Unf' 'Fin' 'Not Available']
Column 'GarageQual' has unique categories: ['TA' 'Fa' 'Gd' 'Not Available' 'Ex' 'Po']
Column 'GarageCond' has unique categories: ['TA' 'Fa' 'Not Available' 'Gd' 'Po' 'Ex']
Column 'PavedDrive' has unique categories: ['Y' 'N' 'P']
Column 'PoolQC' has unique categories: ['Not Available' 'Ex' 'Fa' 'Gd']
Column 'Fence' has unique categories: ['Not Available' 'MnPrv' 'GdWo' 'GdPrv' 'MnWw']
Column 'MiscFeature' has unique categories: ['Not Available' 'Shed' 'Gar2' 'Othr' 'TenC']
Column 'SaleType' has unique categories: ['WD' 'New' 'COD' 'ConLD' 'ConLI' 'CWD' 'ConLw' 'Con' 'Oth']
Column 'SaleCondition' has unique categories: ['Normal' 'Abnorml' 'Partial' 'AdjLand' 'Alloca' 'Family']
In [9]:
# Calculating missing value for all of the columns
missing_values = data.isna().sum()/data.shape[0]

# Filter columns with more than 0 missing values
columns_with_missing = missing_values[missing_values > 0].sort_values(ascending=False)

# Display the filtered result
print(columns_with_missing)
LotFrontage    0.177397
Electrical     0.000685
dtype: float64

The designated columns have been properly adjusted. Now we will go over to handle the remaining missing values in the 'LotFrontage' and 'Electrical' columns. From the information above, 'Electrical' only has 1 observation with missing value.

2.3.3 Removing observation with missing value in "LotFrontage" columns

In [10]:
data.dropna(subset=['LotFrontage'], inplace=True)

2.3.4 Imputing observation with missing value in 'Electrical' column

In [11]:
# Checking for the 'Electrical' column unique values
data['Electrical'].unique()
Out[11]:
array(['SBrkr', 'FuseF', 'FuseA', 'FuseP', 'Mix', nan], dtype=object)

'Electrical' column apparently has 5 different categories that describe the electrical system type for each observation. This info later confirmed by crosschecking to the dataset description.

We will build KNN Model to predict what kind of electrical system the observation with missing value in that column have.

KNN model will be built in the following sequences:

  1. Observation with NA value will be isolated for the prediction making. The KNN model will be trained using the remaining data that has the 'Electrical' output label.
  2. Training data (Subsetted to exclude the NA values) is then splitted into training and test set. Training set of numeric variables will be used to train the model while test set will be used to evaluate the model performance on several different k values.
  3. The best model with optimum k values will then be used to predict the 'Electrical' column output in the observation with missing value.

2.3.5 Subsetting the dataset to numeric variables only and add response column (Electrical)

In [12]:
# Creating master data for knn modeling
knn = data.select_dtypes(include=['int64', 'float64'])
knn.drop('Id', axis=1, inplace=True)
knn['Electrical'] = data['Electrical']

2.3.6 Separating the dataset from the observation with electrical missing value

In [13]:
# Creating subset of dataset to be predicted
electrical_missing = knn[knn['Electrical'].isna()]
electrical_missing.drop('Electrical', axis=1, inplace=True)
In [14]:
knn.dropna(subset=['Electrical'], inplace=True)

2.3.7 Preparing dataset for KNN Model building

In [15]:
knn['Electrical'].value_counts()
Out[15]:
SBrkr    1091
FuseA      80
FuseF      25
FuseP       3
Mix         1
Name: Electrical, dtype: int64

Since 'Mix' category within the Electrical column only has 1 observation, it will not be suitable for data partitioning for model building. For this analysis, we will drop the observation with 'Electrical' value == 'Mix'.

It is also noteworthy that the naive-rate for this classification process is around 90% that will predict the SBrkr as the output label for the missing value.

In [16]:
knn = knn[knn['Electrical'] != 'Mix']
In [17]:
# Subsetting the predictor and response columns
X = knn.drop('Electrical', axis=1)
y = knn['Electrical']

2.3.8 Training KNN Model on Training Dataset

In [18]:
## Partitioning dataset into training and test set
X_train,X_test,y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=256, stratify=y)
In [19]:
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
In [20]:
# Training KNN Model with initial k=3
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train_sc,y_train)
Out[20]:
KNeighborsClassifier(n_neighbors=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KNeighborsClassifier(n_neighbors=3)

2.3.9 Finding the optimum number of K through Accuracy measure

In [21]:
k_values = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
accuracy_results = []

for k in k_values:
    classifier = KNeighborsClassifier(n_neighbors=k, p=2, metric='euclidean')
    classifier.fit(X_train_sc, y_train)
    y_pred = classifier.predict(X_test_sc)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_results.append((k, accuracy))

print(accuracy_results)
[(3, 0.8805555555555555), (5, 0.8861111111111111), (7, 0.8972222222222223), (9, 0.9083333333333333), (11, 0.9138888888888889), (13, 0.9138888888888889), (15, 0.9138888888888889), (17, 0.9138888888888889), (19, 0.9138888888888889), (21, 0.9138888888888889)]

The accuracy measures returned that k=11 is the best k with the highest accuracy at 0.91. We will build the KNN model with k=11 and make prediction for the missing value.

In [22]:
classifier = KNeighborsClassifier(n_neighbors=11)
classifier.fit(X_train_sc,y_train)
Out[22]:
KNeighborsClassifier(n_neighbors=11)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KNeighborsClassifier(n_neighbors=11)

2.3.10 Making prediction on the missing value of Electrical column and imputing it to the original training dataset

In [23]:
scaler = StandardScaler()
electrical_missing_sc = scaler.fit_transform(electrical_missing)
y_pred_electrical = classifier.predict(electrical_missing_sc)
In [24]:
y_pred_electrical
Out[24]:
array(['SBrkr'], dtype=object)
In [25]:
data['Electrical'].fillna('SBrkr', inplace=True)
In [26]:
data.isna().sum().any()
Out[26]:
False

All missing values have been imputed and cleaned ready for the Multiple Linear Regression modeling to predict the sales price of each of the houses.

3. Preparing Training Dataset for EDA and MLR Modeling¶

3.1 Setting Id Column as Index¶

In [27]:
data.set_index('Id', inplace=True)

3.2 Subsetting the Training Dataset into Features and Response¶

In [28]:
# Features
X = data.drop(columns=['SalePrice'])
# Response
y = data['SalePrice']

3.3 Grouping The Numeric and Categorical Features For Ease of Analysis¶

In [29]:
numeric_features = X.select_dtypes(include=['int64', 'float64'])
categorical_features = X.select_dtypes(include=['object'])

Lets check whether the features have been properly grouped or not.

In [30]:
categorical_features.head()
Out[30]:
MSZoning Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Heating HeatingQC CentralAir Electrical KitchenQual Functional FireplaceQu GarageType GarageFinish GarageQual GarageCond PavedDrive PoolQC Fence MiscFeature SaleType SaleCondition
Id
1 RL Pave Not Available Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story Gable CompShg VinylSd VinylSd BrkFace Gd TA PConc Gd TA No GLQ Unf GasA Ex Y SBrkr Gd Typ Not Available Attchd RFn TA TA Y Not Available Not Available Not Available WD Normal
2 RL Pave Not Available Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story Gable CompShg MetalSd MetalSd None TA TA CBlock Gd TA Gd ALQ Unf GasA Ex Y SBrkr TA Typ TA Attchd RFn TA TA Y Not Available Not Available Not Available WD Normal
3 RL Pave Not Available IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story Gable CompShg VinylSd VinylSd BrkFace Gd TA PConc Gd TA Mn GLQ Unf GasA Ex Y SBrkr Gd Typ TA Attchd RFn TA TA Y Not Available Not Available Not Available WD Normal
4 RL Pave Not Available IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story Gable CompShg Wd Sdng Wd Shng None TA TA BrkTil TA Gd No ALQ Unf GasA Gd Y SBrkr Gd Typ Gd Detchd Unf TA TA Y Not Available Not Available Not Available WD Abnorml
5 RL Pave Not Available IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story Gable CompShg VinylSd VinylSd BrkFace Gd TA PConc Gd TA Av GLQ Unf GasA Ex Y SBrkr Gd Typ TA Attchd RFn TA TA Y Not Available Not Available Not Available WD Normal

Categorical features seem to be properly grouped after cross-checking it with the dataset description, now let`s check for the numeric features.

In [31]:
numeric_features.head()
Out[31]:
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
Id
1 60 65.0 8450 7 5 2003 2003 196.0 706 0 150 856 856 854 0 1710 1 0 2 1 3 1 8 0 2003.0 2 548 0 61 0 0 0 0 0 2 2008
2 20 80.0 9600 6 8 1976 1976 0.0 978 0 284 1262 1262 0 0 1262 0 1 2 0 3 1 6 1 1976.0 2 460 298 0 0 0 0 0 0 5 2007
3 60 68.0 11250 7 5 2001 2002 162.0 486 0 434 920 920 866 0 1786 1 0 2 1 3 1 6 1 2001.0 2 608 0 42 0 0 0 0 0 9 2008
4 70 60.0 9550 7 5 1915 1970 0.0 216 0 540 756 961 756 0 1717 1 0 1 0 3 1 7 1 1998.0 3 642 0 35 272 0 0 0 0 2 2006
5 60 84.0 14260 8 5 2000 2000 350.0 655 0 490 1145 1145 1053 0 2198 1 0 2 1 4 1 9 1 2000.0 3 836 192 84 0 0 0 0 0 12 2008

It seems that there are couples of features identified as numeric but are actually categorical/ordinal in nature as per the dataset description. That features are 'MSSubClass', 'OverallQual', 'OverallCond', and 'MoSold'.

For this analysis, 'MSSubClass', 'OverallQual', 'OverallCond' features will be converted to categorical and moved into categorical_features, while 'MoSold' will be removed as it is redundant information provided by 'YrSold'.

In [32]:
numeric_features[['MSSubClass', 'OverallQual', 'OverallCond']] = numeric_features[['MSSubClass', 'OverallQual', 'OverallCond']].astype('object')
categorical_features[['MSSubClass', 'OverallQual', 'OverallCond']] = numeric_features[['MSSubClass', 'OverallQual', 'OverallCond']]
numeric_features.drop(columns=['MoSold', 'MSSubClass', 'OverallQual', 'OverallCond'], inplace=True)

4. Exploratory Data Analysis¶

Now let us explore the relationship between the features and response to see whether there are any interesting or discernable pattern through the plotting of each of the features against the response.

For this objective, response will be plotted against numeric features and categorical features separately.

4.1 Sale Price vs Numerical Features Relationship¶

4.1.1 Sale Price vs Numerical Features Scatterplot

In [33]:
# Set the number of columns for the grid
num_columns = 4

# Calculate the number of rows needed to accommodate all features
num_features = numeric_features.shape[1]
num_rows = (num_features - 1) // num_columns + 1

# Create subplots with the specified number of rows and columns
fig, axes = plt.subplots(num_rows, num_columns, figsize=(num_columns * 5, num_rows * 5))

# Flatten the axes array if there is only one row
if num_rows == 1:
    axes = [axes]

# Plot pairwise scatterplots
for i, column in enumerate(numeric_features.columns):
    sns.scatterplot(x=column, y='SalePrice', data=numeric_features.join(y), ax=axes[i // num_columns, i % num_columns])

# Remove any empty subplots
for i in range(num_features, num_rows * num_columns):
    fig.delaxes(axes.flatten()[i])

plt.show()

From the plot above, we could see that the majority of the numerical features, although some of the features does not exhibit clear relationship, has somewhat linear relationship with the SalePrice, hinting that Multiple Linear Regression might be the good choice of algorithm to use to predict the response value. We will now check the correlation value between the SalePrice and Numeric Features to confirm the findings.

4.1.2 Sale Price vs Numerical Features Correlation

In [34]:
# Calculate the correlation matrix between SalePrice and Numerical Features
sale_numfeatures_corr = numeric_features.join(y).corr()
sale_numfeatures_corr = sale_numfeatures_corr['SalePrice'].abs().sort_values(ascending=False).drop('SalePrice')
sale_numfeatures_corr
Out[34]:
GrLivArea        0.703557
GarageCars       0.647302
GarageArea       0.631761
TotalBsmtSF      0.626977
1stFlrSF         0.620159
FullBath         0.566800
YearBuilt        0.538913
TotRmsAbvGrd     0.537215
YearRemodAdd     0.519471
MasVnrArea       0.492084
Fireplaces       0.479353
BsmtFinSF1       0.403195
LotFrontage      0.351799
WoodDeckSF       0.345054
OpenPorchSF      0.326530
LotArea          0.311416
2ndFlrSF         0.308957
HalfBath         0.284770
GarageYrBlt      0.262026
BsmtFullBath     0.237637
BsmtUnfSF        0.215044
BedroomAbvGr     0.164554
EnclosedPorch    0.164004
KitchenAbvGr     0.140626
ScreenPorch      0.121166
PoolArea         0.092924
MiscVal          0.049435
3SsnPorch        0.035755
BsmtHalfBath     0.028450
LowQualFinSF     0.022339
YrSold           0.020601
BsmtFinSF2       0.017682
Name: SalePrice, dtype: float64

The majority of numerical features show a relatively strong linear relationship, either positive or negative, with the SalePrice. However, there are a few features that seem to be indifferent, namely 'PoolArea', 'MiscVal', '3SsnPorch', 'BsmtHalfBath', 'LowQualFinSF', 'YrSold', and 'BsmtFinSF2'. These features will be taken into consideration for the feature selection in the later section.

Now we will investigate visually the relationship between the SalePrice and the categorical features to see whether there are any meaningful relationship between them.

4.2 Sale Price vs Categorical Features Relationship¶

In [35]:
# Set the number of columns for the grid
num_columns = 4

# Calculate the number of rows needed to accommodate all features
cat_features = categorical_features.shape[1]
num_rows = (cat_features - 1) // num_columns + 1

# Create subplots with the specified number of rows and columns
fig, axes = plt.subplots(num_rows, num_columns, figsize=(num_columns * 5, num_rows * 5))

# Flatten the axes array if there is only one row
if num_rows == 1:
    axes = [axes]

# Plot pairwise scatterplots
for i, column in enumerate(categorical_features.columns):
    sns.boxplot(x=column, y='SalePrice', data=categorical_features.join(y), ax=axes[i // num_columns, i % num_columns])

# Remove any empty subplots
for i in range(cat_features, num_rows * num_columns):
    fig.delaxes(axes.flatten()[i])

plt.show()

From the above boxplot we could observe that the majority of the categorical predictors seems to have some positive relationship with the SalePrice indicating that these features could be used to predict the difference in SalePrice given the various categorical features. However there are couple of predictors that might not be as useful to predict the SalePrice due to the feature only have one level or the difference between levels in that particular features does not differ very much, those features are: 'LandSlope' and 'Utilities'. These features will be taken into consideration for feature selection in the later section.

We will now check whether there are heavy correlation between numerical features that might cause multicollinearity when building the Multiple Linear Regression model.

4.3 Correlation Between Numerical Features¶

4.3.1 Pairwise Correlation Between Numerical Features

In [36]:
enable_plotly_in_cell

# Calculate the correlation matrix
correlation_matrix = numeric_features.corr()

# Create the heatmap using Plotly
fig = px.imshow(correlation_matrix, x=numeric_features.columns, y=numeric_features.columns,
                color_continuous_scale='RdBu', title='Correlation Heatmap')
fig.show()

Referring to the "Regression Diagnostics: Identifying Influential Data and Sources of Collinearity" book written by Belsley et al, we could use the correlation threshold of 0.7 to distinguish the independent features that might pose the risk of multicollinearity and based on this threshold, there are couple of features that might pose risk of multicollinearity which are 'GrLivArea' & 'TotRmsAbvGrd', '2ndFlrSF' & 'GrLivArea', 'TotalBsmtSF' & '1stFlrSF', 'GarageCars' & 'GarageArea'.

In order to confirm it, we will conduct Variance Inflation Factor (VIF) to validate the finding.

4.3.2 Variance Inflation Factor (VIF) Between Numerical Features

The Variance Inflation Factor (VIF) is a statistical measure used to assess multicollinearity in a regression model. Multicollinearity occurs when two or more independent variables are highly correlated, making it challenging to interpret their individual effects.

VIF quantifies how much the variance of an estimated regression coefficient increases when a particular predictor is included in the model compared to when it is excluded. A VIF of 1 indicates no multicollinearity, while values above 1 suggest increasing levels of multicollinearity.

The formula for VIF is VIF_i = 1 / (1 - R²_i)

The VIF value represents how much the variance of the coefficient of "X_i" is inflated due to multicollinearity. If there is no multicollinearity, the VIF will be 1, indicating that the variance of the coefficient is not affected by correlations with other predictors. However, as the correlation between "X_i" and other predictors increases, the VIF value will rise, indicating more severe multicollinearity.

Interpretation of VIF values:

  • VIF = 1: No multicollinearity (perfectly uncorrelated with other predictors).
  • VIF between 1 and 5: Low to moderate multicollinearity (usually acceptable).
  • VIF above 5: High multicollinearity (may require further investigation or remediation).

High VIF values indicate that a predictor is highly correlated with other predictors, making it challenging to interpret its independent contribution accurately. In such cases, addressing multicollinearity is important, such as removing correlated predictors or using dimensionality reduction techniques.

We will investigate the VIF for numerical features below.

In [37]:
# Create a DataFrame to store VIF results
vif_data = pd.DataFrame()
vif_data['Feature'] = numeric_features.columns

# Calculate VIF for each feature
vif_data['VIF'] = [variance_inflation_factor(numeric_features.values, i) for i in range(numeric_features.shape[1])]
pd.options.display.float_format = '{:.2f}'.format

# Display VIF results
print(vif_data.sort_values(by='VIF',ascending=False))
          Feature      VIF
9        1stFlrSF      inf
8     TotalBsmtSF      inf
12      GrLivArea      inf
11   LowQualFinSF      inf
10       2ndFlrSF      inf
5      BsmtFinSF1      inf
6      BsmtFinSF2      inf
7       BsmtUnfSF      inf
31         YrSold 19197.98
3    YearRemodAdd 16537.37
2       YearBuilt 12261.87
19   TotRmsAbvGrd    81.39
22     GarageCars    38.76
23     GarageArea    31.94
21    GarageYrBlt    30.19
18   KitchenAbvGr    29.82
17   BedroomAbvGr    29.37
15       FullBath    26.43
0     LotFrontage    14.94
13   BsmtFullBath     3.54
1         LotArea     3.48
16       HalfBath     3.37
20     Fireplaces     2.84
4      MasVnrArea     1.89
24     WoodDeckSF     1.88
25    OpenPorchSF     1.86
26  EnclosedPorch     1.45
28    ScreenPorch     1.22
14   BsmtHalfBath     1.19
29       PoolArea     1.16
30        MiscVal     1.09
27      3SsnPorch     1.04

VIF analysis showed that there are multiple of numerical features that exhibited very high VIF thus indicating high risk of multicollinearity when loaded into Multiple Linear Regression Model. Using threshold of 10 for VIF, these features poses high risk of multicollinearity: 1stFlrSF, TotalBsmtSF, GrLivArea, LowQualFinSF, 2ndFlrSF, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, YrSold, YearRemodAdd, YearBuilt, TotRmsAbvGrd, GarageCars, GarageArea, GarageYrBlt, KitchenAbvGr, BedroomAbvGr, FullBath, LotFrontage

Combining this information with the information from the pairwise correlation and the visual relation between response and features, I will remove these numerical features and re-perform VIF before loading it up into the MLR model: 'YrSold', 'LowQualFinSF', 'GrLivArea', 'TotRmsAbvGrd', 'TotalBsmtSF', '1stFlrSF','2ndFlrSF', 'BsmtFinSF2', 'GarageCars', 'YearRemodAdd', 'YearBuilt', and 'GarageYrBlt'.

We will also perform Stepwise Regression to also perform additional feature selection after fitting the dataset into MLR in later section.

In [38]:
numeric_features_new = numeric_features.drop(columns=['YrSold', 'LowQualFinSF', 'GrLivArea', 'TotRmsAbvGrd', 
                                                      'TotalBsmtSF', '1stFlrSF','2ndFlrSF', 'BsmtFinSF2', 
                                                      'GarageCars','YearRemodAdd','GarageYrBlt','YearBuilt'])
In [39]:
# Create a DataFrame to store VIF results
vif_data = pd.DataFrame()
vif_data['Feature'] = numeric_features_new.columns

# Calculate VIF for each feature
vif_data['VIF'] = [variance_inflation_factor(numeric_features_new.values, i) for i in range(numeric_features_new.shape[1])]
pd.options.display.float_format = '{:.2f}'.format

# Display VIF results
print(vif_data.sort_values(by='VIF',ascending=False))
          Feature   VIF
9    BedroomAbvGr 17.53
7        FullBath 14.63
10   KitchenAbvGr 14.47
0     LotFrontage 13.55
12     GarageArea  9.24
3      BsmtFinSF1  5.37
4       BsmtUnfSF  5.32
1         LotArea  3.34
5    BsmtFullBath  3.18
11     Fireplaces  2.49
8        HalfBath  1.90
13     WoodDeckSF  1.83
14    OpenPorchSF  1.76
2      MasVnrArea  1.73
15  EnclosedPorch  1.22
17    ScreenPorch  1.19
6    BsmtHalfBath  1.15
18       PoolArea  1.09
19        MiscVal  1.07
16      3SsnPorch  1.04

The VIF value for the selected numerical features seems low enough compared to the previous VIF value before the removal of the high VIF value features. For this analysis, I will use these combination of features to load into MLR and perform further feature selection using Stepwise Regression.

Please note that for categorical features, the complete feature selection will be done through the Stepwise Regression since there is no straightforward method that is analogous to the VIF in numerical features.

5. Multiple Linear Regression Modeling¶

We will now fit the MLR model using the dataset with features preliminary selected during the previous analysis and check for the model performance. For this analysis, this model will be referred to as base model

5.1 Preparing Dataset for MLR Modeling Given the Previous Analysis¶

In [40]:
# Defining Features
numeric_features.reset_index(drop=True, inplace=True)
categorical_features.reset_index(drop=True, inplace=True)
X = pd.concat([numeric_features, categorical_features], axis=1)
In [41]:
# Removing features as per previous analysis
X = X.drop(columns=['YrSold', 'LowQualFinSF', 'GrLivArea', 'TotRmsAbvGrd',
                       'TotalBsmtSF', '1stFlrSF','2ndFlrSF', 'BsmtFinSF2', 
                       'GarageCars','YearRemodAdd','GarageYrBlt','YearBuilt','Utilities', 'LandSlope'])
In [42]:
# Dummifying Categorical Features
X = pd.get_dummies(X, drop_first=True)
In [43]:
# Checking the number of Features after Dummification
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1201 entries, 0 to 1200
Columns: 264 entries, LotFrontage to OverallCond_9
dtypes: float64(2), int64(18), uint8(244)
memory usage: 474.0 KB
In [44]:
# Splitting the Features and Response into 70% Training and 30% Test Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=296)

5.2 Fitting the MLR using Training Set¶

In [45]:
# Training the MLR model on training set
base_model = LinearRegression()
base_model.fit(X_train,y_train)
Out[45]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [46]:
# Creating Prediction on test set
y_pred_base_model = base_model.predict(X_test)

5.3 Evaluating Model Performance¶

In [47]:
# Calculate performance metrics
mae = mean_absolute_error(y_test, y_pred_base_model)
mse = mean_squared_error(y_test, y_pred_base_model)
rmse = mean_squared_error(y_test, y_pred_base_model, squared=False)
r2 = r2_score(y_test, y_pred_base_model)

# Display the performance metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)
Mean Absolute Error (MAE): 24125.98192970704
Mean Squared Error (MSE): 3411215062.296772
Root Mean Squared Error (RMSE): 58405.60814080076
R-squared (R²): 0.474477543732041

The Multiple Linear Regression fitted on the selected numerical features previously screened using VIF and the rest of the categorical features performed quite decent but not quite satisfactory as it could only explain around 47% of the variability within the dataset.

We will now try to perform further feature selection using Stepwise Regression (Forward and Backward) using statsmodel and mlextend package in an attempt to decrease the RMSE as well as increasing the R-Squared.

6. Forward Stepwise Regression for Feature Selections¶

Stepwise regression is a feature selection technique used to build a regression model by iteratively selecting or removing features from the model based on certain statistical criteria. It aims to find the best subset of features that provides the most accurate predictions while avoiding overfitting.

There are two main types of stepwise regression: forward stepwise regression and backward stepwise regression.

1. Forward Stepwise Regression:

  • Start with an empty model that includes only the intercept.

  • Add one feature at a time to the model, selecting the feature that results in the best improvement in model fit (e.g., higher R-squared or lower error).

  • Continue adding features until a stopping criterion is met (e.g., no more statistically significant features to add or a predefined number of features is reached).

2. Backward Stepwise Regression:

  • Start with a full model that includes all features.

  • Remove one feature at a time from the model, selecting the feature to remove based on the least contribution to model fit (e.g., higher p-values or lower reduction in model performance).

  • Continue removing features until a stopping criterion is met (e.g., all remaining features are statistically significant or a predefined number of features is reached).

Both forward and backward stepwise regression use some statistical measure (e.g., adjusted R-squared, AIC, BIC, p-values) to evaluate the model's fit and the importance of each feature. The goal is to find a balance between model complexity and accuracy, ensuring that the final model has the best predictive power without unnecessary overfitting.

Please note that for this analysis, I will use R-squared as the statistical measure and significance level of 0.05 to define the selection on the subset of features and also use the Forward Stepwise Regression.

6.1 Forward Stepwise Regression using statsmodel Package

The MLR model built using the features selected from the Stepwise Regression using statsmodel package will be referred to as model_sm

In [48]:
# Reset the indices of X_train and y_train
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

# Add a constant column to X for the intercept term
X_train = sm.add_constant(X_train)

# Stepwise forward feature selection
def forward_feature_selection(X, y, significance_level=0.05):
    # Setting the initial condition
    remaining_features = set(X.columns)
    selected_features = []
    current_score = 0.0
    
    # Looping through all of the available predictors
    while len(remaining_features) > 0:
        scores_with_candidates = []
        # Fitting the linear regression and store the adjusted R-squared value with all available predictors one by one
        for feature in remaining_features:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[selected_features + [feature]]))).fit()
            score = model.rsquared
            scores_with_candidates.append((score, feature))
        
        # Sorting the score
        scores_with_candidates.sort(reverse=True)
        best_new_score, best_feature = scores_with_candidates.pop(0)
        
        # Choosing the combination of predictors based on the highest adjusted R-squared score
        if current_score < best_new_score:
            remaining_features.remove(best_feature)
            selected_features.append(best_feature)
            current_score = best_new_score
        else:
            break
    
    # Fitting the linear regression model with the best combination of features
    model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[selected_features]))).fit()
    return model, selected_features

**Please note that before running below code cells, the time taken to complete the computation was around 1626 seconds ~ 28 minutes**

In [49]:
# Record the start time
start_time = time.time()

# Perform forward stepwise regression on training set
model_forward, selected_features_forward = forward_feature_selection(X_train, y_train)

# Record the end time
end_time = time.time()

# Calculate the elapsed time
elapsed_time = end_time - start_time
print("")
print(f"Elapsed time: {elapsed_time:.6f} seconds")
Elapsed time: 1626.735968 seconds
In [50]:
# Print results
print("\nModel Summary:")
print(model_forward.summary())

print("")

print("Selected Features in Forward Stepwise Regression:")
print("")
print(selected_features_forward)
Model Summary:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.924
Method:                 Least Squares   F-statistic:                     42.21
Date:                Tue, 08 Aug 2023   Prob (F-statistic):          3.94e-271
Time:                        16:52:57   Log-Likelihood:                -9495.8
No. Observations:                 840   AIC:                         1.949e+04
Df Residuals:                     593   BIC:                         2.065e+04
Df Model:                         246                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                       -3.68e+06   7.33e+05     -5.022      0.000   -5.12e+06   -2.24e+06
GarageArea                    34.6927      8.254      4.203      0.000      18.481      50.904
ExterQual_TA               -4.024e+04   8722.476     -4.613      0.000   -5.74e+04   -2.31e+04
Fireplaces                  7257.9277   4021.812      1.805      0.072    -640.801    1.52e+04
OverallQual_10               4.91e+04      4e+04      1.228      0.220   -2.94e+04    1.28e+05
OverallQual_9               1.295e+04   3.86e+04      0.335      0.738   -6.29e+04    8.88e+04
OverallQual_8              -3034.4679    3.8e+04     -0.080      0.936   -7.76e+04    7.15e+04
FullBath                    1.885e+04   3186.617      5.914      0.000    1.26e+04    2.51e+04
BsmtFinSF1                    56.6231      5.956      9.506      0.000      44.925      68.321
Condition2_PosN            -4.485e+05   4.61e+04     -9.727      0.000   -5.39e+05   -3.58e+05
LotArea                        1.4904      0.331      4.501      0.000       0.840       2.141
HalfBath                    1.049e+04   2912.818      3.601      0.000    4769.542    1.62e+04
SaleType_New                2.092e+04   1.84e+04      1.136      0.257   -1.53e+04    5.71e+04
OverallQual_7              -2.173e+04   3.78e+04     -0.576      0.565   -9.59e+04    5.24e+04
Neighborhood_NoRidge        2.866e+04    1.2e+04      2.383      0.017    5038.792    5.23e+04
BsmtUnfSF                     37.1896      5.438      6.839      0.000      26.509      47.870
GarageType_BuiltIn          5.568e+04   1.85e+04      3.011      0.003    1.94e+04     9.2e+04
Neighborhood_Crawfor        8760.5370   1.22e+04      0.718      0.473   -1.52e+04    3.27e+04
BsmtFinType2_Unf           -3.216e+04   8939.675     -3.597      0.000   -4.97e+04   -1.46e+04
MSSubClass_60               1243.3843      1e+04      0.124      0.902   -1.85e+04     2.1e+04
OverallCond_7               1.167e+04   2.06e+04      0.565      0.572   -2.89e+04    5.22e+04
BsmtExposure_Gd             1.704e+04   4536.787      3.756      0.000    8130.382     2.6e+04
OverallCond_3              -2.712e+04   2.25e+04     -1.206      0.228   -7.13e+04     1.7e+04
KitchenAbvGr               -1.235e+04   8545.471     -1.445      0.149   -2.91e+04    4434.731
Neighborhood_StoneBr        3.073e+04   1.18e+04      2.603      0.009    7540.332    5.39e+04
Exterior1st_BrkFace         2.423e+04   1.67e+04      1.447      0.148   -8650.360    5.71e+04
OverallCond_8               1.268e+04   2.09e+04      0.606      0.545   -2.84e+04    5.38e+04
PoolArea                    7122.8701   1343.908      5.300      0.000    4483.472    9762.269
MasVnrArea                    20.3662      8.004      2.544      0.011       4.646      36.086
PoolQC_Not Available        3.711e+06   7.28e+05      5.097      0.000    2.28e+06    5.14e+06
PoolQC_Gd                  -3.402e+05   6.02e+04     -5.653      0.000   -4.58e+05   -2.22e+05
Foundation_CBlock           -498.9458   4868.473     -0.102      0.918   -1.01e+04    9062.601
OverallCond_6               3053.2308   2.06e+04      0.148      0.882   -3.74e+04    4.35e+04
BsmtFinType1_GLQ            1.066e+04   3778.382      2.822      0.005    3242.398    1.81e+04
BsmtQual_Gd                -1.503e+04   4692.259     -3.204      0.001   -2.42e+04   -5818.647
SaleCondition_Normal        9628.4954   4231.520      2.275      0.023    1317.907    1.79e+04
MSSubClass_70              -6981.5219   1.16e+04     -0.602      0.547   -2.97e+04    1.58e+04
BsmtFinType2_GLQ            9710.9154   1.23e+04      0.790      0.430   -1.44e+04    3.38e+04
Street_Pave                 2.345e+04    2.4e+04      0.978      0.328   -2.36e+04    7.05e+04
ExterQual_Gd               -3.842e+04   7791.062     -4.931      0.000   -5.37e+04   -2.31e+04
ExterQual_Fa               -4.328e+04   2.08e+04     -2.077      0.038   -8.42e+04   -2360.979
SaleCondition_Alloca         2.01e+04   1.26e+04      1.597      0.111   -4617.940    4.48e+04
LandContour_Low            -3.626e+04   1.36e+04     -2.661      0.008    -6.3e+04   -9494.401
RoofMatl_WdShngl            6.847e+04   1.71e+04      3.997      0.000    3.48e+04    1.02e+05
MasVnrType_Stone            1.271e+04   1.08e+04      1.180      0.238   -8441.470    3.39e+04
LotFrontage                  156.3754     67.226      2.326      0.020      24.345     288.406
LotConfig_CulDSac           1.969e+04   6212.621      3.169      0.002    7488.533    3.19e+04
MSSubClass_30              -6150.8261   6753.588     -0.911      0.363   -1.94e+04    7113.035
Neighborhood_Gilbert       -1.984e+04   1.09e+04     -1.814      0.070   -4.13e+04    1641.737
Exterior2nd_VinylSd         2997.3119   1.71e+04      0.175      0.861   -3.06e+04    3.66e+04
MSZoning_FV                 5.192e+04    1.9e+04      2.733      0.006    1.46e+04    8.92e+04
WoodDeckSF                    16.7247      9.502      1.760      0.079      -1.936      35.385
ScreenPorch                   32.7795     19.441      1.686      0.092      -5.401      70.960
Exterior1st_WdShing        -1.528e+04    1.7e+04     -0.897      0.370   -4.87e+04    1.82e+04
Neighborhood_SawyerW       -2235.8362   1.09e+04     -0.205      0.838   -2.37e+04    1.92e+04
Neighborhood_NridgHt       -3267.5009   1.03e+04     -0.317      0.751   -2.35e+04     1.7e+04
OverallCond_9               3.341e+04   2.48e+04      1.349      0.178   -1.52e+04    8.21e+04
Condition2_PosA            -4.328e+04   5.02e+04     -0.862      0.389   -1.42e+05    5.54e+04
Neighborhood_BrkSide        -1.43e+04   1.37e+04     -1.047      0.296   -4.11e+04    1.25e+04
Condition1_Norm             8529.0876   6311.433      1.351      0.177   -3866.393    2.09e+04
BedroomAbvGr                4393.3319   1894.693      2.319      0.021     672.206    8114.458
BsmtQual_TA                -1.159e+04   5937.657     -1.952      0.051   -2.33e+04      68.165
BsmtQual_Fa                -1.687e+04   9015.455     -1.871      0.062   -3.46e+04     837.097
RoofMatl_Tar&Grv           -3239.4401   2.76e+04     -0.117      0.907   -5.75e+04     5.1e+04
RoofStyle_Mansard            4.41e+04   3.46e+04      1.274      0.203   -2.39e+04    1.12e+05
Neighborhood_SWISU         -9792.0620   1.42e+04     -0.688      0.492   -3.77e+04    1.81e+04
Neighborhood_CollgCr       -1.266e+04   1.01e+04     -1.254      0.210   -3.25e+04    7172.089
BsmtFinType2_BLQ           -2.822e+04    1.1e+04     -2.569      0.010   -4.98e+04   -6642.014
BsmtCond_TA                 6863.3447   6153.101      1.115      0.265   -5221.177    1.89e+04
Neighborhood_Veenker         745.7354   1.56e+04      0.048      0.962      -3e+04    3.15e+04
Exterior1st_CemntBd        -1.436e+04    2.7e+04     -0.531      0.595   -6.74e+04    3.87e+04
Exterior1st_HdBoard        -3135.9166   1.67e+04     -0.187      0.852    -3.6e+04    2.98e+04
BsmtFinType2_Rec            -1.78e+04   1.02e+04     -1.752      0.080   -3.78e+04    2152.522
Foundation_Wood            -5.411e+04   2.16e+04     -2.503      0.013   -9.66e+04   -1.17e+04
GarageType_Basment          6.042e+04   1.98e+04      3.050      0.002    2.15e+04    9.93e+04
Exterior2nd_Plywood        -1083.7407   1.56e+04     -0.070      0.945   -3.17e+04    2.95e+04
KitchenQual_Gd             -1.718e+04   4592.874     -3.740      0.000   -2.62e+04   -8155.534
KitchenQual_TA             -1.634e+04   5253.060     -3.111      0.002   -2.67e+04   -6027.514
KitchenQual_Fa             -2.452e+04   8435.283     -2.907      0.004   -4.11e+04   -7955.132
SaleType_ConLD              3.122e+04   1.61e+04      1.944      0.052    -322.157    6.28e+04
MSSubClass_190             -3.106e+04   3.64e+04     -0.853      0.394   -1.03e+05    4.04e+04
Heating_Grav               -1.819e+04   1.95e+04     -0.931      0.352   -5.66e+04    2.02e+04
GarageQual_Fa               2.958e+04   3.61e+04      0.818      0.413   -4.14e+04    1.01e+05
3SsnPorch                    100.7975     46.202      2.182      0.030      10.057     191.538
Functional_Min2             -2.04e+04   1.31e+04     -1.553      0.121   -4.62e+04    5402.589
MSSubClass_90              -1.386e+04   1.06e+04     -1.302      0.194   -3.48e+04    7052.677
MSSubClass_50              -9178.6194   1.22e+04     -0.750      0.454   -3.32e+04    1.49e+04
Neighborhood_NPkVill       -6316.1920   2.09e+04     -0.302      0.762   -4.73e+04    3.47e+04
Neighborhood_BrDale         6847.6949   1.65e+04      0.415      0.678   -2.55e+04    3.92e+04
LotConfig_FR3              -2.077e+04    1.8e+04     -1.151      0.250   -5.62e+04    1.47e+04
Exterior2nd_Other          -3.129e+04   3.07e+04     -1.020      0.308   -9.15e+04     2.9e+04
Exterior1st_ImStucc        -3.629e+04   3.37e+04     -1.077      0.282   -1.02e+05    2.99e+04
GarageCond_TA              -1.475e+04   4.54e+04     -0.325      0.745   -1.04e+05    7.44e+04
Neighborhood_Edwards         -2.9e+04   1.12e+04     -2.590      0.010    -5.1e+04   -7007.572
LotConfig_FR2              -4632.7025   6560.036     -0.706      0.480   -1.75e+04    8251.028
PavedDrive_P               -8214.6972   9286.426     -0.885      0.377   -2.65e+04       1e+04
SaleCondition_AdjLand       2.183e+04   2.08e+04      1.051      0.294    -1.9e+04    6.26e+04
BsmtCond_Gd                 2241.2995   8101.576      0.277      0.782   -1.37e+04    1.82e+04
RoofStyle_Gambrel           4.598e+04   3.37e+04      1.366      0.172   -2.01e+04    1.12e+05
SaleType_Oth                1.321e+04   1.66e+04      0.796      0.427   -1.94e+04    4.58e+04
MSSubClass_85               -1.87e+04   1.61e+04     -1.165      0.245   -5.02e+04    1.28e+04
LotShape_Reg                2100.5104   2529.498      0.830      0.407   -2867.353    7068.374
Functional_Min1            -1.964e+04   1.38e+04     -1.426      0.154   -4.67e+04    7401.649
SaleType_CWD                1.775e+04   1.62e+04      1.096      0.273   -1.41e+04    4.96e+04
Functional_Maj2            -2.949e+04      2e+04     -1.477      0.140   -6.87e+04    9713.068
Exterior2nd_AsphShn        -1.572e+04   2.72e+04     -0.578      0.563   -6.91e+04    3.77e+04
Neighborhood_Sawyer        -1.123e+04   1.17e+04     -0.963      0.336   -3.42e+04    1.17e+04
Condition1_RRAe            -1.439e+04   1.18e+04     -1.221      0.223   -3.75e+04    8754.700
BsmtFinType1_Rec           -2881.5038   4605.507     -0.626      0.532   -1.19e+04    6163.585
Heating_Wall                -1.47e+04   3.04e+04     -0.483      0.629   -7.44e+04     4.5e+04
HeatingQC_TA               -5913.9539   3130.203     -1.889      0.059   -1.21e+04     233.678
Exterior1st_Stucco           1.14e+04   1.99e+04      0.573      0.567   -2.77e+04    5.05e+04
GarageQual_Gd               5.012e+04   3.81e+04      1.314      0.189   -2.48e+04    1.25e+05
GarageType_Detchd           4.671e+04   1.77e+04      2.643      0.008     1.2e+04    8.14e+04
FireplaceQu_Po             -1.636e+04   1.16e+04     -1.405      0.161   -3.92e+04    6511.465
Electrical_SBrkr           -3164.3123   4386.739     -0.721      0.471   -1.18e+04    5451.123
Exterior1st_Plywood        -1577.7946   1.64e+04     -0.096      0.923   -3.38e+04    3.06e+04
RoofMatl_Membran            7.621e+04   4.54e+04      1.678      0.094    -1.3e+04    1.65e+05
Neighborhood_ClearCr       -3.309e+04    1.4e+04     -2.358      0.019   -6.07e+04   -5527.750
SaleType_Con                1.276e+04   1.97e+04      0.647      0.518    -2.6e+04    5.15e+04
OpenPorchSF                   23.7145     19.005      1.248      0.213     -13.610      61.039
Exterior2nd_Brk Cmn         1.773e+04    2.8e+04      0.633      0.527   -3.72e+04    7.27e+04
Functional_Mod             -2.228e+04   1.66e+04     -1.343      0.180   -5.48e+04    1.03e+04
Neighborhood_Somerst       -2.698e+04   1.28e+04     -2.105      0.036   -5.21e+04   -1812.959
Neighborhood_Timber        -2.142e+04   1.15e+04     -1.868      0.062   -4.39e+04    1094.740
HouseStyle_2.5Fin          -1.916e+04   3.08e+04     -0.623      0.534   -7.96e+04    4.13e+04
GarageCond_Fa              -1.274e+04   4.51e+04     -0.283      0.778   -1.01e+05    7.58e+04
MSZoning_RL                 2.102e+04    1.7e+04      1.234      0.218   -1.24e+04    5.45e+04
FireplaceQu_TA             -1.188e+04   8144.896     -1.459      0.145   -2.79e+04    4115.580
Neighborhood_Blueste        1.462e+04    2.4e+04      0.609      0.543   -3.25e+04    6.17e+04
BsmtExposure_Mn            -5161.9351   4492.729     -1.149      0.251    -1.4e+04    3661.661
BsmtExposure_No            -3146.3313   3243.215     -0.970      0.332   -9515.917    3223.254
MSSubClass_80              -3.103e+04    1.7e+04     -1.825      0.068   -6.44e+04    2357.109
BsmtFinType2_Not Available -5.534e+04    2.7e+04     -2.047      0.041   -1.08e+05   -2242.609
BsmtQual_Not Available      4.116e+04   3.15e+04      1.308      0.191   -2.06e+04    1.03e+05
BsmtFinType2_LwQ           -1.492e+04   1.04e+04     -1.430      0.153   -3.54e+04    5576.840
SaleType_ConLw              -2.62e+04   2.02e+04     -1.297      0.195   -6.59e+04    1.35e+04
Condition1_PosN             1.838e+04   1.31e+04      1.402      0.161   -7363.443    4.41e+04
MSZoning_RH                 3.156e+04   2.06e+04      1.530      0.127   -8957.079    7.21e+04
Condition2_RRNn             1.915e+04   4.52e+04      0.423      0.672   -6.97e+04    1.08e+05
HouseStyle_1Story          -1.687e+04   1.19e+04     -1.423      0.155   -4.02e+04    6416.645
MSSubClass_160             -2.627e+04   1.32e+04     -1.996      0.046   -5.21e+04    -415.349
HouseStyle_SFoyer          -1.321e+04   1.69e+04     -0.784      0.433   -4.63e+04    1.99e+04
ExterCond_Fa                1.029e+04   2.55e+04      0.404      0.687   -3.98e+04    6.03e+04
Functional_Typ             -1.376e+04   1.17e+04     -1.176      0.240   -3.67e+04    9209.758
Exterior2nd_CBlock         -1.714e+04   3.43e+04     -0.500      0.617   -8.45e+04    5.02e+04
Neighborhood_MeadowV       -6632.8174   1.66e+04     -0.401      0.689   -3.91e+04    2.59e+04
MSSubClass_45              -1.389e+04   1.62e+04     -0.858      0.391   -4.57e+04    1.79e+04
GarageType_Attchd           4.334e+04   1.77e+04      2.442      0.015    8480.819    7.82e+04
GarageType_CarPort          4.716e+04   2.26e+04      2.086      0.037    2765.401    9.16e+04
GarageType_Not Available    7.418e+04   3.33e+04      2.224      0.027    8683.708     1.4e+05
MSZoning_RM                 1.777e+04   1.61e+04      1.101      0.271   -1.39e+04    4.95e+04
BsmtFinType1_LwQ           -2242.7798   5430.830     -0.413      0.680   -1.29e+04    8423.220
GarageCond_Po               1.721e+04   3.82e+04      0.450      0.653   -5.79e+04    9.23e+04
GarageCond_Gd              -1.187e+04   4.82e+04     -0.246      0.806   -1.07e+05    8.28e+04
Exterior2nd_BrkFace        -5777.9880   1.78e+04     -0.324      0.746   -4.08e+04    2.92e+04
CentralAir_Y                6352.4333   5670.502      1.120      0.263   -4784.276    1.75e+04
Foundation_Stone            1.503e+04   1.71e+04      0.880      0.379   -1.85e+04    4.86e+04
OverallQual_2               -1.92e+04   4.26e+04     -0.451      0.652   -1.03e+05    6.44e+04
Neighborhood_Mitchel       -2.296e+04   1.17e+04     -1.957      0.051    -4.6e+04      79.189
Exterior2nd_ImStucc         5594.7596   2.04e+04      0.275      0.783   -3.44e+04    4.56e+04
Heating_GasW                9018.8912   9742.849      0.926      0.355   -1.01e+04    2.82e+04
Foundation_PConc            2794.4173   5118.928      0.546      0.585   -7259.017    1.28e+04
LotShape_IR3                7617.4981   1.44e+04      0.531      0.596   -2.06e+04    3.58e+04
FireplaceQu_Fa             -1.093e+04   1.02e+04     -1.073      0.284   -3.09e+04    9078.362
Foundation_Slab             6343.4858   1.48e+04      0.428      0.669   -2.28e+04    3.55e+04
MasVnrType_Not Available   -4304.6510   1.55e+04     -0.278      0.781   -3.47e+04    2.61e+04
OverallQual_5              -3.652e+04   3.77e+04     -0.969      0.333   -1.11e+05    3.75e+04
FireplaceQu_Gd             -7949.3806   7857.964     -1.012      0.312   -2.34e+04    7483.445
FireplaceQu_Not Available  -7504.2968   9498.229     -0.790      0.430   -2.62e+04    1.11e+04
Condition2_Norm            -5383.5224   3.54e+04     -0.152      0.879    -7.5e+04    6.42e+04
OverallQual_3              -3.947e+04   3.76e+04     -1.051      0.294   -1.13e+05    3.43e+04
MSSubClass_180             -1.749e+04   1.82e+04     -0.962      0.336   -5.32e+04    1.82e+04
PavedDrive_Y                3985.3905   5088.038      0.783      0.434   -6007.376     1.4e+04
BsmtHalfBath               -1053.8131   4590.969     -0.230      0.819   -1.01e+04    7962.724
MSSubClass_75              -8260.0142   3.59e+04     -0.230      0.818   -7.87e+04    6.22e+04
RoofMatl_Roll              -1.416e+04   2.87e+04     -0.493      0.622   -7.06e+04    4.23e+04
GarageQual_TA               3.526e+04   3.66e+04      0.965      0.335   -3.65e+04    1.07e+05
Neighborhood_IDOTRR        -1.696e+04   1.55e+04     -1.095      0.274   -4.74e+04    1.35e+04
Electrical_FuseP            1.369e+04   2.26e+04      0.604      0.546   -3.08e+04    5.82e+04
Electrical_Mix              1.973e+04   4.55e+04      0.434      0.665   -6.96e+04    1.09e+05
HeatingQC_Gd               -1537.6994   3139.766     -0.490      0.624   -7704.114    4628.715
Neighborhood_NAmes         -1.901e+04    1.1e+04     -1.725      0.085   -4.07e+04    2638.468
Neighborhood_NWAmes        -1.758e+04   1.17e+04     -1.503      0.133   -4.06e+04    5398.718
Neighborhood_OldTown       -2.134e+04   1.36e+04     -1.574      0.116    -4.8e+04    5285.192
MSSubClass_120             -4919.5792   5835.088     -0.843      0.400   -1.64e+04    6540.373
MasVnrType_None             1877.7729   1.03e+04      0.183      0.855   -1.83e+04     2.2e+04
LotShape_IR2               -4161.3813   7913.901     -0.526      0.599   -1.97e+04    1.14e+04
Electrical_FuseF            1640.2051   9871.385      0.166      0.868   -1.77e+04     2.1e+04
LotConfig_Inside            1312.2090   2789.788      0.470      0.638   -4166.857    6791.275
HouseStyle_SLvl             7730.2562   1.87e+04      0.414      0.679    -2.9e+04    4.44e+04
Condition1_RRNn             1.094e+04   2.31e+04      0.473      0.637   -3.45e+04    5.64e+04
Fence_MnWw                  7042.9008   1.18e+04      0.595      0.552   -1.62e+04    3.03e+04
RoofStyle_Gable             2.333e+04   3.12e+04      0.747      0.455    -3.8e+04    8.47e+04
RoofStyle_Hip               2.275e+04   3.14e+04      0.725      0.469   -3.89e+04    8.44e+04
Exterior2nd_MetalSd        -6478.9963   1.83e+04     -0.355      0.723   -4.24e+04    2.94e+04
Exterior1st_MetalSd         6606.5230   1.87e+04      0.353      0.724   -3.01e+04    4.33e+04
GarageFinish_RFn           -1108.8461   2939.761     -0.377      0.706   -6882.455    4664.763
Exterior1st_VinylSd        -2319.6144   1.78e+04     -0.130      0.896   -3.73e+04    3.26e+04
SaleCondition_Partial       8491.0255   1.78e+04      0.478      0.633   -2.64e+04    4.34e+04
BsmtFinType1_Unf            2562.9484   4361.698      0.588      0.557   -6003.306    1.11e+04
SaleCondition_Family       -2426.8889   9360.041     -0.259      0.796   -2.08e+04     1.6e+04
Heating_OthW                6511.6278   2.17e+04      0.300      0.764   -3.61e+04    4.91e+04
OverallCond_5              -4417.1357   2.04e+04     -0.216      0.829   -4.46e+04    3.57e+04
MiscFeature_Shed            2800.6196   8250.910      0.339      0.734   -1.34e+04     1.9e+04
BsmtFinType1_BLQ            1630.6622   4153.197      0.393      0.695   -6526.102    9787.426
Exterior2nd_Stone          -9491.1421    3.1e+04     -0.306      0.760   -7.04e+04    5.14e+04
SaleType_ConLI             -4107.2633   1.68e+04     -0.245      0.807   -3.71e+04    2.88e+04
SaleType_WD                -1345.2472   5954.210     -0.226      0.821    -1.3e+04    1.03e+04
LandContour_Lvl            -2009.9204   6245.764     -0.322      0.748   -1.43e+04    1.03e+04
Fence_GdWo                  3860.9759   8073.026      0.478      0.633    -1.2e+04    1.97e+04
BsmtFullBath                1133.4017   2988.423      0.379      0.705   -4735.778    7002.582
Condition1_PosA             4285.8753   1.84e+04      0.232      0.816   -3.19e+04    4.05e+04
Exterior2nd_Wd Shng         2344.0078   1.61e+04      0.146      0.884   -2.93e+04     3.4e+04
MiscVal                       -1.3167      9.170     -0.144      0.886     -19.326      16.693
ExterCond_Gd                2.347e+04   2.32e+04      1.011      0.313   -2.21e+04    6.91e+04
ExterCond_TA                2.309e+04   2.33e+04      0.991      0.322   -2.27e+04    6.89e+04
Fence_Not Available         2671.0758   6026.979      0.443      0.658   -9165.745    1.45e+04
Fence_MnPrv                 2460.5655   6530.466      0.377      0.706   -1.04e+04    1.53e+04
LandContour_HLS            -1311.2213   7976.945     -0.164      0.869    -1.7e+04    1.44e+04
HouseStyle_2.5Unf          -4611.3336   3.82e+04     -0.121      0.904   -7.97e+04    7.04e+04
HeatingQC_Po                6149.3821   3.09e+04      0.199      0.842   -5.46e+04    6.69e+04
Exterior2nd_Wd Sdng         -627.0256   1.57e+04     -0.040      0.968   -3.14e+04    3.01e+04
Condition1_RRNe             3696.5835   2.57e+04      0.144      0.885   -4.67e+04    5.41e+04
Condition2_Feedr            8869.8999   3.89e+04      0.228      0.820   -6.76e+04    8.54e+04
Exterior2nd_Stucco          -129.0002   1.79e+04     -0.007      0.994   -3.54e+04    3.51e+04
Condition1_RRAn             1177.6678      1e+04      0.117      0.907   -1.85e+04    2.09e+04
Condition1_Feedr             438.8624   7746.471      0.057      0.955   -1.48e+04    1.57e+04
HouseStyle_2Story            626.6116   1.09e+04      0.057      0.954   -2.08e+04    2.21e+04
OverallQual_6              -3.426e+04   3.77e+04     -0.908      0.364   -1.08e+05    3.99e+04
OverallQual_4              -3.383e+04   3.75e+04     -0.902      0.367   -1.07e+05    3.98e+04
BldgType_2fmCon             7399.6033   3.52e+04      0.210      0.834   -6.17e+04    7.65e+04
MiscFeature_Othr            6572.0973    2.9e+04      0.226      0.821   -5.04e+04    6.36e+04
Exterior1st_BrkComm         6741.5577   3.86e+04      0.175      0.861   -6.91e+04    8.26e+04
HeatingQC_Fa                -939.0310   7228.257     -0.130      0.897   -1.51e+04    1.33e+04
MSSubClass_40              -2922.1790   2.09e+04     -0.140      0.889   -4.39e+04    3.81e+04
Alley_Pave                  1220.1383   9798.611      0.125      0.901    -1.8e+04    2.05e+04
Exterior1st_Wd Sdng         1981.2473   1.63e+04      0.122      0.903      -3e+04    3.39e+04
OverallCond_4              -2290.6565   2.09e+04     -0.109      0.913   -4.34e+04    3.88e+04
Exterior2nd_HdBoard        -1351.0716   1.64e+04     -0.082      0.934   -3.36e+04    3.09e+04
BldgType_Twnhs               666.3158   6918.098      0.096      0.923   -1.29e+04    1.43e+04
Alley_Not Available          414.5243   6497.530      0.064      0.949   -1.23e+04    1.32e+04
GarageFinish_Unf            -235.5271   3712.022     -0.063      0.949   -7525.836    7054.782
EnclosedPorch                  1.2126     18.825      0.064      0.949     -35.759      38.184
RoofMatl_WdShake            1016.2752   2.72e+04      0.037      0.970   -5.25e+04    5.45e+04
MasVnrType_BrkFace          -250.6285   1.03e+04     -0.024      0.981   -2.06e+04    2.01e+04
Exterior2nd_CmentBd          595.0714    2.6e+04      0.023      0.982   -5.05e+04    5.17e+04
PoolQC_Fa                           0          0        nan        nan           0           0
==============================================================================
Omnibus:                      165.594   Durbin-Watson:                   2.012
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1631.531
Skew:                           0.580   Prob(JB):                         0.00
Kurtosis:                       9.728   Cond. No.                     1.14e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 7.97e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Selected Features in Forward Stepwise Regression:

['GarageArea', 'ExterQual_TA', 'Fireplaces', 'OverallQual_10', 'OverallQual_9', 'OverallQual_8', 'FullBath', 'BsmtFinSF1', 'Condition2_PosN', 'LotArea', 'HalfBath', 'SaleType_New', 'OverallQual_7', 'Neighborhood_NoRidge', 'BsmtUnfSF', 'GarageType_BuiltIn', 'Neighborhood_Crawfor', 'BsmtFinType2_Unf', 'MSSubClass_60', 'OverallCond_7', 'BsmtExposure_Gd', 'OverallCond_3', 'KitchenAbvGr', 'Neighborhood_StoneBr', 'Exterior1st_BrkFace', 'OverallCond_8', 'PoolArea', 'MasVnrArea', 'PoolQC_Not Available', 'PoolQC_Gd', 'Foundation_CBlock', 'OverallCond_6', 'BsmtFinType1_GLQ', 'BsmtQual_Gd', 'SaleCondition_Normal', 'MSSubClass_70', 'BsmtFinType2_GLQ', 'Street_Pave', 'ExterQual_Gd', 'ExterQual_Fa', 'SaleCondition_Alloca', 'LandContour_Low', 'RoofMatl_WdShngl', 'MasVnrType_Stone', 'LotFrontage', 'LotConfig_CulDSac', 'MSSubClass_30', 'Neighborhood_Gilbert', 'Exterior2nd_VinylSd', 'MSZoning_FV', 'WoodDeckSF', 'ScreenPorch', 'Exterior1st_WdShing', 'Neighborhood_SawyerW', 'Neighborhood_NridgHt', 'OverallCond_9', 'Condition2_PosA', 'Neighborhood_BrkSide', 'Condition1_Norm', 'BedroomAbvGr', 'BsmtQual_TA', 'BsmtQual_Fa', 'RoofMatl_Tar&Grv', 'RoofStyle_Mansard', 'Neighborhood_SWISU', 'Neighborhood_CollgCr', 'BsmtFinType2_BLQ', 'BsmtCond_TA', 'Neighborhood_Veenker', 'Exterior1st_CemntBd', 'Exterior1st_HdBoard', 'BsmtFinType2_Rec', 'Foundation_Wood', 'GarageType_Basment', 'Exterior2nd_Plywood', 'KitchenQual_Gd', 'KitchenQual_TA', 'KitchenQual_Fa', 'SaleType_ConLD', 'MSSubClass_190', 'Heating_Grav', 'GarageQual_Fa', '3SsnPorch', 'Functional_Min2', 'MSSubClass_90', 'MSSubClass_50', 'Neighborhood_NPkVill', 'Neighborhood_BrDale', 'LotConfig_FR3', 'Exterior2nd_Other', 'Exterior1st_ImStucc', 'GarageCond_TA', 'Neighborhood_Edwards', 'LotConfig_FR2', 'PavedDrive_P', 'SaleCondition_AdjLand', 'BsmtCond_Gd', 'RoofStyle_Gambrel', 'SaleType_Oth', 'MSSubClass_85', 'LotShape_Reg', 'Functional_Min1', 'SaleType_CWD', 'Functional_Maj2', 'Exterior2nd_AsphShn', 'Neighborhood_Sawyer', 'Condition1_RRAe', 'BsmtFinType1_Rec', 'Heating_Wall', 'HeatingQC_TA', 'Exterior1st_Stucco', 'GarageQual_Gd', 'GarageType_Detchd', 'FireplaceQu_Po', 'Electrical_SBrkr', 'Exterior1st_Plywood', 'RoofMatl_Membran', 'Neighborhood_ClearCr', 'SaleType_Con', 'OpenPorchSF', 'Exterior2nd_Brk Cmn', 'Functional_Mod', 'Neighborhood_Somerst', 'Neighborhood_Timber', 'HouseStyle_2.5Fin', 'GarageCond_Fa', 'MSZoning_RL', 'FireplaceQu_TA', 'Neighborhood_Blueste', 'BsmtExposure_Mn', 'BsmtExposure_No', 'MSSubClass_80', 'BsmtFinType2_Not Available', 'BsmtQual_Not Available', 'BsmtFinType2_LwQ', 'SaleType_ConLw', 'Condition1_PosN', 'MSZoning_RH', 'Condition2_RRNn', 'HouseStyle_1Story', 'MSSubClass_160', 'HouseStyle_SFoyer', 'ExterCond_Fa', 'Functional_Typ', 'Exterior2nd_CBlock', 'Neighborhood_MeadowV', 'MSSubClass_45', 'GarageType_Attchd', 'GarageType_CarPort', 'GarageType_Not Available', 'MSZoning_RM', 'BsmtFinType1_LwQ', 'GarageCond_Po', 'GarageCond_Gd', 'Exterior2nd_BrkFace', 'CentralAir_Y', 'Foundation_Stone', 'OverallQual_2', 'Neighborhood_Mitchel', 'Exterior2nd_ImStucc', 'Heating_GasW', 'Foundation_PConc', 'LotShape_IR3', 'FireplaceQu_Fa', 'Foundation_Slab', 'MasVnrType_Not Available', 'OverallQual_5', 'FireplaceQu_Gd', 'FireplaceQu_Not Available', 'Condition2_Norm', 'OverallQual_3', 'MSSubClass_180', 'PavedDrive_Y', 'BsmtHalfBath', 'MSSubClass_75', 'RoofMatl_Roll', 'GarageQual_TA', 'Neighborhood_IDOTRR', 'Electrical_FuseP', 'Electrical_Mix', 'HeatingQC_Gd', 'Neighborhood_NAmes', 'Neighborhood_NWAmes', 'Neighborhood_OldTown', 'MSSubClass_120', 'MasVnrType_None', 'LotShape_IR2', 'Electrical_FuseF', 'LotConfig_Inside', 'HouseStyle_SLvl', 'Condition1_RRNn', 'Fence_MnWw', 'RoofStyle_Gable', 'RoofStyle_Hip', 'Exterior2nd_MetalSd', 'Exterior1st_MetalSd', 'GarageFinish_RFn', 'Exterior1st_VinylSd', 'SaleCondition_Partial', 'BsmtFinType1_Unf', 'SaleCondition_Family', 'Heating_OthW', 'OverallCond_5', 'MiscFeature_Shed', 'BsmtFinType1_BLQ', 'Exterior2nd_Stone', 'SaleType_ConLI', 'SaleType_WD', 'LandContour_Lvl', 'Fence_GdWo', 'BsmtFullBath', 'Condition1_PosA', 'Exterior2nd_Wd Shng', 'MiscVal', 'ExterCond_Gd', 'ExterCond_TA', 'Fence_Not Available', 'Fence_MnPrv', 'LandContour_HLS', 'HouseStyle_2.5Unf', 'HeatingQC_Po', 'Exterior2nd_Wd Sdng', 'Condition1_RRNe', 'Condition2_Feedr', 'Exterior2nd_Stucco', 'Condition1_RRAn', 'Condition1_Feedr', 'HouseStyle_2Story', 'OverallQual_6', 'OverallQual_4', 'BldgType_2fmCon', 'MiscFeature_Othr', 'Exterior1st_BrkComm', 'HeatingQC_Fa', 'MSSubClass_40', 'Alley_Pave', 'Exterior1st_Wd Sdng', 'OverallCond_4', 'Exterior2nd_HdBoard', 'BldgType_Twnhs', 'Alley_Not Available', 'GarageFinish_Unf', 'EnclosedPorch', 'RoofMatl_WdShake', 'MasVnrType_BrkFace', 'Exterior2nd_CmentBd', 'PoolQC_Fa']

The number of predictors decreased from previously 264 (after dummification of categorical features) into 246. We will now build another MLR model with the combination of the best features and evaluate the model performance against the test set.

In [51]:
# Subsetting the Predictors using the best combination of features
X_train_sm = X_train[selected_features_forward]
X_test_sm = X_test[selected_features_forward]
In [52]:
model_sm = LinearRegression()
model_sm.fit(X_train_sm,y_train)
Out[52]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [53]:
# Creating Prediction on test set
y_pred_sm = model_sm.predict(X_test_sm)
In [54]:
# Calculate performance metrics
mae = mean_absolute_error(y_test, y_pred_sm)
mse = mean_squared_error(y_test, y_pred_sm)
rmse = mean_squared_error(y_test, y_pred_sm, squared=False)
r2 = r2_score(y_test, y_pred_sm)

# Display the performance metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)
Mean Absolute Error (MAE): 24104.967342728156
Mean Squared Error (MSE): 3408762712.2648325
Root Mean Squared Error (RMSE): 58384.610234759915
R-squared (R²): 0.47485534606607094

The MLR model built using features as a result from feature selection of stepwise regression using statsmodel package returned a relatively same model performance. R-Squared value increased only around 0.1% from the model without stepwise regression feature selection.

We will now try to use mlextend package to perform feature selection and compare the model performance.

6.2 Forward Stepwise Regression using mlextend Package

The MLR model built using the features selected from the Stepwise Regression using mlextend package will be referred to as model_ml

**Please note that below code cell takes approximately 893 seconds to run ~ 15 mins**

In [55]:
# Record the start time
start_time = time.time()

# Stepwise forward feature selection
model_forward = LinearRegression()
sfs_forward = SequentialFeatureSelector(model_forward,
                                        k_features='best',
                                        forward=True,
                                        scoring='r2',
                                        cv=5)
sfs_forward.fit(X_train, y_train)
selected_features_forward = list(X_train.columns[list(sfs_forward.k_feature_idx_)])

print("Forward Stepwise Regression:")
print("Selected Features:", selected_features_forward)

# Record the end time
end_time = time.time()

# Calculate the elapsed time
elapsed_time = end_time - start_time
print("")
print(f"Elapsed time: {elapsed_time:.6f} seconds")
Forward Stepwise Regression:
Selected Features: ['const', 'LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'Fireplaces', 'GarageArea', 'WoodDeckSF', '3SsnPorch', 'ScreenPorch', 'MSZoning_FV', 'MSZoning_RL', 'Street_Pave', 'LandContour_Low', 'LotConfig_CulDSac', 'LotConfig_FR2', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_StoneBr', 'Condition1_Norm', 'Condition1_RRAn', 'Condition2_Norm', 'Condition2_PosN', 'Condition2_RRNn', 'BldgType_TwnhsE', 'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2Story', 'HouseStyle_SLvl', 'RoofStyle_Mansard', 'RoofMatl_WdShngl', 'Exterior1st_BrkFace', 'Exterior1st_Wd Sdng', 'Exterior1st_WdShing', 'Exterior2nd_Brk Cmn', 'Exterior2nd_CmentBd', 'Exterior2nd_Other', 'Exterior2nd_Stucco', 'Exterior2nd_VinylSd', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA', 'ExterCond_Fa', 'Foundation_PConc', 'Foundation_Wood', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtExposure_Gd', 'BsmtExposure_Mn', 'BsmtExposure_No', 'BsmtFinType1_BLQ', 'BsmtFinType1_GLQ', 'BsmtFinType1_LwQ', 'BsmtFinType2_BLQ', 'BsmtFinType2_GLQ', 'BsmtFinType2_Unf', 'Heating_Grav', 'Heating_Wall', 'HeatingQC_Gd', 'HeatingQC_Po', 'HeatingQC_TA', 'Electrical_FuseF', 'Electrical_Mix', 'KitchenQual_Gd', 'Functional_Min1', 'FireplaceQu_Not Available', 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Not Available', 'GarageFinish_Not Available', 'Fence_MnWw', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_New', 'SaleCondition_Normal', 'MSSubClass_30', 'MSSubClass_85', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_190', 'OverallQual_6', 'OverallQual_7', 'OverallQual_8', 'OverallQual_9', 'OverallQual_10', 'OverallCond_3', 'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7']

Elapsed time: 893.863827 seconds
In [56]:
len(selected_features_forward)
Out[56]:
104

The number of predictors decreased from previously 264 (after dummification of categorical features) into 103. We will now build another MLR model with the combination of the best features and evaluate the model performance against the test set.

In [57]:
features = selected_features_forward[1:]
X_train_ml = X_train[features]
X_test_ml = X_test[features]
In [58]:
model_ml = LinearRegression()
model_ml.fit(X_train_ml,y_train)
Out[58]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [59]:
# Creating Prediction on test set
y_pred_ml = model_ml.predict(X_test_ml)
In [60]:
# Calculate performance metrics
mae = mean_absolute_error(y_test, y_pred_ml)
mse = mean_squared_error(y_test, y_pred_ml)
rmse = mean_squared_error(y_test, y_pred_ml, squared=False)
r2 = r2_score(y_test, y_pred_ml)

# Display the performance metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)
Mean Absolute Error (MAE): 22863.972546783436
Mean Squared Error (MSE): 2491410398.651759
Root Mean Squared Error (RMSE): 49914.03007824312
R-squared (R²): 0.6161801327795906

The features resulted from the mlextend package stepwise regression returned an even higher R-squared value from the base MLR and the statsmodel feature selection MLR with around 32% increase from the base MLR and 31% from the statsmodel feature selection MLR.

7. Polynomial Multiple Linear Regression¶

Other than the feature selection using stepwise regression and VIF that previously discussed, we will now try to fit another MLR model using Polynomial of degree 2 using all of the features. This is due to the fact that there are couples of numerical predictor that seemingly have quadratic relationship with the SalePrice. This model will be referred to as model_pl.

7.1 Preparing Dataset for Polynomial MLR

In [61]:
quadratic = PolynomialFeatures(2, include_bias=False)
X_train_quadratic = quadratic.fit_transform(X_train.drop(columns='const'))
X_test_quadratic = quadratic.transform(X_test)

7.2 Fitting the Linear Model on the Transformed Features

In [62]:
model_pl = LinearRegression()
model_pl.fit(X_train_quadratic, y_train)
Out[62]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()

7.3 Evaluating the Model Performance on Test Set

In [63]:
y_pred_pl = model_pl.predict(X_test_quadratic)
In [64]:
# Calculate performance metrics
mae = mean_absolute_error(y_test, y_pred_pl)
mse = mean_squared_error(y_test, y_pred_pl)
rmse = mean_squared_error(y_test, y_pred_pl, squared=False)
r2 = r2_score(y_test, y_pred_pl)

# Display the performance metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)
Mean Absolute Error (MAE): 78113.00830424526
Mean Squared Error (MSE): 195076371198.62335
Root Mean Squared Error (RMSE): 441674.5082055601
R-squared (R²): -29.05293183805185

The model_pl fitted using Polynomial of Degree 2 MLR does not perform very well indicated by the much higher error compared to the previous three models. It might implies that the features and its relationship with the response could not be well explained using the Polynomial of Degree 2 (quadratic).

8. Summary of Model Performance¶

We have evaluated the performance of three different regression models: base_model, model_sm, and model_ml. Here is a summary of their performance metrics:

Model Mean Absolute Error (MAE) Mean Squared Error (MSE) Root Mean Squared Error (RMSE) R-squared (R²)
base_model 24125.98 3411215062.30 58405.61 0.47
model_sm 24104.97 3408762712.26 58384.61 0.47
model_ml 22863.97 2491410398.65 49914.03 0.62
model_pl 78113.00 195076371198.6 441674.50 -29.0

Among these models, the model_ml exhibits the lowest Mean Absolute Error and Root Mean Squared Error, along with the highest R-squared value. This suggests that the model_ml performs better compared to the other two models in terms of predictive accuracy.

The features used in by the best model are as follow:

In [65]:
best_features = selected_features_forward[1:]
In [66]:
print(best_features)
['LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'Fireplaces', 'GarageArea', 'WoodDeckSF', '3SsnPorch', 'ScreenPorch', 'MSZoning_FV', 'MSZoning_RL', 'Street_Pave', 'LandContour_Low', 'LotConfig_CulDSac', 'LotConfig_FR2', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_StoneBr', 'Condition1_Norm', 'Condition1_RRAn', 'Condition2_Norm', 'Condition2_PosN', 'Condition2_RRNn', 'BldgType_TwnhsE', 'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2Story', 'HouseStyle_SLvl', 'RoofStyle_Mansard', 'RoofMatl_WdShngl', 'Exterior1st_BrkFace', 'Exterior1st_Wd Sdng', 'Exterior1st_WdShing', 'Exterior2nd_Brk Cmn', 'Exterior2nd_CmentBd', 'Exterior2nd_Other', 'Exterior2nd_Stucco', 'Exterior2nd_VinylSd', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA', 'ExterCond_Fa', 'Foundation_PConc', 'Foundation_Wood', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtExposure_Gd', 'BsmtExposure_Mn', 'BsmtExposure_No', 'BsmtFinType1_BLQ', 'BsmtFinType1_GLQ', 'BsmtFinType1_LwQ', 'BsmtFinType2_BLQ', 'BsmtFinType2_GLQ', 'BsmtFinType2_Unf', 'Heating_Grav', 'Heating_Wall', 'HeatingQC_Gd', 'HeatingQC_Po', 'HeatingQC_TA', 'Electrical_FuseF', 'Electrical_Mix', 'KitchenQual_Gd', 'Functional_Min1', 'FireplaceQu_Not Available', 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Not Available', 'GarageFinish_Not Available', 'Fence_MnWw', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_New', 'SaleCondition_Normal', 'MSSubClass_30', 'MSSubClass_85', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_190', 'OverallQual_6', 'OverallQual_7', 'OverallQual_8', 'OverallQual_9', 'OverallQual_10', 'OverallCond_3', 'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7']

9. Sale Price Prediction on Ames Housing Test Set¶

We will now try to predict the House Sale Price using the feature information given in the test set.

9.1 Importing and Adjusting Dataset

In [67]:
test_set = pd.read_csv('test.csv')
In [68]:
test_set.set_index('Id', inplace=True)
In [69]:
test_set[['MSSubClass', 'OverallQual', 'OverallCond']] = test_set[['MSSubClass', 'OverallQual', 'OverallCond']].astype('object')

9.2 Handling Missing Values

In [70]:
numerical_columns = test_set.select_dtypes(include=['int64', 'float64'])
categorical_columns = test_set.select_dtypes(include=['object'])
In [71]:
categorical_columns.isna().sum()
Out[71]:
MSSubClass          0
MSZoning            4
Street              0
Alley            1352
LotShape            0
LandContour         0
Utilities           2
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
RoofStyle           0
RoofMatl            0
Exterior1st         1
Exterior2nd         1
MasVnrType         16
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           44
BsmtCond           45
BsmtExposure       44
BsmtFinType1       42
BsmtFinType2       42
Heating             0
HeatingQC           0
CentralAir          0
Electrical          0
KitchenQual         1
Functional          2
FireplaceQu       730
GarageType         76
GarageFinish       78
GarageQual         78
GarageCond         78
PavedDrive          0
PoolQC           1456
Fence            1169
MiscFeature      1408
SaleType            1
SaleCondition       0
dtype: int64
In [72]:
not_available = ['Alley','BsmtQual','BsmtCond','BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu', 
                 'GarageType','GarageFinish', 'GarageQual', 'GarageCond','PoolQC', 'Fence', 'MiscFeature'] 

for col in not_available:
    categorical_columns[col].fillna('Not Available', inplace=True)
In [73]:
categorical_imputer = SimpleImputer(strategy='most_frequent')

real_missing = ['MSZoning', 'Utilities', 'Exterior1st','Exterior2nd', 'MasVnrType', 'KitchenQual', 'Functional', 'SaleType']
categorical_columns[real_missing] = categorical_imputer.fit_transform(categorical_columns[real_missing])
In [74]:
for col in categorical_columns:
    num_categories = categorical_columns[col].unique()
    print(f"Column '{col}' has unique categories: {num_categories}")
Column 'MSSubClass' has unique categories: [20 60 120 160 80 30 50 90 85 190 45 70 75 180 40 150]
Column 'MSZoning' has unique categories: ['RH' 'RL' 'RM' 'FV' 'C (all)']
Column 'Street' has unique categories: ['Pave' 'Grvl']
Column 'Alley' has unique categories: ['Not Available' 'Pave' 'Grvl']
Column 'LotShape' has unique categories: ['Reg' 'IR1' 'IR2' 'IR3']
Column 'LandContour' has unique categories: ['Lvl' 'HLS' 'Bnk' 'Low']
Column 'Utilities' has unique categories: ['AllPub']
Column 'LotConfig' has unique categories: ['Inside' 'Corner' 'FR2' 'CulDSac' 'FR3']
Column 'LandSlope' has unique categories: ['Gtl' 'Mod' 'Sev']
Column 'Neighborhood' has unique categories: ['NAmes' 'Gilbert' 'StoneBr' 'BrDale' 'NPkVill' 'NridgHt' 'Blmngtn'
 'NoRidge' 'Somerst' 'SawyerW' 'Sawyer' 'NWAmes' 'OldTown' 'BrkSide'
 'ClearCr' 'SWISU' 'Edwards' 'CollgCr' 'Crawfor' 'Blueste' 'IDOTRR'
 'Mitchel' 'Timber' 'MeadowV' 'Veenker']
Column 'Condition1' has unique categories: ['Feedr' 'Norm' 'PosN' 'RRNe' 'Artery' 'RRNn' 'PosA' 'RRAn' 'RRAe']
Column 'Condition2' has unique categories: ['Norm' 'Feedr' 'PosA' 'PosN' 'Artery']
Column 'BldgType' has unique categories: ['1Fam' 'TwnhsE' 'Twnhs' 'Duplex' '2fmCon']
Column 'HouseStyle' has unique categories: ['1Story' '2Story' 'SLvl' '1.5Fin' 'SFoyer' '2.5Unf' '1.5Unf']
Column 'OverallQual' has unique categories: [5 6 8 7 4 9 2 3 10 1]
Column 'OverallCond' has unique categories: [6 5 7 8 2 9 3 4 1]
Column 'RoofStyle' has unique categories: ['Gable' 'Hip' 'Gambrel' 'Flat' 'Mansard' 'Shed']
Column 'RoofMatl' has unique categories: ['CompShg' 'Tar&Grv' 'WdShake' 'WdShngl']
Column 'Exterior1st' has unique categories: ['VinylSd' 'Wd Sdng' 'HdBoard' 'Plywood' 'MetalSd' 'CemntBd' 'WdShing'
 'BrkFace' 'AsbShng' 'BrkComm' 'Stucco' 'AsphShn' 'CBlock']
Column 'Exterior2nd' has unique categories: ['VinylSd' 'Wd Sdng' 'HdBoard' 'Plywood' 'MetalSd' 'Brk Cmn' 'CmentBd'
 'ImStucc' 'Wd Shng' 'AsbShng' 'Stucco' 'CBlock' 'BrkFace' 'AsphShn'
 'Stone']
Column 'MasVnrType' has unique categories: ['None' 'BrkFace' 'Stone' 'BrkCmn']
Column 'ExterQual' has unique categories: ['TA' 'Gd' 'Ex' 'Fa']
Column 'ExterCond' has unique categories: ['TA' 'Gd' 'Fa' 'Po' 'Ex']
Column 'Foundation' has unique categories: ['CBlock' 'PConc' 'BrkTil' 'Stone' 'Slab' 'Wood']
Column 'BsmtQual' has unique categories: ['TA' 'Gd' 'Ex' 'Fa' 'Not Available']
Column 'BsmtCond' has unique categories: ['TA' 'Po' 'Fa' 'Gd' 'Not Available']
Column 'BsmtExposure' has unique categories: ['No' 'Gd' 'Mn' 'Av' 'Not Available']
Column 'BsmtFinType1' has unique categories: ['Rec' 'ALQ' 'GLQ' 'Unf' 'BLQ' 'LwQ' 'Not Available']
Column 'BsmtFinType2' has unique categories: ['LwQ' 'Unf' 'Rec' 'BLQ' 'GLQ' 'ALQ' 'Not Available']
Column 'Heating' has unique categories: ['GasA' 'GasW' 'Grav' 'Wall']
Column 'HeatingQC' has unique categories: ['TA' 'Gd' 'Ex' 'Fa' 'Po']
Column 'CentralAir' has unique categories: ['Y' 'N']
Column 'Electrical' has unique categories: ['SBrkr' 'FuseA' 'FuseF' 'FuseP']
Column 'KitchenQual' has unique categories: ['TA' 'Gd' 'Ex' 'Fa']
Column 'Functional' has unique categories: ['Typ' 'Min2' 'Min1' 'Mod' 'Maj1' 'Sev' 'Maj2']
Column 'FireplaceQu' has unique categories: ['Not Available' 'TA' 'Gd' 'Po' 'Fa' 'Ex']
Column 'GarageType' has unique categories: ['Attchd' 'Detchd' 'BuiltIn' 'Not Available' 'Basment' '2Types' 'CarPort']
Column 'GarageFinish' has unique categories: ['Unf' 'Fin' 'RFn' 'Not Available']
Column 'GarageQual' has unique categories: ['TA' 'Not Available' 'Fa' 'Gd' 'Po']
Column 'GarageCond' has unique categories: ['TA' 'Not Available' 'Fa' 'Gd' 'Po' 'Ex']
Column 'PavedDrive' has unique categories: ['Y' 'N' 'P']
Column 'PoolQC' has unique categories: ['Not Available' 'Ex' 'Gd']
Column 'Fence' has unique categories: ['MnPrv' 'Not Available' 'GdPrv' 'GdWo' 'MnWw']
Column 'MiscFeature' has unique categories: ['Not Available' 'Gar2' 'Shed' 'Othr']
Column 'SaleType' has unique categories: ['WD' 'COD' 'New' 'ConLD' 'Oth' 'Con' 'ConLw' 'ConLI' 'CWD']
Column 'SaleCondition' has unique categories: ['Normal' 'Partial' 'Abnorml' 'Family' 'Alloca' 'AdjLand']
In [75]:
numerical_columns.isna().sum().sort_values(ascending=False)
Out[75]:
LotFrontage      227
GarageYrBlt       78
MasVnrArea        15
BsmtHalfBath       2
BsmtFullBath       2
BsmtFinSF1         1
BsmtFinSF2         1
BsmtUnfSF          1
TotalBsmtSF        1
GarageCars         1
GarageArea         1
MoSold             0
WoodDeckSF         0
OpenPorchSF        0
3SsnPorch          0
EnclosedPorch      0
MiscVal            0
Fireplaces         0
PoolArea           0
ScreenPorch        0
HalfBath           0
TotRmsAbvGrd       0
KitchenAbvGr       0
BedroomAbvGr       0
LotArea            0
FullBath           0
GrLivArea          0
LowQualFinSF       0
2ndFlrSF           0
1stFlrSF           0
YearRemodAdd       0
YearBuilt          0
YrSold             0
dtype: int64
In [76]:
continous = ['LotFrontage','BsmtFinSF1','MasVnrArea','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF']
numerical_imputer = SimpleImputer(strategy='mean')
numerical_columns[continous] = numerical_imputer.fit_transform(numerical_columns[continous])
In [77]:
discrete = ['GarageYrBlt','BsmtHalfBath','BsmtFullBath','GarageCars','GarageArea']
numerical_imputer = SimpleImputer(strategy='most_frequent')
numerical_columns[discrete] = numerical_imputer.fit_transform(numerical_columns[discrete])

9.3 Preparing Dataset for Model Prediction

In [78]:
numerical_columns.reset_index(drop=True, inplace=True)
categorical_columns.reset_index(drop=True, inplace=True)

X_test = pd.concat([numerical_columns, categorical_columns], axis=1)
In [79]:
X_test = pd.get_dummies(X_test, drop_first=True)
In [80]:
# Identify missing columns from best_features in the test_set
missing_columns = [col for col in best_features if col not in X_test.columns]

# Add missing columns with zero values
for col in missing_columns:
    X_test[col] = 0
In [81]:
X_test=X_test[best_features]

9.4 Predicting House Sale Price in Ames Test Set

In [82]:
y_pred = model_ml.predict(X_test)
In [83]:
price_prediction = pd.DataFrame({'Id': test_set.index, 'SalePrice': y_pred.tolist()})
In [84]:
price_prediction.to_csv('price_prediction.csv', index=False)

10. Conclusion¶

Based on the model development, it is apparent that Multiple Linear Regression fitted on the training set using mlextend package that is scikit learning friendly returned the highest performance indicated by the highest R-Squared as well as the lowest Error measures compared to the other models.

The features used in the best model are as follows and were used to construct the predictive model to predict the house sale price in Ames, Iowa.

['LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'Fireplaces', 'GarageArea', 'WoodDeckSF', '3SsnPorch', 'ScreenPorch', 'MSZoning_FV', 'MSZoning_RL', 'Street_Pave', 'LandContour_Low', 'LotConfig_CulDSac', 'LotConfig_FR2', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_StoneBr', 'Condition1_Norm', 'Condition1_RRAn', 'Condition2_Norm', 'Condition2_PosN', 'Condition2_RRNn', 'BldgType_TwnhsE', 'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2Story', 'HouseStyle_SLvl', 'RoofStyle_Mansard', 'RoofMatl_WdShngl', 'Exterior1st_BrkFace', 'Exterior1st_Wd Sdng', 'Exterior1st_WdShing', 'Exterior2nd_Brk Cmn', 'Exterior2nd_CmentBd', 'Exterior2nd_Other', 'Exterior2nd_Stucco', 'Exterior2nd_VinylSd', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA', 'ExterCond_Fa', 'Foundation_PConc', 'Foundation_Wood', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtExposure_Gd', 'BsmtExposure_Mn', 'BsmtExposure_No', 'BsmtFinType1_BLQ', 'BsmtFinType1_GLQ', 'BsmtFinType1_LwQ', 'BsmtFinType2_BLQ', 'BsmtFinType2_GLQ', 'BsmtFinType2_Unf', 'Heating_Grav', 'Heating_Wall', 'HeatingQC_Gd', 'HeatingQC_Po', 'HeatingQC_TA', 'Electrical_FuseF', 'Electrical_Mix', 'KitchenQual_Gd', 'Functional_Min1', 'FireplaceQu_Not Available', 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Not Available', 'GarageFinish_Not Available', 'Fence_MnWw', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_New', 'SaleCondition_Normal', 'MSSubClass_30', 'MSSubClass_85', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_190', 'OverallQual_6', 'OverallQual_7', 'OverallQual_8', 'OverallQual_9', 'OverallQual_10', 'OverallCond_3', 'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7']