Putranegara Riauwindu, putrangr@bu.edu
Abstract:
This project presents the development of a predictive model for forecasting house sales prices in Ames, Iowa. The model utilizes essential house features, such as the number of bedrooms, bathrooms, and basement, in combination with the proximity to main infrastructure, such as roads and railways.
To construct and evaluate the model, we employ the comprehensive Ames Housing dataset compiled by Dean De Cock (accessible through this link : https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data). The forecasting mechanism leverages the Multiple Linear Regression algorithms, which considers the selected features' collective impact on the predicted response. For enhanced accuracy, the Stepwise Regression technique is utilized for feature selection, identifying the most significant contributors. Please note that the Ames Housing dataset comes with two separate file of train and test set. We will use train set to build the model and find the model with best performance and make prediction of house SalePrice in Ames using of the best model on the test set.
To ensure dataset reliability and integrity without compromising the dataset originality by minimizing bias, missing values within the dataset will only be imputed if the missing value does not exceed 10% out of all available value within that particular columns. Imputation will employ Multiple Linear Regression for continous variables and k Nearest Neighbors for categorical/ordinal variables. We will use the combination of matplotlib, seaborn, and plotly as a means of visualizing the analysis and model (if necessary).
The final output is a Multiple Linear Regression Model capable of predicting house sales prices in Ames based on specified features. This model offers valuable insights for prospective buyers and sellers in the Ames, Iowa real estate market.
Keywords: Predictive Model, House Sales Prices, Ames, Iowa, House Features, Proximity, Infrastructure, Multiple Linear Regression, Stepwise Regression, Imputation, KNN.
# Packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
import missingno as msno
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics
import plotly.express as px
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, explained_variance_score
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
import time
from sklearn.impute import SimpleImputer
from mlxtend.feature_selection import SequentialFeatureSelector
# Output Display Setting
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
warnings.filterwarnings("ignore", category=UserWarning)
Setting Plotly Display Configuration
def enable_plotly_in_cell():
import IPython
from plotly.offline import init_notebook_mode
display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
init_notebook_mode(connected=False)
data = pd.read_csv('train.csv')
data.shape
(1460, 81)
data.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.0 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.0 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.0 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0.0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.0 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.0 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.0 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
The training dataset contains 1460 observations with 80 total feature columns (excluding the response column which is the 'SalePrice'.
We will now proceed to check whether there are missing values in the training set.
# Calculating missing value for all of the columns
missing_values = data.isna().sum()/data.shape[0]
# Filter columns with more than 0 missing values
columns_with_missing = missing_values[missing_values > 0].sort_values(ascending=False)
# Display the filtered result
print(columns_with_missing)
PoolQC 0.995205 MiscFeature 0.963014 Alley 0.937671 Fence 0.807534 FireplaceQu 0.472603 LotFrontage 0.177397 GarageType 0.055479 GarageYrBlt 0.055479 GarageFinish 0.055479 GarageQual 0.055479 GarageCond 0.055479 BsmtExposure 0.026027 BsmtFinType2 0.026027 BsmtFinType1 0.025342 BsmtCond 0.025342 BsmtQual 0.025342 MasVnrArea 0.005479 MasVnrType 0.005479 Electrical 0.000685 dtype: float64
# Visualizing the training dataset missing values
pastel_colors = sns.color_palette("pastel")
msno.bar(data, color=pastel_colors)
plt.show()
The plot and calculation above showed that there are severals columns identified to having multiple missing values which are:
After reviewing the dataset description, it appears that only the columns "LotFrontage" and "Electrical" contain genuine missing values. The other columns were previously identified to have apparent missing values due to the presence of "NA" values, but these "NA" values actually indicate the unavailability of a specific feature in those particular observations. Therefore, these "NA" values are not true missing values.
In summary, the dataset has genuine missing values in "LotFrontage" and "Electrical", while the "NA" values in other columns do not represent missing data but rather indicate the absence of certain features.
Due to the size of the missing values, the observations that contains missing value in column "LotFrontage" will be removed while the one in "Electrical" column will be imputed using KNN model in order to minimize bias. The missing value will be imputed first before we explore the dataset in the Exploratory Data Analysis (EDA) phase due to its only being one observation.
2.3.1 Adjusting the NA values in the column that imply the absence of that particular feature
# Columns to exclude from imputation
exclude_columns = ['LotFrontage', 'Electrical', 'MasVnrArea', 'GarageYrBlt']
zero_columns = ['MasVnrArea', 'GarageYrBlt']
# Fill missing values with "Not Available" for non-excluded columns
data[data.columns.difference(exclude_columns)] = data[data.columns.difference(exclude_columns)].fillna("Not Available")
# Fill missing values in 'MasVnrArea' and 'GarageYrBlt' with 0 as they correspond to the associated related columns
data['MasVnrArea'].fillna(0, inplace=True)
data['GarageYrBlt'].fillna(0, inplace=True)
2.3.2 Checking whether the designated column`s missing value have been properly adjusted
# List of columns with categorical nature
categorical_vars = data.select_dtypes(include='object').columns.tolist()
# Loop through each categorical column and print unique categories
for col in categorical_vars:
num_categories = data[col].unique()
print(f"Column '{col}' has unique categories: {num_categories}")
Column 'MSZoning' has unique categories: ['RL' 'RM' 'C (all)' 'FV' 'RH'] Column 'Street' has unique categories: ['Pave' 'Grvl'] Column 'Alley' has unique categories: ['Not Available' 'Grvl' 'Pave'] Column 'LotShape' has unique categories: ['Reg' 'IR1' 'IR2' 'IR3'] Column 'LandContour' has unique categories: ['Lvl' 'Bnk' 'Low' 'HLS'] Column 'Utilities' has unique categories: ['AllPub' 'NoSeWa'] Column 'LotConfig' has unique categories: ['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3'] Column 'LandSlope' has unique categories: ['Gtl' 'Mod' 'Sev'] Column 'Neighborhood' has unique categories: ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes' 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR' 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill' 'Blmngtn' 'BrDale' 'SWISU' 'Blueste'] Column 'Condition1' has unique categories: ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe'] Column 'Condition2' has unique categories: ['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe'] Column 'BldgType' has unique categories: ['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs'] Column 'HouseStyle' has unique categories: ['2Story' '1Story' '1.5Fin' '1.5Unf' 'SFoyer' 'SLvl' '2.5Unf' '2.5Fin'] Column 'RoofStyle' has unique categories: ['Gable' 'Hip' 'Gambrel' 'Mansard' 'Flat' 'Shed'] Column 'RoofMatl' has unique categories: ['CompShg' 'WdShngl' 'Metal' 'WdShake' 'Membran' 'Tar&Grv' 'Roll' 'ClyTile'] Column 'Exterior1st' has unique categories: ['VinylSd' 'MetalSd' 'Wd Sdng' 'HdBoard' 'BrkFace' 'WdShing' 'CemntBd' 'Plywood' 'AsbShng' 'Stucco' 'BrkComm' 'AsphShn' 'Stone' 'ImStucc' 'CBlock'] Column 'Exterior2nd' has unique categories: ['VinylSd' 'MetalSd' 'Wd Shng' 'HdBoard' 'Plywood' 'Wd Sdng' 'CmentBd' 'BrkFace' 'Stucco' 'AsbShng' 'Brk Cmn' 'ImStucc' 'AsphShn' 'Stone' 'Other' 'CBlock'] Column 'MasVnrType' has unique categories: ['BrkFace' 'None' 'Stone' 'BrkCmn' 'Not Available'] Column 'ExterQual' has unique categories: ['Gd' 'TA' 'Ex' 'Fa'] Column 'ExterCond' has unique categories: ['TA' 'Gd' 'Fa' 'Po' 'Ex'] Column 'Foundation' has unique categories: ['PConc' 'CBlock' 'BrkTil' 'Wood' 'Slab' 'Stone'] Column 'BsmtQual' has unique categories: ['Gd' 'TA' 'Ex' 'Not Available' 'Fa'] Column 'BsmtCond' has unique categories: ['TA' 'Gd' 'Not Available' 'Fa' 'Po'] Column 'BsmtExposure' has unique categories: ['No' 'Gd' 'Mn' 'Av' 'Not Available'] Column 'BsmtFinType1' has unique categories: ['GLQ' 'ALQ' 'Unf' 'Rec' 'BLQ' 'Not Available' 'LwQ'] Column 'BsmtFinType2' has unique categories: ['Unf' 'BLQ' 'Not Available' 'ALQ' 'Rec' 'LwQ' 'GLQ'] Column 'Heating' has unique categories: ['GasA' 'GasW' 'Grav' 'Wall' 'OthW' 'Floor'] Column 'HeatingQC' has unique categories: ['Ex' 'Gd' 'TA' 'Fa' 'Po'] Column 'CentralAir' has unique categories: ['Y' 'N'] Column 'Electrical' has unique categories: ['SBrkr' 'FuseF' 'FuseA' 'FuseP' 'Mix' nan] Column 'KitchenQual' has unique categories: ['Gd' 'TA' 'Ex' 'Fa'] Column 'Functional' has unique categories: ['Typ' 'Min1' 'Maj1' 'Min2' 'Mod' 'Maj2' 'Sev'] Column 'FireplaceQu' has unique categories: ['Not Available' 'TA' 'Gd' 'Fa' 'Ex' 'Po'] Column 'GarageType' has unique categories: ['Attchd' 'Detchd' 'BuiltIn' 'CarPort' 'Not Available' 'Basment' '2Types'] Column 'GarageFinish' has unique categories: ['RFn' 'Unf' 'Fin' 'Not Available'] Column 'GarageQual' has unique categories: ['TA' 'Fa' 'Gd' 'Not Available' 'Ex' 'Po'] Column 'GarageCond' has unique categories: ['TA' 'Fa' 'Not Available' 'Gd' 'Po' 'Ex'] Column 'PavedDrive' has unique categories: ['Y' 'N' 'P'] Column 'PoolQC' has unique categories: ['Not Available' 'Ex' 'Fa' 'Gd'] Column 'Fence' has unique categories: ['Not Available' 'MnPrv' 'GdWo' 'GdPrv' 'MnWw'] Column 'MiscFeature' has unique categories: ['Not Available' 'Shed' 'Gar2' 'Othr' 'TenC'] Column 'SaleType' has unique categories: ['WD' 'New' 'COD' 'ConLD' 'ConLI' 'CWD' 'ConLw' 'Con' 'Oth'] Column 'SaleCondition' has unique categories: ['Normal' 'Abnorml' 'Partial' 'AdjLand' 'Alloca' 'Family']
# Calculating missing value for all of the columns
missing_values = data.isna().sum()/data.shape[0]
# Filter columns with more than 0 missing values
columns_with_missing = missing_values[missing_values > 0].sort_values(ascending=False)
# Display the filtered result
print(columns_with_missing)
LotFrontage 0.177397 Electrical 0.000685 dtype: float64
The designated columns have been properly adjusted. Now we will go over to handle the remaining missing values in the 'LotFrontage' and 'Electrical' columns. From the information above, 'Electrical' only has 1 observation with missing value.
2.3.3 Removing observation with missing value in "LotFrontage" columns
data.dropna(subset=['LotFrontage'], inplace=True)
2.3.4 Imputing observation with missing value in 'Electrical' column
# Checking for the 'Electrical' column unique values
data['Electrical'].unique()
array(['SBrkr', 'FuseF', 'FuseA', 'FuseP', 'Mix', nan], dtype=object)
'Electrical' column apparently has 5 different categories that describe the electrical system type for each observation. This info later confirmed by crosschecking to the dataset description.
We will build KNN Model to predict what kind of electrical system the observation with missing value in that column have.
KNN model will be built in the following sequences:
2.3.5 Subsetting the dataset to numeric variables only and add response column (Electrical)
# Creating master data for knn modeling
knn = data.select_dtypes(include=['int64', 'float64'])
knn.drop('Id', axis=1, inplace=True)
knn['Electrical'] = data['Electrical']
2.3.6 Separating the dataset from the observation with electrical missing value
# Creating subset of dataset to be predicted
electrical_missing = knn[knn['Electrical'].isna()]
electrical_missing.drop('Electrical', axis=1, inplace=True)
knn.dropna(subset=['Electrical'], inplace=True)
2.3.7 Preparing dataset for KNN Model building
knn['Electrical'].value_counts()
SBrkr 1091 FuseA 80 FuseF 25 FuseP 3 Mix 1 Name: Electrical, dtype: int64
Since 'Mix' category within the Electrical column only has 1 observation, it will not be suitable for data partitioning for model building. For this analysis, we will drop the observation with 'Electrical' value == 'Mix'.
It is also noteworthy that the naive-rate for this classification process is around 90% that will predict the SBrkr as the output label for the missing value.
knn = knn[knn['Electrical'] != 'Mix']
# Subsetting the predictor and response columns
X = knn.drop('Electrical', axis=1)
y = knn['Electrical']
2.3.8 Training KNN Model on Training Dataset
## Partitioning dataset into training and test set
X_train,X_test,y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=256, stratify=y)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# Training KNN Model with initial k=3
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train_sc,y_train)
KNeighborsClassifier(n_neighbors=3)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KNeighborsClassifier(n_neighbors=3)
2.3.9 Finding the optimum number of K through Accuracy measure
k_values = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
accuracy_results = []
for k in k_values:
classifier = KNeighborsClassifier(n_neighbors=k, p=2, metric='euclidean')
classifier.fit(X_train_sc, y_train)
y_pred = classifier.predict(X_test_sc)
accuracy = accuracy_score(y_test, y_pred)
accuracy_results.append((k, accuracy))
print(accuracy_results)
[(3, 0.8805555555555555), (5, 0.8861111111111111), (7, 0.8972222222222223), (9, 0.9083333333333333), (11, 0.9138888888888889), (13, 0.9138888888888889), (15, 0.9138888888888889), (17, 0.9138888888888889), (19, 0.9138888888888889), (21, 0.9138888888888889)]
The accuracy measures returned that k=11 is the best k with the highest accuracy at 0.91. We will build the KNN model with k=11 and make prediction for the missing value.
classifier = KNeighborsClassifier(n_neighbors=11)
classifier.fit(X_train_sc,y_train)
KNeighborsClassifier(n_neighbors=11)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KNeighborsClassifier(n_neighbors=11)
2.3.10 Making prediction on the missing value of Electrical column and imputing it to the original training dataset
scaler = StandardScaler()
electrical_missing_sc = scaler.fit_transform(electrical_missing)
y_pred_electrical = classifier.predict(electrical_missing_sc)
y_pred_electrical
array(['SBrkr'], dtype=object)
data['Electrical'].fillna('SBrkr', inplace=True)
data.isna().sum().any()
False
All missing values have been imputed and cleaned ready for the Multiple Linear Regression modeling to predict the sales price of each of the houses.
data.set_index('Id', inplace=True)
# Features
X = data.drop(columns=['SalePrice'])
# Response
y = data['SalePrice']
numeric_features = X.select_dtypes(include=['int64', 'float64'])
categorical_features = X.select_dtypes(include=['object'])
Lets check whether the features have been properly grouped or not.
categorical_features.head()
MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | Heating | HeatingQC | CentralAir | Electrical | KitchenQual | Functional | FireplaceQu | GarageType | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||||||||||||||||||||||||
1 | RL | Pave | Not Available | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | BrkFace | Gd | TA | PConc | Gd | TA | No | GLQ | Unf | GasA | Ex | Y | SBrkr | Gd | Typ | Not Available | Attchd | RFn | TA | TA | Y | Not Available | Not Available | Not Available | WD | Normal |
2 | RL | Pave | Not Available | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | Gable | CompShg | MetalSd | MetalSd | None | TA | TA | CBlock | Gd | TA | Gd | ALQ | Unf | GasA | Ex | Y | SBrkr | TA | Typ | TA | Attchd | RFn | TA | TA | Y | Not Available | Not Available | Not Available | WD | Normal |
3 | RL | Pave | Not Available | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | BrkFace | Gd | TA | PConc | Gd | TA | Mn | GLQ | Unf | GasA | Ex | Y | SBrkr | Gd | Typ | TA | Attchd | RFn | TA | TA | Y | Not Available | Not Available | Not Available | WD | Normal |
4 | RL | Pave | Not Available | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | Gable | CompShg | Wd Sdng | Wd Shng | None | TA | TA | BrkTil | TA | Gd | No | ALQ | Unf | GasA | Gd | Y | SBrkr | Gd | Typ | Gd | Detchd | Unf | TA | TA | Y | Not Available | Not Available | Not Available | WD | Abnorml |
5 | RL | Pave | Not Available | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | BrkFace | Gd | TA | PConc | Gd | TA | Av | GLQ | Unf | GasA | Ex | Y | SBrkr | Gd | Typ | TA | Attchd | RFn | TA | TA | Y | Not Available | Not Available | Not Available | WD | Normal |
Categorical features seem to be properly grouped after cross-checking it with the dataset description, now let`s check for the numeric features.
numeric_features.head()
MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | ||||||||||||||||||||||||||||||||||||
1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | 0 | 150 | 856 | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | 8 | 0 | 2003.0 | 2 | 548 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 |
2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | 0 | 284 | 1262 | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | 6 | 1 | 1976.0 | 2 | 460 | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 |
3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | 0 | 434 | 920 | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | 6 | 1 | 2001.0 | 2 | 608 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 |
4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | 0 | 540 | 756 | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | 7 | 1 | 1998.0 | 3 | 642 | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 |
5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | 0 | 490 | 1145 | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | 9 | 1 | 2000.0 | 3 | 836 | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 |
It seems that there are couples of features identified as numeric but are actually categorical/ordinal in nature as per the dataset description. That features are 'MSSubClass', 'OverallQual', 'OverallCond', and 'MoSold'.
For this analysis, 'MSSubClass', 'OverallQual', 'OverallCond' features will be converted to categorical and moved into categorical_features, while 'MoSold' will be removed as it is redundant information provided by 'YrSold'.
numeric_features[['MSSubClass', 'OverallQual', 'OverallCond']] = numeric_features[['MSSubClass', 'OverallQual', 'OverallCond']].astype('object')
categorical_features[['MSSubClass', 'OverallQual', 'OverallCond']] = numeric_features[['MSSubClass', 'OverallQual', 'OverallCond']]
numeric_features.drop(columns=['MoSold', 'MSSubClass', 'OverallQual', 'OverallCond'], inplace=True)
Now let us explore the relationship between the features and response to see whether there are any interesting or discernable pattern through the plotting of each of the features against the response.
For this objective, response will be plotted against numeric features and categorical features separately.
4.1.1 Sale Price vs Numerical Features Scatterplot
# Set the number of columns for the grid
num_columns = 4
# Calculate the number of rows needed to accommodate all features
num_features = numeric_features.shape[1]
num_rows = (num_features - 1) // num_columns + 1
# Create subplots with the specified number of rows and columns
fig, axes = plt.subplots(num_rows, num_columns, figsize=(num_columns * 5, num_rows * 5))
# Flatten the axes array if there is only one row
if num_rows == 1:
axes = [axes]
# Plot pairwise scatterplots
for i, column in enumerate(numeric_features.columns):
sns.scatterplot(x=column, y='SalePrice', data=numeric_features.join(y), ax=axes[i // num_columns, i % num_columns])
# Remove any empty subplots
for i in range(num_features, num_rows * num_columns):
fig.delaxes(axes.flatten()[i])
plt.show()
From the plot above, we could see that the majority of the numerical features, although some of the features does not exhibit clear relationship, has somewhat linear relationship with the SalePrice, hinting that Multiple Linear Regression might be the good choice of algorithm to use to predict the response value. We will now check the correlation value between the SalePrice and Numeric Features to confirm the findings.
4.1.2 Sale Price vs Numerical Features Correlation
# Calculate the correlation matrix between SalePrice and Numerical Features
sale_numfeatures_corr = numeric_features.join(y).corr()
sale_numfeatures_corr = sale_numfeatures_corr['SalePrice'].abs().sort_values(ascending=False).drop('SalePrice')
sale_numfeatures_corr
GrLivArea 0.703557 GarageCars 0.647302 GarageArea 0.631761 TotalBsmtSF 0.626977 1stFlrSF 0.620159 FullBath 0.566800 YearBuilt 0.538913 TotRmsAbvGrd 0.537215 YearRemodAdd 0.519471 MasVnrArea 0.492084 Fireplaces 0.479353 BsmtFinSF1 0.403195 LotFrontage 0.351799 WoodDeckSF 0.345054 OpenPorchSF 0.326530 LotArea 0.311416 2ndFlrSF 0.308957 HalfBath 0.284770 GarageYrBlt 0.262026 BsmtFullBath 0.237637 BsmtUnfSF 0.215044 BedroomAbvGr 0.164554 EnclosedPorch 0.164004 KitchenAbvGr 0.140626 ScreenPorch 0.121166 PoolArea 0.092924 MiscVal 0.049435 3SsnPorch 0.035755 BsmtHalfBath 0.028450 LowQualFinSF 0.022339 YrSold 0.020601 BsmtFinSF2 0.017682 Name: SalePrice, dtype: float64
The majority of numerical features show a relatively strong linear relationship, either positive or negative, with the SalePrice. However, there are a few features that seem to be indifferent, namely 'PoolArea', 'MiscVal', '3SsnPorch', 'BsmtHalfBath', 'LowQualFinSF', 'YrSold', and 'BsmtFinSF2'. These features will be taken into consideration for the feature selection in the later section.
Now we will investigate visually the relationship between the SalePrice and the categorical features to see whether there are any meaningful relationship between them.
# Set the number of columns for the grid
num_columns = 4
# Calculate the number of rows needed to accommodate all features
cat_features = categorical_features.shape[1]
num_rows = (cat_features - 1) // num_columns + 1
# Create subplots with the specified number of rows and columns
fig, axes = plt.subplots(num_rows, num_columns, figsize=(num_columns * 5, num_rows * 5))
# Flatten the axes array if there is only one row
if num_rows == 1:
axes = [axes]
# Plot pairwise scatterplots
for i, column in enumerate(categorical_features.columns):
sns.boxplot(x=column, y='SalePrice', data=categorical_features.join(y), ax=axes[i // num_columns, i % num_columns])
# Remove any empty subplots
for i in range(cat_features, num_rows * num_columns):
fig.delaxes(axes.flatten()[i])
plt.show()
From the above boxplot we could observe that the majority of the categorical predictors seems to have some positive relationship with the SalePrice indicating that these features could be used to predict the difference in SalePrice given the various categorical features. However there are couple of predictors that might not be as useful to predict the SalePrice due to the feature only have one level or the difference between levels in that particular features does not differ very much, those features are: 'LandSlope' and 'Utilities'. These features will be taken into consideration for feature selection in the later section.
We will now check whether there are heavy correlation between numerical features that might cause multicollinearity when building the Multiple Linear Regression model.
4.3.1 Pairwise Correlation Between Numerical Features
enable_plotly_in_cell
# Calculate the correlation matrix
correlation_matrix = numeric_features.corr()
# Create the heatmap using Plotly
fig = px.imshow(correlation_matrix, x=numeric_features.columns, y=numeric_features.columns,
color_continuous_scale='RdBu', title='Correlation Heatmap')
fig.show()
Referring to the "Regression Diagnostics: Identifying Influential Data and Sources of Collinearity" book written by Belsley et al, we could use the correlation threshold of 0.7 to distinguish the independent features that might pose the risk of multicollinearity and based on this threshold, there are couple of features that might pose risk of multicollinearity which are 'GrLivArea' & 'TotRmsAbvGrd', '2ndFlrSF' & 'GrLivArea', 'TotalBsmtSF' & '1stFlrSF', 'GarageCars' & 'GarageArea'.
In order to confirm it, we will conduct Variance Inflation Factor (VIF) to validate the finding.
4.3.2 Variance Inflation Factor (VIF) Between Numerical Features
The Variance Inflation Factor (VIF) is a statistical measure used to assess multicollinearity in a regression model. Multicollinearity occurs when two or more independent variables are highly correlated, making it challenging to interpret their individual effects.
VIF quantifies how much the variance of an estimated regression coefficient increases when a particular predictor is included in the model compared to when it is excluded. A VIF of 1 indicates no multicollinearity, while values above 1 suggest increasing levels of multicollinearity.
The formula for VIF is VIF_i = 1 / (1 - R²_i)
The VIF value represents how much the variance of the coefficient of "X_i" is inflated due to multicollinearity. If there is no multicollinearity, the VIF will be 1, indicating that the variance of the coefficient is not affected by correlations with other predictors. However, as the correlation between "X_i" and other predictors increases, the VIF value will rise, indicating more severe multicollinearity.
Interpretation of VIF values:
High VIF values indicate that a predictor is highly correlated with other predictors, making it challenging to interpret its independent contribution accurately. In such cases, addressing multicollinearity is important, such as removing correlated predictors or using dimensionality reduction techniques.
We will investigate the VIF for numerical features below.
# Create a DataFrame to store VIF results
vif_data = pd.DataFrame()
vif_data['Feature'] = numeric_features.columns
# Calculate VIF for each feature
vif_data['VIF'] = [variance_inflation_factor(numeric_features.values, i) for i in range(numeric_features.shape[1])]
pd.options.display.float_format = '{:.2f}'.format
# Display VIF results
print(vif_data.sort_values(by='VIF',ascending=False))
Feature VIF 9 1stFlrSF inf 8 TotalBsmtSF inf 12 GrLivArea inf 11 LowQualFinSF inf 10 2ndFlrSF inf 5 BsmtFinSF1 inf 6 BsmtFinSF2 inf 7 BsmtUnfSF inf 31 YrSold 19197.98 3 YearRemodAdd 16537.37 2 YearBuilt 12261.87 19 TotRmsAbvGrd 81.39 22 GarageCars 38.76 23 GarageArea 31.94 21 GarageYrBlt 30.19 18 KitchenAbvGr 29.82 17 BedroomAbvGr 29.37 15 FullBath 26.43 0 LotFrontage 14.94 13 BsmtFullBath 3.54 1 LotArea 3.48 16 HalfBath 3.37 20 Fireplaces 2.84 4 MasVnrArea 1.89 24 WoodDeckSF 1.88 25 OpenPorchSF 1.86 26 EnclosedPorch 1.45 28 ScreenPorch 1.22 14 BsmtHalfBath 1.19 29 PoolArea 1.16 30 MiscVal 1.09 27 3SsnPorch 1.04
VIF analysis showed that there are multiple of numerical features that exhibited very high VIF thus indicating high risk of multicollinearity when loaded into Multiple Linear Regression Model. Using threshold of 10 for VIF, these features poses high risk of multicollinearity: 1stFlrSF, TotalBsmtSF, GrLivArea, LowQualFinSF, 2ndFlrSF, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, YrSold, YearRemodAdd, YearBuilt, TotRmsAbvGrd, GarageCars, GarageArea, GarageYrBlt, KitchenAbvGr, BedroomAbvGr, FullBath, LotFrontage
Combining this information with the information from the pairwise correlation and the visual relation between response and features, I will remove these numerical features and re-perform VIF before loading it up into the MLR model: 'YrSold', 'LowQualFinSF', 'GrLivArea', 'TotRmsAbvGrd', 'TotalBsmtSF', '1stFlrSF','2ndFlrSF', 'BsmtFinSF2', 'GarageCars', 'YearRemodAdd', 'YearBuilt', and 'GarageYrBlt'.
We will also perform Stepwise Regression to also perform additional feature selection after fitting the dataset into MLR in later section.
numeric_features_new = numeric_features.drop(columns=['YrSold', 'LowQualFinSF', 'GrLivArea', 'TotRmsAbvGrd',
'TotalBsmtSF', '1stFlrSF','2ndFlrSF', 'BsmtFinSF2',
'GarageCars','YearRemodAdd','GarageYrBlt','YearBuilt'])
# Create a DataFrame to store VIF results
vif_data = pd.DataFrame()
vif_data['Feature'] = numeric_features_new.columns
# Calculate VIF for each feature
vif_data['VIF'] = [variance_inflation_factor(numeric_features_new.values, i) for i in range(numeric_features_new.shape[1])]
pd.options.display.float_format = '{:.2f}'.format
# Display VIF results
print(vif_data.sort_values(by='VIF',ascending=False))
Feature VIF 9 BedroomAbvGr 17.53 7 FullBath 14.63 10 KitchenAbvGr 14.47 0 LotFrontage 13.55 12 GarageArea 9.24 3 BsmtFinSF1 5.37 4 BsmtUnfSF 5.32 1 LotArea 3.34 5 BsmtFullBath 3.18 11 Fireplaces 2.49 8 HalfBath 1.90 13 WoodDeckSF 1.83 14 OpenPorchSF 1.76 2 MasVnrArea 1.73 15 EnclosedPorch 1.22 17 ScreenPorch 1.19 6 BsmtHalfBath 1.15 18 PoolArea 1.09 19 MiscVal 1.07 16 3SsnPorch 1.04
The VIF value for the selected numerical features seems low enough compared to the previous VIF value before the removal of the high VIF value features. For this analysis, I will use these combination of features to load into MLR and perform further feature selection using Stepwise Regression.
Please note that for categorical features, the complete feature selection will be done through the Stepwise Regression since there is no straightforward method that is analogous to the VIF in numerical features.
We will now fit the MLR model using the dataset with features preliminary selected during the previous analysis and check for the model performance. For this analysis, this model will be referred to as base model
# Defining Features
numeric_features.reset_index(drop=True, inplace=True)
categorical_features.reset_index(drop=True, inplace=True)
X = pd.concat([numeric_features, categorical_features], axis=1)
# Removing features as per previous analysis
X = X.drop(columns=['YrSold', 'LowQualFinSF', 'GrLivArea', 'TotRmsAbvGrd',
'TotalBsmtSF', '1stFlrSF','2ndFlrSF', 'BsmtFinSF2',
'GarageCars','YearRemodAdd','GarageYrBlt','YearBuilt','Utilities', 'LandSlope'])
# Dummifying Categorical Features
X = pd.get_dummies(X, drop_first=True)
# Checking the number of Features after Dummification
X.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1201 entries, 0 to 1200 Columns: 264 entries, LotFrontage to OverallCond_9 dtypes: float64(2), int64(18), uint8(244) memory usage: 474.0 KB
# Splitting the Features and Response into 70% Training and 30% Test Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=296)
# Training the MLR model on training set
base_model = LinearRegression()
base_model.fit(X_train,y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
# Creating Prediction on test set
y_pred_base_model = base_model.predict(X_test)
# Calculate performance metrics
mae = mean_absolute_error(y_test, y_pred_base_model)
mse = mean_squared_error(y_test, y_pred_base_model)
rmse = mean_squared_error(y_test, y_pred_base_model, squared=False)
r2 = r2_score(y_test, y_pred_base_model)
# Display the performance metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)
Mean Absolute Error (MAE): 24125.98192970704 Mean Squared Error (MSE): 3411215062.296772 Root Mean Squared Error (RMSE): 58405.60814080076 R-squared (R²): 0.474477543732041
The Multiple Linear Regression fitted on the selected numerical features previously screened using VIF and the rest of the categorical features performed quite decent but not quite satisfactory as it could only explain around 47% of the variability within the dataset.
We will now try to perform further feature selection using Stepwise Regression (Forward and Backward) using statsmodel and mlextend package in an attempt to decrease the RMSE as well as increasing the R-Squared.
Stepwise regression is a feature selection technique used to build a regression model by iteratively selecting or removing features from the model based on certain statistical criteria. It aims to find the best subset of features that provides the most accurate predictions while avoiding overfitting.
There are two main types of stepwise regression: forward stepwise regression and backward stepwise regression.
1. Forward Stepwise Regression:
Start with an empty model that includes only the intercept.
Add one feature at a time to the model, selecting the feature that results in the best improvement in model fit (e.g., higher R-squared or lower error).
Continue adding features until a stopping criterion is met (e.g., no more statistically significant features to add or a predefined number of features is reached).
2. Backward Stepwise Regression:
Start with a full model that includes all features.
Remove one feature at a time from the model, selecting the feature to remove based on the least contribution to model fit (e.g., higher p-values or lower reduction in model performance).
Continue removing features until a stopping criterion is met (e.g., all remaining features are statistically significant or a predefined number of features is reached).
Both forward and backward stepwise regression use some statistical measure (e.g., adjusted R-squared, AIC, BIC, p-values) to evaluate the model's fit and the importance of each feature. The goal is to find a balance between model complexity and accuracy, ensuring that the final model has the best predictive power without unnecessary overfitting.
Please note that for this analysis, I will use R-squared as the statistical measure and significance level of 0.05 to define the selection on the subset of features and also use the Forward Stepwise Regression.
6.1 Forward Stepwise Regression using statsmodel
Package
The MLR model built using the features selected from the Stepwise Regression using statsmodel
package will be referred to as model_sm
# Reset the indices of X_train and y_train
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
# Add a constant column to X for the intercept term
X_train = sm.add_constant(X_train)
# Stepwise forward feature selection
def forward_feature_selection(X, y, significance_level=0.05):
# Setting the initial condition
remaining_features = set(X.columns)
selected_features = []
current_score = 0.0
# Looping through all of the available predictors
while len(remaining_features) > 0:
scores_with_candidates = []
# Fitting the linear regression and store the adjusted R-squared value with all available predictors one by one
for feature in remaining_features:
model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[selected_features + [feature]]))).fit()
score = model.rsquared
scores_with_candidates.append((score, feature))
# Sorting the score
scores_with_candidates.sort(reverse=True)
best_new_score, best_feature = scores_with_candidates.pop(0)
# Choosing the combination of predictors based on the highest adjusted R-squared score
if current_score < best_new_score:
remaining_features.remove(best_feature)
selected_features.append(best_feature)
current_score = best_new_score
else:
break
# Fitting the linear regression model with the best combination of features
model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[selected_features]))).fit()
return model, selected_features
**Please note that before running below code cells, the time taken to complete the computation was around 1626 seconds ~ 28 minutes**
# Record the start time
start_time = time.time()
# Perform forward stepwise regression on training set
model_forward, selected_features_forward = forward_feature_selection(X_train, y_train)
# Record the end time
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print("")
print(f"Elapsed time: {elapsed_time:.6f} seconds")
Elapsed time: 1626.735968 seconds
# Print results
print("\nModel Summary:")
print(model_forward.summary())
print("")
print("Selected Features in Forward Stepwise Regression:")
print("")
print(selected_features_forward)
Model Summary: OLS Regression Results ============================================================================== Dep. Variable: SalePrice R-squared: 0.946 Model: OLS Adj. R-squared: 0.924 Method: Least Squares F-statistic: 42.21 Date: Tue, 08 Aug 2023 Prob (F-statistic): 3.94e-271 Time: 16:52:57 Log-Likelihood: -9495.8 No. Observations: 840 AIC: 1.949e+04 Df Residuals: 593 BIC: 2.065e+04 Df Model: 246 Covariance Type: nonrobust ============================================================================================== coef std err t P>|t| [0.025 0.975] ---------------------------------------------------------------------------------------------- const -3.68e+06 7.33e+05 -5.022 0.000 -5.12e+06 -2.24e+06 GarageArea 34.6927 8.254 4.203 0.000 18.481 50.904 ExterQual_TA -4.024e+04 8722.476 -4.613 0.000 -5.74e+04 -2.31e+04 Fireplaces 7257.9277 4021.812 1.805 0.072 -640.801 1.52e+04 OverallQual_10 4.91e+04 4e+04 1.228 0.220 -2.94e+04 1.28e+05 OverallQual_9 1.295e+04 3.86e+04 0.335 0.738 -6.29e+04 8.88e+04 OverallQual_8 -3034.4679 3.8e+04 -0.080 0.936 -7.76e+04 7.15e+04 FullBath 1.885e+04 3186.617 5.914 0.000 1.26e+04 2.51e+04 BsmtFinSF1 56.6231 5.956 9.506 0.000 44.925 68.321 Condition2_PosN -4.485e+05 4.61e+04 -9.727 0.000 -5.39e+05 -3.58e+05 LotArea 1.4904 0.331 4.501 0.000 0.840 2.141 HalfBath 1.049e+04 2912.818 3.601 0.000 4769.542 1.62e+04 SaleType_New 2.092e+04 1.84e+04 1.136 0.257 -1.53e+04 5.71e+04 OverallQual_7 -2.173e+04 3.78e+04 -0.576 0.565 -9.59e+04 5.24e+04 Neighborhood_NoRidge 2.866e+04 1.2e+04 2.383 0.017 5038.792 5.23e+04 BsmtUnfSF 37.1896 5.438 6.839 0.000 26.509 47.870 GarageType_BuiltIn 5.568e+04 1.85e+04 3.011 0.003 1.94e+04 9.2e+04 Neighborhood_Crawfor 8760.5370 1.22e+04 0.718 0.473 -1.52e+04 3.27e+04 BsmtFinType2_Unf -3.216e+04 8939.675 -3.597 0.000 -4.97e+04 -1.46e+04 MSSubClass_60 1243.3843 1e+04 0.124 0.902 -1.85e+04 2.1e+04 OverallCond_7 1.167e+04 2.06e+04 0.565 0.572 -2.89e+04 5.22e+04 BsmtExposure_Gd 1.704e+04 4536.787 3.756 0.000 8130.382 2.6e+04 OverallCond_3 -2.712e+04 2.25e+04 -1.206 0.228 -7.13e+04 1.7e+04 KitchenAbvGr -1.235e+04 8545.471 -1.445 0.149 -2.91e+04 4434.731 Neighborhood_StoneBr 3.073e+04 1.18e+04 2.603 0.009 7540.332 5.39e+04 Exterior1st_BrkFace 2.423e+04 1.67e+04 1.447 0.148 -8650.360 5.71e+04 OverallCond_8 1.268e+04 2.09e+04 0.606 0.545 -2.84e+04 5.38e+04 PoolArea 7122.8701 1343.908 5.300 0.000 4483.472 9762.269 MasVnrArea 20.3662 8.004 2.544 0.011 4.646 36.086 PoolQC_Not Available 3.711e+06 7.28e+05 5.097 0.000 2.28e+06 5.14e+06 PoolQC_Gd -3.402e+05 6.02e+04 -5.653 0.000 -4.58e+05 -2.22e+05 Foundation_CBlock -498.9458 4868.473 -0.102 0.918 -1.01e+04 9062.601 OverallCond_6 3053.2308 2.06e+04 0.148 0.882 -3.74e+04 4.35e+04 BsmtFinType1_GLQ 1.066e+04 3778.382 2.822 0.005 3242.398 1.81e+04 BsmtQual_Gd -1.503e+04 4692.259 -3.204 0.001 -2.42e+04 -5818.647 SaleCondition_Normal 9628.4954 4231.520 2.275 0.023 1317.907 1.79e+04 MSSubClass_70 -6981.5219 1.16e+04 -0.602 0.547 -2.97e+04 1.58e+04 BsmtFinType2_GLQ 9710.9154 1.23e+04 0.790 0.430 -1.44e+04 3.38e+04 Street_Pave 2.345e+04 2.4e+04 0.978 0.328 -2.36e+04 7.05e+04 ExterQual_Gd -3.842e+04 7791.062 -4.931 0.000 -5.37e+04 -2.31e+04 ExterQual_Fa -4.328e+04 2.08e+04 -2.077 0.038 -8.42e+04 -2360.979 SaleCondition_Alloca 2.01e+04 1.26e+04 1.597 0.111 -4617.940 4.48e+04 LandContour_Low -3.626e+04 1.36e+04 -2.661 0.008 -6.3e+04 -9494.401 RoofMatl_WdShngl 6.847e+04 1.71e+04 3.997 0.000 3.48e+04 1.02e+05 MasVnrType_Stone 1.271e+04 1.08e+04 1.180 0.238 -8441.470 3.39e+04 LotFrontage 156.3754 67.226 2.326 0.020 24.345 288.406 LotConfig_CulDSac 1.969e+04 6212.621 3.169 0.002 7488.533 3.19e+04 MSSubClass_30 -6150.8261 6753.588 -0.911 0.363 -1.94e+04 7113.035 Neighborhood_Gilbert -1.984e+04 1.09e+04 -1.814 0.070 -4.13e+04 1641.737 Exterior2nd_VinylSd 2997.3119 1.71e+04 0.175 0.861 -3.06e+04 3.66e+04 MSZoning_FV 5.192e+04 1.9e+04 2.733 0.006 1.46e+04 8.92e+04 WoodDeckSF 16.7247 9.502 1.760 0.079 -1.936 35.385 ScreenPorch 32.7795 19.441 1.686 0.092 -5.401 70.960 Exterior1st_WdShing -1.528e+04 1.7e+04 -0.897 0.370 -4.87e+04 1.82e+04 Neighborhood_SawyerW -2235.8362 1.09e+04 -0.205 0.838 -2.37e+04 1.92e+04 Neighborhood_NridgHt -3267.5009 1.03e+04 -0.317 0.751 -2.35e+04 1.7e+04 OverallCond_9 3.341e+04 2.48e+04 1.349 0.178 -1.52e+04 8.21e+04 Condition2_PosA -4.328e+04 5.02e+04 -0.862 0.389 -1.42e+05 5.54e+04 Neighborhood_BrkSide -1.43e+04 1.37e+04 -1.047 0.296 -4.11e+04 1.25e+04 Condition1_Norm 8529.0876 6311.433 1.351 0.177 -3866.393 2.09e+04 BedroomAbvGr 4393.3319 1894.693 2.319 0.021 672.206 8114.458 BsmtQual_TA -1.159e+04 5937.657 -1.952 0.051 -2.33e+04 68.165 BsmtQual_Fa -1.687e+04 9015.455 -1.871 0.062 -3.46e+04 837.097 RoofMatl_Tar&Grv -3239.4401 2.76e+04 -0.117 0.907 -5.75e+04 5.1e+04 RoofStyle_Mansard 4.41e+04 3.46e+04 1.274 0.203 -2.39e+04 1.12e+05 Neighborhood_SWISU -9792.0620 1.42e+04 -0.688 0.492 -3.77e+04 1.81e+04 Neighborhood_CollgCr -1.266e+04 1.01e+04 -1.254 0.210 -3.25e+04 7172.089 BsmtFinType2_BLQ -2.822e+04 1.1e+04 -2.569 0.010 -4.98e+04 -6642.014 BsmtCond_TA 6863.3447 6153.101 1.115 0.265 -5221.177 1.89e+04 Neighborhood_Veenker 745.7354 1.56e+04 0.048 0.962 -3e+04 3.15e+04 Exterior1st_CemntBd -1.436e+04 2.7e+04 -0.531 0.595 -6.74e+04 3.87e+04 Exterior1st_HdBoard -3135.9166 1.67e+04 -0.187 0.852 -3.6e+04 2.98e+04 BsmtFinType2_Rec -1.78e+04 1.02e+04 -1.752 0.080 -3.78e+04 2152.522 Foundation_Wood -5.411e+04 2.16e+04 -2.503 0.013 -9.66e+04 -1.17e+04 GarageType_Basment 6.042e+04 1.98e+04 3.050 0.002 2.15e+04 9.93e+04 Exterior2nd_Plywood -1083.7407 1.56e+04 -0.070 0.945 -3.17e+04 2.95e+04 KitchenQual_Gd -1.718e+04 4592.874 -3.740 0.000 -2.62e+04 -8155.534 KitchenQual_TA -1.634e+04 5253.060 -3.111 0.002 -2.67e+04 -6027.514 KitchenQual_Fa -2.452e+04 8435.283 -2.907 0.004 -4.11e+04 -7955.132 SaleType_ConLD 3.122e+04 1.61e+04 1.944 0.052 -322.157 6.28e+04 MSSubClass_190 -3.106e+04 3.64e+04 -0.853 0.394 -1.03e+05 4.04e+04 Heating_Grav -1.819e+04 1.95e+04 -0.931 0.352 -5.66e+04 2.02e+04 GarageQual_Fa 2.958e+04 3.61e+04 0.818 0.413 -4.14e+04 1.01e+05 3SsnPorch 100.7975 46.202 2.182 0.030 10.057 191.538 Functional_Min2 -2.04e+04 1.31e+04 -1.553 0.121 -4.62e+04 5402.589 MSSubClass_90 -1.386e+04 1.06e+04 -1.302 0.194 -3.48e+04 7052.677 MSSubClass_50 -9178.6194 1.22e+04 -0.750 0.454 -3.32e+04 1.49e+04 Neighborhood_NPkVill -6316.1920 2.09e+04 -0.302 0.762 -4.73e+04 3.47e+04 Neighborhood_BrDale 6847.6949 1.65e+04 0.415 0.678 -2.55e+04 3.92e+04 LotConfig_FR3 -2.077e+04 1.8e+04 -1.151 0.250 -5.62e+04 1.47e+04 Exterior2nd_Other -3.129e+04 3.07e+04 -1.020 0.308 -9.15e+04 2.9e+04 Exterior1st_ImStucc -3.629e+04 3.37e+04 -1.077 0.282 -1.02e+05 2.99e+04 GarageCond_TA -1.475e+04 4.54e+04 -0.325 0.745 -1.04e+05 7.44e+04 Neighborhood_Edwards -2.9e+04 1.12e+04 -2.590 0.010 -5.1e+04 -7007.572 LotConfig_FR2 -4632.7025 6560.036 -0.706 0.480 -1.75e+04 8251.028 PavedDrive_P -8214.6972 9286.426 -0.885 0.377 -2.65e+04 1e+04 SaleCondition_AdjLand 2.183e+04 2.08e+04 1.051 0.294 -1.9e+04 6.26e+04 BsmtCond_Gd 2241.2995 8101.576 0.277 0.782 -1.37e+04 1.82e+04 RoofStyle_Gambrel 4.598e+04 3.37e+04 1.366 0.172 -2.01e+04 1.12e+05 SaleType_Oth 1.321e+04 1.66e+04 0.796 0.427 -1.94e+04 4.58e+04 MSSubClass_85 -1.87e+04 1.61e+04 -1.165 0.245 -5.02e+04 1.28e+04 LotShape_Reg 2100.5104 2529.498 0.830 0.407 -2867.353 7068.374 Functional_Min1 -1.964e+04 1.38e+04 -1.426 0.154 -4.67e+04 7401.649 SaleType_CWD 1.775e+04 1.62e+04 1.096 0.273 -1.41e+04 4.96e+04 Functional_Maj2 -2.949e+04 2e+04 -1.477 0.140 -6.87e+04 9713.068 Exterior2nd_AsphShn -1.572e+04 2.72e+04 -0.578 0.563 -6.91e+04 3.77e+04 Neighborhood_Sawyer -1.123e+04 1.17e+04 -0.963 0.336 -3.42e+04 1.17e+04 Condition1_RRAe -1.439e+04 1.18e+04 -1.221 0.223 -3.75e+04 8754.700 BsmtFinType1_Rec -2881.5038 4605.507 -0.626 0.532 -1.19e+04 6163.585 Heating_Wall -1.47e+04 3.04e+04 -0.483 0.629 -7.44e+04 4.5e+04 HeatingQC_TA -5913.9539 3130.203 -1.889 0.059 -1.21e+04 233.678 Exterior1st_Stucco 1.14e+04 1.99e+04 0.573 0.567 -2.77e+04 5.05e+04 GarageQual_Gd 5.012e+04 3.81e+04 1.314 0.189 -2.48e+04 1.25e+05 GarageType_Detchd 4.671e+04 1.77e+04 2.643 0.008 1.2e+04 8.14e+04 FireplaceQu_Po -1.636e+04 1.16e+04 -1.405 0.161 -3.92e+04 6511.465 Electrical_SBrkr -3164.3123 4386.739 -0.721 0.471 -1.18e+04 5451.123 Exterior1st_Plywood -1577.7946 1.64e+04 -0.096 0.923 -3.38e+04 3.06e+04 RoofMatl_Membran 7.621e+04 4.54e+04 1.678 0.094 -1.3e+04 1.65e+05 Neighborhood_ClearCr -3.309e+04 1.4e+04 -2.358 0.019 -6.07e+04 -5527.750 SaleType_Con 1.276e+04 1.97e+04 0.647 0.518 -2.6e+04 5.15e+04 OpenPorchSF 23.7145 19.005 1.248 0.213 -13.610 61.039 Exterior2nd_Brk Cmn 1.773e+04 2.8e+04 0.633 0.527 -3.72e+04 7.27e+04 Functional_Mod -2.228e+04 1.66e+04 -1.343 0.180 -5.48e+04 1.03e+04 Neighborhood_Somerst -2.698e+04 1.28e+04 -2.105 0.036 -5.21e+04 -1812.959 Neighborhood_Timber -2.142e+04 1.15e+04 -1.868 0.062 -4.39e+04 1094.740 HouseStyle_2.5Fin -1.916e+04 3.08e+04 -0.623 0.534 -7.96e+04 4.13e+04 GarageCond_Fa -1.274e+04 4.51e+04 -0.283 0.778 -1.01e+05 7.58e+04 MSZoning_RL 2.102e+04 1.7e+04 1.234 0.218 -1.24e+04 5.45e+04 FireplaceQu_TA -1.188e+04 8144.896 -1.459 0.145 -2.79e+04 4115.580 Neighborhood_Blueste 1.462e+04 2.4e+04 0.609 0.543 -3.25e+04 6.17e+04 BsmtExposure_Mn -5161.9351 4492.729 -1.149 0.251 -1.4e+04 3661.661 BsmtExposure_No -3146.3313 3243.215 -0.970 0.332 -9515.917 3223.254 MSSubClass_80 -3.103e+04 1.7e+04 -1.825 0.068 -6.44e+04 2357.109 BsmtFinType2_Not Available -5.534e+04 2.7e+04 -2.047 0.041 -1.08e+05 -2242.609 BsmtQual_Not Available 4.116e+04 3.15e+04 1.308 0.191 -2.06e+04 1.03e+05 BsmtFinType2_LwQ -1.492e+04 1.04e+04 -1.430 0.153 -3.54e+04 5576.840 SaleType_ConLw -2.62e+04 2.02e+04 -1.297 0.195 -6.59e+04 1.35e+04 Condition1_PosN 1.838e+04 1.31e+04 1.402 0.161 -7363.443 4.41e+04 MSZoning_RH 3.156e+04 2.06e+04 1.530 0.127 -8957.079 7.21e+04 Condition2_RRNn 1.915e+04 4.52e+04 0.423 0.672 -6.97e+04 1.08e+05 HouseStyle_1Story -1.687e+04 1.19e+04 -1.423 0.155 -4.02e+04 6416.645 MSSubClass_160 -2.627e+04 1.32e+04 -1.996 0.046 -5.21e+04 -415.349 HouseStyle_SFoyer -1.321e+04 1.69e+04 -0.784 0.433 -4.63e+04 1.99e+04 ExterCond_Fa 1.029e+04 2.55e+04 0.404 0.687 -3.98e+04 6.03e+04 Functional_Typ -1.376e+04 1.17e+04 -1.176 0.240 -3.67e+04 9209.758 Exterior2nd_CBlock -1.714e+04 3.43e+04 -0.500 0.617 -8.45e+04 5.02e+04 Neighborhood_MeadowV -6632.8174 1.66e+04 -0.401 0.689 -3.91e+04 2.59e+04 MSSubClass_45 -1.389e+04 1.62e+04 -0.858 0.391 -4.57e+04 1.79e+04 GarageType_Attchd 4.334e+04 1.77e+04 2.442 0.015 8480.819 7.82e+04 GarageType_CarPort 4.716e+04 2.26e+04 2.086 0.037 2765.401 9.16e+04 GarageType_Not Available 7.418e+04 3.33e+04 2.224 0.027 8683.708 1.4e+05 MSZoning_RM 1.777e+04 1.61e+04 1.101 0.271 -1.39e+04 4.95e+04 BsmtFinType1_LwQ -2242.7798 5430.830 -0.413 0.680 -1.29e+04 8423.220 GarageCond_Po 1.721e+04 3.82e+04 0.450 0.653 -5.79e+04 9.23e+04 GarageCond_Gd -1.187e+04 4.82e+04 -0.246 0.806 -1.07e+05 8.28e+04 Exterior2nd_BrkFace -5777.9880 1.78e+04 -0.324 0.746 -4.08e+04 2.92e+04 CentralAir_Y 6352.4333 5670.502 1.120 0.263 -4784.276 1.75e+04 Foundation_Stone 1.503e+04 1.71e+04 0.880 0.379 -1.85e+04 4.86e+04 OverallQual_2 -1.92e+04 4.26e+04 -0.451 0.652 -1.03e+05 6.44e+04 Neighborhood_Mitchel -2.296e+04 1.17e+04 -1.957 0.051 -4.6e+04 79.189 Exterior2nd_ImStucc 5594.7596 2.04e+04 0.275 0.783 -3.44e+04 4.56e+04 Heating_GasW 9018.8912 9742.849 0.926 0.355 -1.01e+04 2.82e+04 Foundation_PConc 2794.4173 5118.928 0.546 0.585 -7259.017 1.28e+04 LotShape_IR3 7617.4981 1.44e+04 0.531 0.596 -2.06e+04 3.58e+04 FireplaceQu_Fa -1.093e+04 1.02e+04 -1.073 0.284 -3.09e+04 9078.362 Foundation_Slab 6343.4858 1.48e+04 0.428 0.669 -2.28e+04 3.55e+04 MasVnrType_Not Available -4304.6510 1.55e+04 -0.278 0.781 -3.47e+04 2.61e+04 OverallQual_5 -3.652e+04 3.77e+04 -0.969 0.333 -1.11e+05 3.75e+04 FireplaceQu_Gd -7949.3806 7857.964 -1.012 0.312 -2.34e+04 7483.445 FireplaceQu_Not Available -7504.2968 9498.229 -0.790 0.430 -2.62e+04 1.11e+04 Condition2_Norm -5383.5224 3.54e+04 -0.152 0.879 -7.5e+04 6.42e+04 OverallQual_3 -3.947e+04 3.76e+04 -1.051 0.294 -1.13e+05 3.43e+04 MSSubClass_180 -1.749e+04 1.82e+04 -0.962 0.336 -5.32e+04 1.82e+04 PavedDrive_Y 3985.3905 5088.038 0.783 0.434 -6007.376 1.4e+04 BsmtHalfBath -1053.8131 4590.969 -0.230 0.819 -1.01e+04 7962.724 MSSubClass_75 -8260.0142 3.59e+04 -0.230 0.818 -7.87e+04 6.22e+04 RoofMatl_Roll -1.416e+04 2.87e+04 -0.493 0.622 -7.06e+04 4.23e+04 GarageQual_TA 3.526e+04 3.66e+04 0.965 0.335 -3.65e+04 1.07e+05 Neighborhood_IDOTRR -1.696e+04 1.55e+04 -1.095 0.274 -4.74e+04 1.35e+04 Electrical_FuseP 1.369e+04 2.26e+04 0.604 0.546 -3.08e+04 5.82e+04 Electrical_Mix 1.973e+04 4.55e+04 0.434 0.665 -6.96e+04 1.09e+05 HeatingQC_Gd -1537.6994 3139.766 -0.490 0.624 -7704.114 4628.715 Neighborhood_NAmes -1.901e+04 1.1e+04 -1.725 0.085 -4.07e+04 2638.468 Neighborhood_NWAmes -1.758e+04 1.17e+04 -1.503 0.133 -4.06e+04 5398.718 Neighborhood_OldTown -2.134e+04 1.36e+04 -1.574 0.116 -4.8e+04 5285.192 MSSubClass_120 -4919.5792 5835.088 -0.843 0.400 -1.64e+04 6540.373 MasVnrType_None 1877.7729 1.03e+04 0.183 0.855 -1.83e+04 2.2e+04 LotShape_IR2 -4161.3813 7913.901 -0.526 0.599 -1.97e+04 1.14e+04 Electrical_FuseF 1640.2051 9871.385 0.166 0.868 -1.77e+04 2.1e+04 LotConfig_Inside 1312.2090 2789.788 0.470 0.638 -4166.857 6791.275 HouseStyle_SLvl 7730.2562 1.87e+04 0.414 0.679 -2.9e+04 4.44e+04 Condition1_RRNn 1.094e+04 2.31e+04 0.473 0.637 -3.45e+04 5.64e+04 Fence_MnWw 7042.9008 1.18e+04 0.595 0.552 -1.62e+04 3.03e+04 RoofStyle_Gable 2.333e+04 3.12e+04 0.747 0.455 -3.8e+04 8.47e+04 RoofStyle_Hip 2.275e+04 3.14e+04 0.725 0.469 -3.89e+04 8.44e+04 Exterior2nd_MetalSd -6478.9963 1.83e+04 -0.355 0.723 -4.24e+04 2.94e+04 Exterior1st_MetalSd 6606.5230 1.87e+04 0.353 0.724 -3.01e+04 4.33e+04 GarageFinish_RFn -1108.8461 2939.761 -0.377 0.706 -6882.455 4664.763 Exterior1st_VinylSd -2319.6144 1.78e+04 -0.130 0.896 -3.73e+04 3.26e+04 SaleCondition_Partial 8491.0255 1.78e+04 0.478 0.633 -2.64e+04 4.34e+04 BsmtFinType1_Unf 2562.9484 4361.698 0.588 0.557 -6003.306 1.11e+04 SaleCondition_Family -2426.8889 9360.041 -0.259 0.796 -2.08e+04 1.6e+04 Heating_OthW 6511.6278 2.17e+04 0.300 0.764 -3.61e+04 4.91e+04 OverallCond_5 -4417.1357 2.04e+04 -0.216 0.829 -4.46e+04 3.57e+04 MiscFeature_Shed 2800.6196 8250.910 0.339 0.734 -1.34e+04 1.9e+04 BsmtFinType1_BLQ 1630.6622 4153.197 0.393 0.695 -6526.102 9787.426 Exterior2nd_Stone -9491.1421 3.1e+04 -0.306 0.760 -7.04e+04 5.14e+04 SaleType_ConLI -4107.2633 1.68e+04 -0.245 0.807 -3.71e+04 2.88e+04 SaleType_WD -1345.2472 5954.210 -0.226 0.821 -1.3e+04 1.03e+04 LandContour_Lvl -2009.9204 6245.764 -0.322 0.748 -1.43e+04 1.03e+04 Fence_GdWo 3860.9759 8073.026 0.478 0.633 -1.2e+04 1.97e+04 BsmtFullBath 1133.4017 2988.423 0.379 0.705 -4735.778 7002.582 Condition1_PosA 4285.8753 1.84e+04 0.232 0.816 -3.19e+04 4.05e+04 Exterior2nd_Wd Shng 2344.0078 1.61e+04 0.146 0.884 -2.93e+04 3.4e+04 MiscVal -1.3167 9.170 -0.144 0.886 -19.326 16.693 ExterCond_Gd 2.347e+04 2.32e+04 1.011 0.313 -2.21e+04 6.91e+04 ExterCond_TA 2.309e+04 2.33e+04 0.991 0.322 -2.27e+04 6.89e+04 Fence_Not Available 2671.0758 6026.979 0.443 0.658 -9165.745 1.45e+04 Fence_MnPrv 2460.5655 6530.466 0.377 0.706 -1.04e+04 1.53e+04 LandContour_HLS -1311.2213 7976.945 -0.164 0.869 -1.7e+04 1.44e+04 HouseStyle_2.5Unf -4611.3336 3.82e+04 -0.121 0.904 -7.97e+04 7.04e+04 HeatingQC_Po 6149.3821 3.09e+04 0.199 0.842 -5.46e+04 6.69e+04 Exterior2nd_Wd Sdng -627.0256 1.57e+04 -0.040 0.968 -3.14e+04 3.01e+04 Condition1_RRNe 3696.5835 2.57e+04 0.144 0.885 -4.67e+04 5.41e+04 Condition2_Feedr 8869.8999 3.89e+04 0.228 0.820 -6.76e+04 8.54e+04 Exterior2nd_Stucco -129.0002 1.79e+04 -0.007 0.994 -3.54e+04 3.51e+04 Condition1_RRAn 1177.6678 1e+04 0.117 0.907 -1.85e+04 2.09e+04 Condition1_Feedr 438.8624 7746.471 0.057 0.955 -1.48e+04 1.57e+04 HouseStyle_2Story 626.6116 1.09e+04 0.057 0.954 -2.08e+04 2.21e+04 OverallQual_6 -3.426e+04 3.77e+04 -0.908 0.364 -1.08e+05 3.99e+04 OverallQual_4 -3.383e+04 3.75e+04 -0.902 0.367 -1.07e+05 3.98e+04 BldgType_2fmCon 7399.6033 3.52e+04 0.210 0.834 -6.17e+04 7.65e+04 MiscFeature_Othr 6572.0973 2.9e+04 0.226 0.821 -5.04e+04 6.36e+04 Exterior1st_BrkComm 6741.5577 3.86e+04 0.175 0.861 -6.91e+04 8.26e+04 HeatingQC_Fa -939.0310 7228.257 -0.130 0.897 -1.51e+04 1.33e+04 MSSubClass_40 -2922.1790 2.09e+04 -0.140 0.889 -4.39e+04 3.81e+04 Alley_Pave 1220.1383 9798.611 0.125 0.901 -1.8e+04 2.05e+04 Exterior1st_Wd Sdng 1981.2473 1.63e+04 0.122 0.903 -3e+04 3.39e+04 OverallCond_4 -2290.6565 2.09e+04 -0.109 0.913 -4.34e+04 3.88e+04 Exterior2nd_HdBoard -1351.0716 1.64e+04 -0.082 0.934 -3.36e+04 3.09e+04 BldgType_Twnhs 666.3158 6918.098 0.096 0.923 -1.29e+04 1.43e+04 Alley_Not Available 414.5243 6497.530 0.064 0.949 -1.23e+04 1.32e+04 GarageFinish_Unf -235.5271 3712.022 -0.063 0.949 -7525.836 7054.782 EnclosedPorch 1.2126 18.825 0.064 0.949 -35.759 38.184 RoofMatl_WdShake 1016.2752 2.72e+04 0.037 0.970 -5.25e+04 5.45e+04 MasVnrType_BrkFace -250.6285 1.03e+04 -0.024 0.981 -2.06e+04 2.01e+04 Exterior2nd_CmentBd 595.0714 2.6e+04 0.023 0.982 -5.05e+04 5.17e+04 PoolQC_Fa 0 0 nan nan 0 0 ============================================================================== Omnibus: 165.594 Durbin-Watson: 2.012 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1631.531 Skew: 0.580 Prob(JB): 0.00 Kurtosis: 9.728 Cond. No. 1.14e+16 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 7.97e-22. This might indicate that there are strong multicollinearity problems or that the design matrix is singular. Selected Features in Forward Stepwise Regression: ['GarageArea', 'ExterQual_TA', 'Fireplaces', 'OverallQual_10', 'OverallQual_9', 'OverallQual_8', 'FullBath', 'BsmtFinSF1', 'Condition2_PosN', 'LotArea', 'HalfBath', 'SaleType_New', 'OverallQual_7', 'Neighborhood_NoRidge', 'BsmtUnfSF', 'GarageType_BuiltIn', 'Neighborhood_Crawfor', 'BsmtFinType2_Unf', 'MSSubClass_60', 'OverallCond_7', 'BsmtExposure_Gd', 'OverallCond_3', 'KitchenAbvGr', 'Neighborhood_StoneBr', 'Exterior1st_BrkFace', 'OverallCond_8', 'PoolArea', 'MasVnrArea', 'PoolQC_Not Available', 'PoolQC_Gd', 'Foundation_CBlock', 'OverallCond_6', 'BsmtFinType1_GLQ', 'BsmtQual_Gd', 'SaleCondition_Normal', 'MSSubClass_70', 'BsmtFinType2_GLQ', 'Street_Pave', 'ExterQual_Gd', 'ExterQual_Fa', 'SaleCondition_Alloca', 'LandContour_Low', 'RoofMatl_WdShngl', 'MasVnrType_Stone', 'LotFrontage', 'LotConfig_CulDSac', 'MSSubClass_30', 'Neighborhood_Gilbert', 'Exterior2nd_VinylSd', 'MSZoning_FV', 'WoodDeckSF', 'ScreenPorch', 'Exterior1st_WdShing', 'Neighborhood_SawyerW', 'Neighborhood_NridgHt', 'OverallCond_9', 'Condition2_PosA', 'Neighborhood_BrkSide', 'Condition1_Norm', 'BedroomAbvGr', 'BsmtQual_TA', 'BsmtQual_Fa', 'RoofMatl_Tar&Grv', 'RoofStyle_Mansard', 'Neighborhood_SWISU', 'Neighborhood_CollgCr', 'BsmtFinType2_BLQ', 'BsmtCond_TA', 'Neighborhood_Veenker', 'Exterior1st_CemntBd', 'Exterior1st_HdBoard', 'BsmtFinType2_Rec', 'Foundation_Wood', 'GarageType_Basment', 'Exterior2nd_Plywood', 'KitchenQual_Gd', 'KitchenQual_TA', 'KitchenQual_Fa', 'SaleType_ConLD', 'MSSubClass_190', 'Heating_Grav', 'GarageQual_Fa', '3SsnPorch', 'Functional_Min2', 'MSSubClass_90', 'MSSubClass_50', 'Neighborhood_NPkVill', 'Neighborhood_BrDale', 'LotConfig_FR3', 'Exterior2nd_Other', 'Exterior1st_ImStucc', 'GarageCond_TA', 'Neighborhood_Edwards', 'LotConfig_FR2', 'PavedDrive_P', 'SaleCondition_AdjLand', 'BsmtCond_Gd', 'RoofStyle_Gambrel', 'SaleType_Oth', 'MSSubClass_85', 'LotShape_Reg', 'Functional_Min1', 'SaleType_CWD', 'Functional_Maj2', 'Exterior2nd_AsphShn', 'Neighborhood_Sawyer', 'Condition1_RRAe', 'BsmtFinType1_Rec', 'Heating_Wall', 'HeatingQC_TA', 'Exterior1st_Stucco', 'GarageQual_Gd', 'GarageType_Detchd', 'FireplaceQu_Po', 'Electrical_SBrkr', 'Exterior1st_Plywood', 'RoofMatl_Membran', 'Neighborhood_ClearCr', 'SaleType_Con', 'OpenPorchSF', 'Exterior2nd_Brk Cmn', 'Functional_Mod', 'Neighborhood_Somerst', 'Neighborhood_Timber', 'HouseStyle_2.5Fin', 'GarageCond_Fa', 'MSZoning_RL', 'FireplaceQu_TA', 'Neighborhood_Blueste', 'BsmtExposure_Mn', 'BsmtExposure_No', 'MSSubClass_80', 'BsmtFinType2_Not Available', 'BsmtQual_Not Available', 'BsmtFinType2_LwQ', 'SaleType_ConLw', 'Condition1_PosN', 'MSZoning_RH', 'Condition2_RRNn', 'HouseStyle_1Story', 'MSSubClass_160', 'HouseStyle_SFoyer', 'ExterCond_Fa', 'Functional_Typ', 'Exterior2nd_CBlock', 'Neighborhood_MeadowV', 'MSSubClass_45', 'GarageType_Attchd', 'GarageType_CarPort', 'GarageType_Not Available', 'MSZoning_RM', 'BsmtFinType1_LwQ', 'GarageCond_Po', 'GarageCond_Gd', 'Exterior2nd_BrkFace', 'CentralAir_Y', 'Foundation_Stone', 'OverallQual_2', 'Neighborhood_Mitchel', 'Exterior2nd_ImStucc', 'Heating_GasW', 'Foundation_PConc', 'LotShape_IR3', 'FireplaceQu_Fa', 'Foundation_Slab', 'MasVnrType_Not Available', 'OverallQual_5', 'FireplaceQu_Gd', 'FireplaceQu_Not Available', 'Condition2_Norm', 'OverallQual_3', 'MSSubClass_180', 'PavedDrive_Y', 'BsmtHalfBath', 'MSSubClass_75', 'RoofMatl_Roll', 'GarageQual_TA', 'Neighborhood_IDOTRR', 'Electrical_FuseP', 'Electrical_Mix', 'HeatingQC_Gd', 'Neighborhood_NAmes', 'Neighborhood_NWAmes', 'Neighborhood_OldTown', 'MSSubClass_120', 'MasVnrType_None', 'LotShape_IR2', 'Electrical_FuseF', 'LotConfig_Inside', 'HouseStyle_SLvl', 'Condition1_RRNn', 'Fence_MnWw', 'RoofStyle_Gable', 'RoofStyle_Hip', 'Exterior2nd_MetalSd', 'Exterior1st_MetalSd', 'GarageFinish_RFn', 'Exterior1st_VinylSd', 'SaleCondition_Partial', 'BsmtFinType1_Unf', 'SaleCondition_Family', 'Heating_OthW', 'OverallCond_5', 'MiscFeature_Shed', 'BsmtFinType1_BLQ', 'Exterior2nd_Stone', 'SaleType_ConLI', 'SaleType_WD', 'LandContour_Lvl', 'Fence_GdWo', 'BsmtFullBath', 'Condition1_PosA', 'Exterior2nd_Wd Shng', 'MiscVal', 'ExterCond_Gd', 'ExterCond_TA', 'Fence_Not Available', 'Fence_MnPrv', 'LandContour_HLS', 'HouseStyle_2.5Unf', 'HeatingQC_Po', 'Exterior2nd_Wd Sdng', 'Condition1_RRNe', 'Condition2_Feedr', 'Exterior2nd_Stucco', 'Condition1_RRAn', 'Condition1_Feedr', 'HouseStyle_2Story', 'OverallQual_6', 'OverallQual_4', 'BldgType_2fmCon', 'MiscFeature_Othr', 'Exterior1st_BrkComm', 'HeatingQC_Fa', 'MSSubClass_40', 'Alley_Pave', 'Exterior1st_Wd Sdng', 'OverallCond_4', 'Exterior2nd_HdBoard', 'BldgType_Twnhs', 'Alley_Not Available', 'GarageFinish_Unf', 'EnclosedPorch', 'RoofMatl_WdShake', 'MasVnrType_BrkFace', 'Exterior2nd_CmentBd', 'PoolQC_Fa']
The number of predictors decreased from previously 264 (after dummification of categorical features) into 246. We will now build another MLR model with the combination of the best features and evaluate the model performance against the test set.
# Subsetting the Predictors using the best combination of features
X_train_sm = X_train[selected_features_forward]
X_test_sm = X_test[selected_features_forward]
model_sm = LinearRegression()
model_sm.fit(X_train_sm,y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
# Creating Prediction on test set
y_pred_sm = model_sm.predict(X_test_sm)
# Calculate performance metrics
mae = mean_absolute_error(y_test, y_pred_sm)
mse = mean_squared_error(y_test, y_pred_sm)
rmse = mean_squared_error(y_test, y_pred_sm, squared=False)
r2 = r2_score(y_test, y_pred_sm)
# Display the performance metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)
Mean Absolute Error (MAE): 24104.967342728156 Mean Squared Error (MSE): 3408762712.2648325 Root Mean Squared Error (RMSE): 58384.610234759915 R-squared (R²): 0.47485534606607094
The MLR model built using features as a result from feature selection of stepwise regression using statsmodel package returned a relatively same model performance. R-Squared value increased only around 0.1% from the model without stepwise regression feature selection.
We will now try to use mlextend package to perform feature selection and compare the model performance.
6.2 Forward Stepwise Regression using mlextend
Package
The MLR model built using the features selected from the Stepwise Regression using mlextend
package will be referred to as model_ml
**Please note that below code cell takes approximately 893 seconds to run ~ 15 mins**
# Record the start time
start_time = time.time()
# Stepwise forward feature selection
model_forward = LinearRegression()
sfs_forward = SequentialFeatureSelector(model_forward,
k_features='best',
forward=True,
scoring='r2',
cv=5)
sfs_forward.fit(X_train, y_train)
selected_features_forward = list(X_train.columns[list(sfs_forward.k_feature_idx_)])
print("Forward Stepwise Regression:")
print("Selected Features:", selected_features_forward)
# Record the end time
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print("")
print(f"Elapsed time: {elapsed_time:.6f} seconds")
Forward Stepwise Regression: Selected Features: ['const', 'LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'Fireplaces', 'GarageArea', 'WoodDeckSF', '3SsnPorch', 'ScreenPorch', 'MSZoning_FV', 'MSZoning_RL', 'Street_Pave', 'LandContour_Low', 'LotConfig_CulDSac', 'LotConfig_FR2', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_StoneBr', 'Condition1_Norm', 'Condition1_RRAn', 'Condition2_Norm', 'Condition2_PosN', 'Condition2_RRNn', 'BldgType_TwnhsE', 'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2Story', 'HouseStyle_SLvl', 'RoofStyle_Mansard', 'RoofMatl_WdShngl', 'Exterior1st_BrkFace', 'Exterior1st_Wd Sdng', 'Exterior1st_WdShing', 'Exterior2nd_Brk Cmn', 'Exterior2nd_CmentBd', 'Exterior2nd_Other', 'Exterior2nd_Stucco', 'Exterior2nd_VinylSd', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA', 'ExterCond_Fa', 'Foundation_PConc', 'Foundation_Wood', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtExposure_Gd', 'BsmtExposure_Mn', 'BsmtExposure_No', 'BsmtFinType1_BLQ', 'BsmtFinType1_GLQ', 'BsmtFinType1_LwQ', 'BsmtFinType2_BLQ', 'BsmtFinType2_GLQ', 'BsmtFinType2_Unf', 'Heating_Grav', 'Heating_Wall', 'HeatingQC_Gd', 'HeatingQC_Po', 'HeatingQC_TA', 'Electrical_FuseF', 'Electrical_Mix', 'KitchenQual_Gd', 'Functional_Min1', 'FireplaceQu_Not Available', 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Not Available', 'GarageFinish_Not Available', 'Fence_MnWw', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_New', 'SaleCondition_Normal', 'MSSubClass_30', 'MSSubClass_85', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_190', 'OverallQual_6', 'OverallQual_7', 'OverallQual_8', 'OverallQual_9', 'OverallQual_10', 'OverallCond_3', 'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7'] Elapsed time: 893.863827 seconds
len(selected_features_forward)
104
The number of predictors decreased from previously 264 (after dummification of categorical features) into 103. We will now build another MLR model with the combination of the best features and evaluate the model performance against the test set.
features = selected_features_forward[1:]
X_train_ml = X_train[features]
X_test_ml = X_test[features]
model_ml = LinearRegression()
model_ml.fit(X_train_ml,y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
# Creating Prediction on test set
y_pred_ml = model_ml.predict(X_test_ml)
# Calculate performance metrics
mae = mean_absolute_error(y_test, y_pred_ml)
mse = mean_squared_error(y_test, y_pred_ml)
rmse = mean_squared_error(y_test, y_pred_ml, squared=False)
r2 = r2_score(y_test, y_pred_ml)
# Display the performance metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)
Mean Absolute Error (MAE): 22863.972546783436 Mean Squared Error (MSE): 2491410398.651759 Root Mean Squared Error (RMSE): 49914.03007824312 R-squared (R²): 0.6161801327795906
The features resulted from the mlextend
package stepwise regression returned an even higher R-squared value from the base MLR and the statsmodel feature selection MLR with around 32% increase from the base MLR and 31% from the statsmodel feature selection MLR.
Other than the feature selection using stepwise regression and VIF that previously discussed, we will now try to fit another MLR model using Polynomial of degree 2 using all of the features. This is due to the fact that there are couples of numerical predictor that seemingly have quadratic relationship with the SalePrice. This model will be referred to as model_pl
.
7.1 Preparing Dataset for Polynomial MLR
quadratic = PolynomialFeatures(2, include_bias=False)
X_train_quadratic = quadratic.fit_transform(X_train.drop(columns='const'))
X_test_quadratic = quadratic.transform(X_test)
7.2 Fitting the Linear Model on the Transformed Features
model_pl = LinearRegression()
model_pl.fit(X_train_quadratic, y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
7.3 Evaluating the Model Performance on Test Set
y_pred_pl = model_pl.predict(X_test_quadratic)
# Calculate performance metrics
mae = mean_absolute_error(y_test, y_pred_pl)
mse = mean_squared_error(y_test, y_pred_pl)
rmse = mean_squared_error(y_test, y_pred_pl, squared=False)
r2 = r2_score(y_test, y_pred_pl)
# Display the performance metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)
Mean Absolute Error (MAE): 78113.00830424526 Mean Squared Error (MSE): 195076371198.62335 Root Mean Squared Error (RMSE): 441674.5082055601 R-squared (R²): -29.05293183805185
The model_pl
fitted using Polynomial of Degree 2 MLR does not perform very well indicated by the much higher error compared to the previous three models. It might implies that the features and its relationship with the response could not be well explained using the Polynomial of Degree 2 (quadratic).
We have evaluated the performance of three different regression models: base_model
, model_sm
, and model_ml
. Here is a summary of their performance metrics:
Model | Mean Absolute Error (MAE) | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | R-squared (R²) |
---|---|---|---|---|
base_model | 24125.98 | 3411215062.30 | 58405.61 | 0.47 |
model_sm | 24104.97 | 3408762712.26 | 58384.61 | 0.47 |
model_ml | 22863.97 | 2491410398.65 | 49914.03 | 0.62 |
model_pl | 78113.00 | 195076371198.6 | 441674.50 | -29.0 |
Among these models, the model_ml
exhibits the lowest Mean Absolute Error and Root Mean Squared Error, along with the highest R-squared value. This suggests that the model_ml
performs better compared to the other two models in terms of predictive accuracy.
The features used in by the best model are as follow:
best_features = selected_features_forward[1:]
print(best_features)
['LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'Fireplaces', 'GarageArea', 'WoodDeckSF', '3SsnPorch', 'ScreenPorch', 'MSZoning_FV', 'MSZoning_RL', 'Street_Pave', 'LandContour_Low', 'LotConfig_CulDSac', 'LotConfig_FR2', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_StoneBr', 'Condition1_Norm', 'Condition1_RRAn', 'Condition2_Norm', 'Condition2_PosN', 'Condition2_RRNn', 'BldgType_TwnhsE', 'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2Story', 'HouseStyle_SLvl', 'RoofStyle_Mansard', 'RoofMatl_WdShngl', 'Exterior1st_BrkFace', 'Exterior1st_Wd Sdng', 'Exterior1st_WdShing', 'Exterior2nd_Brk Cmn', 'Exterior2nd_CmentBd', 'Exterior2nd_Other', 'Exterior2nd_Stucco', 'Exterior2nd_VinylSd', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA', 'ExterCond_Fa', 'Foundation_PConc', 'Foundation_Wood', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtExposure_Gd', 'BsmtExposure_Mn', 'BsmtExposure_No', 'BsmtFinType1_BLQ', 'BsmtFinType1_GLQ', 'BsmtFinType1_LwQ', 'BsmtFinType2_BLQ', 'BsmtFinType2_GLQ', 'BsmtFinType2_Unf', 'Heating_Grav', 'Heating_Wall', 'HeatingQC_Gd', 'HeatingQC_Po', 'HeatingQC_TA', 'Electrical_FuseF', 'Electrical_Mix', 'KitchenQual_Gd', 'Functional_Min1', 'FireplaceQu_Not Available', 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Not Available', 'GarageFinish_Not Available', 'Fence_MnWw', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_New', 'SaleCondition_Normal', 'MSSubClass_30', 'MSSubClass_85', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_190', 'OverallQual_6', 'OverallQual_7', 'OverallQual_8', 'OverallQual_9', 'OverallQual_10', 'OverallCond_3', 'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7']
We will now try to predict the House Sale Price using the feature information given in the test set.
9.1 Importing and Adjusting Dataset
test_set = pd.read_csv('test.csv')
test_set.set_index('Id', inplace=True)
test_set[['MSSubClass', 'OverallQual', 'OverallCond']] = test_set[['MSSubClass', 'OverallQual', 'OverallCond']].astype('object')
9.2 Handling Missing Values
numerical_columns = test_set.select_dtypes(include=['int64', 'float64'])
categorical_columns = test_set.select_dtypes(include=['object'])
categorical_columns.isna().sum()
MSSubClass 0 MSZoning 4 Street 0 Alley 1352 LotShape 0 LandContour 0 Utilities 2 LotConfig 0 LandSlope 0 Neighborhood 0 Condition1 0 Condition2 0 BldgType 0 HouseStyle 0 OverallQual 0 OverallCond 0 RoofStyle 0 RoofMatl 0 Exterior1st 1 Exterior2nd 1 MasVnrType 16 ExterQual 0 ExterCond 0 Foundation 0 BsmtQual 44 BsmtCond 45 BsmtExposure 44 BsmtFinType1 42 BsmtFinType2 42 Heating 0 HeatingQC 0 CentralAir 0 Electrical 0 KitchenQual 1 Functional 2 FireplaceQu 730 GarageType 76 GarageFinish 78 GarageQual 78 GarageCond 78 PavedDrive 0 PoolQC 1456 Fence 1169 MiscFeature 1408 SaleType 1 SaleCondition 0 dtype: int64
not_available = ['Alley','BsmtQual','BsmtCond','BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
'GarageType','GarageFinish', 'GarageQual', 'GarageCond','PoolQC', 'Fence', 'MiscFeature']
for col in not_available:
categorical_columns[col].fillna('Not Available', inplace=True)
categorical_imputer = SimpleImputer(strategy='most_frequent')
real_missing = ['MSZoning', 'Utilities', 'Exterior1st','Exterior2nd', 'MasVnrType', 'KitchenQual', 'Functional', 'SaleType']
categorical_columns[real_missing] = categorical_imputer.fit_transform(categorical_columns[real_missing])
for col in categorical_columns:
num_categories = categorical_columns[col].unique()
print(f"Column '{col}' has unique categories: {num_categories}")
Column 'MSSubClass' has unique categories: [20 60 120 160 80 30 50 90 85 190 45 70 75 180 40 150] Column 'MSZoning' has unique categories: ['RH' 'RL' 'RM' 'FV' 'C (all)'] Column 'Street' has unique categories: ['Pave' 'Grvl'] Column 'Alley' has unique categories: ['Not Available' 'Pave' 'Grvl'] Column 'LotShape' has unique categories: ['Reg' 'IR1' 'IR2' 'IR3'] Column 'LandContour' has unique categories: ['Lvl' 'HLS' 'Bnk' 'Low'] Column 'Utilities' has unique categories: ['AllPub'] Column 'LotConfig' has unique categories: ['Inside' 'Corner' 'FR2' 'CulDSac' 'FR3'] Column 'LandSlope' has unique categories: ['Gtl' 'Mod' 'Sev'] Column 'Neighborhood' has unique categories: ['NAmes' 'Gilbert' 'StoneBr' 'BrDale' 'NPkVill' 'NridgHt' 'Blmngtn' 'NoRidge' 'Somerst' 'SawyerW' 'Sawyer' 'NWAmes' 'OldTown' 'BrkSide' 'ClearCr' 'SWISU' 'Edwards' 'CollgCr' 'Crawfor' 'Blueste' 'IDOTRR' 'Mitchel' 'Timber' 'MeadowV' 'Veenker'] Column 'Condition1' has unique categories: ['Feedr' 'Norm' 'PosN' 'RRNe' 'Artery' 'RRNn' 'PosA' 'RRAn' 'RRAe'] Column 'Condition2' has unique categories: ['Norm' 'Feedr' 'PosA' 'PosN' 'Artery'] Column 'BldgType' has unique categories: ['1Fam' 'TwnhsE' 'Twnhs' 'Duplex' '2fmCon'] Column 'HouseStyle' has unique categories: ['1Story' '2Story' 'SLvl' '1.5Fin' 'SFoyer' '2.5Unf' '1.5Unf'] Column 'OverallQual' has unique categories: [5 6 8 7 4 9 2 3 10 1] Column 'OverallCond' has unique categories: [6 5 7 8 2 9 3 4 1] Column 'RoofStyle' has unique categories: ['Gable' 'Hip' 'Gambrel' 'Flat' 'Mansard' 'Shed'] Column 'RoofMatl' has unique categories: ['CompShg' 'Tar&Grv' 'WdShake' 'WdShngl'] Column 'Exterior1st' has unique categories: ['VinylSd' 'Wd Sdng' 'HdBoard' 'Plywood' 'MetalSd' 'CemntBd' 'WdShing' 'BrkFace' 'AsbShng' 'BrkComm' 'Stucco' 'AsphShn' 'CBlock'] Column 'Exterior2nd' has unique categories: ['VinylSd' 'Wd Sdng' 'HdBoard' 'Plywood' 'MetalSd' 'Brk Cmn' 'CmentBd' 'ImStucc' 'Wd Shng' 'AsbShng' 'Stucco' 'CBlock' 'BrkFace' 'AsphShn' 'Stone'] Column 'MasVnrType' has unique categories: ['None' 'BrkFace' 'Stone' 'BrkCmn'] Column 'ExterQual' has unique categories: ['TA' 'Gd' 'Ex' 'Fa'] Column 'ExterCond' has unique categories: ['TA' 'Gd' 'Fa' 'Po' 'Ex'] Column 'Foundation' has unique categories: ['CBlock' 'PConc' 'BrkTil' 'Stone' 'Slab' 'Wood'] Column 'BsmtQual' has unique categories: ['TA' 'Gd' 'Ex' 'Fa' 'Not Available'] Column 'BsmtCond' has unique categories: ['TA' 'Po' 'Fa' 'Gd' 'Not Available'] Column 'BsmtExposure' has unique categories: ['No' 'Gd' 'Mn' 'Av' 'Not Available'] Column 'BsmtFinType1' has unique categories: ['Rec' 'ALQ' 'GLQ' 'Unf' 'BLQ' 'LwQ' 'Not Available'] Column 'BsmtFinType2' has unique categories: ['LwQ' 'Unf' 'Rec' 'BLQ' 'GLQ' 'ALQ' 'Not Available'] Column 'Heating' has unique categories: ['GasA' 'GasW' 'Grav' 'Wall'] Column 'HeatingQC' has unique categories: ['TA' 'Gd' 'Ex' 'Fa' 'Po'] Column 'CentralAir' has unique categories: ['Y' 'N'] Column 'Electrical' has unique categories: ['SBrkr' 'FuseA' 'FuseF' 'FuseP'] Column 'KitchenQual' has unique categories: ['TA' 'Gd' 'Ex' 'Fa'] Column 'Functional' has unique categories: ['Typ' 'Min2' 'Min1' 'Mod' 'Maj1' 'Sev' 'Maj2'] Column 'FireplaceQu' has unique categories: ['Not Available' 'TA' 'Gd' 'Po' 'Fa' 'Ex'] Column 'GarageType' has unique categories: ['Attchd' 'Detchd' 'BuiltIn' 'Not Available' 'Basment' '2Types' 'CarPort'] Column 'GarageFinish' has unique categories: ['Unf' 'Fin' 'RFn' 'Not Available'] Column 'GarageQual' has unique categories: ['TA' 'Not Available' 'Fa' 'Gd' 'Po'] Column 'GarageCond' has unique categories: ['TA' 'Not Available' 'Fa' 'Gd' 'Po' 'Ex'] Column 'PavedDrive' has unique categories: ['Y' 'N' 'P'] Column 'PoolQC' has unique categories: ['Not Available' 'Ex' 'Gd'] Column 'Fence' has unique categories: ['MnPrv' 'Not Available' 'GdPrv' 'GdWo' 'MnWw'] Column 'MiscFeature' has unique categories: ['Not Available' 'Gar2' 'Shed' 'Othr'] Column 'SaleType' has unique categories: ['WD' 'COD' 'New' 'ConLD' 'Oth' 'Con' 'ConLw' 'ConLI' 'CWD'] Column 'SaleCondition' has unique categories: ['Normal' 'Partial' 'Abnorml' 'Family' 'Alloca' 'AdjLand']
numerical_columns.isna().sum().sort_values(ascending=False)
LotFrontage 227 GarageYrBlt 78 MasVnrArea 15 BsmtHalfBath 2 BsmtFullBath 2 BsmtFinSF1 1 BsmtFinSF2 1 BsmtUnfSF 1 TotalBsmtSF 1 GarageCars 1 GarageArea 1 MoSold 0 WoodDeckSF 0 OpenPorchSF 0 3SsnPorch 0 EnclosedPorch 0 MiscVal 0 Fireplaces 0 PoolArea 0 ScreenPorch 0 HalfBath 0 TotRmsAbvGrd 0 KitchenAbvGr 0 BedroomAbvGr 0 LotArea 0 FullBath 0 GrLivArea 0 LowQualFinSF 0 2ndFlrSF 0 1stFlrSF 0 YearRemodAdd 0 YearBuilt 0 YrSold 0 dtype: int64
continous = ['LotFrontage','BsmtFinSF1','MasVnrArea','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF']
numerical_imputer = SimpleImputer(strategy='mean')
numerical_columns[continous] = numerical_imputer.fit_transform(numerical_columns[continous])
discrete = ['GarageYrBlt','BsmtHalfBath','BsmtFullBath','GarageCars','GarageArea']
numerical_imputer = SimpleImputer(strategy='most_frequent')
numerical_columns[discrete] = numerical_imputer.fit_transform(numerical_columns[discrete])
9.3 Preparing Dataset for Model Prediction
numerical_columns.reset_index(drop=True, inplace=True)
categorical_columns.reset_index(drop=True, inplace=True)
X_test = pd.concat([numerical_columns, categorical_columns], axis=1)
X_test = pd.get_dummies(X_test, drop_first=True)
# Identify missing columns from best_features in the test_set
missing_columns = [col for col in best_features if col not in X_test.columns]
# Add missing columns with zero values
for col in missing_columns:
X_test[col] = 0
X_test=X_test[best_features]
9.4 Predicting House Sale Price in Ames Test Set
y_pred = model_ml.predict(X_test)
price_prediction = pd.DataFrame({'Id': test_set.index, 'SalePrice': y_pred.tolist()})
price_prediction.to_csv('price_prediction.csv', index=False)
Based on the model development, it is apparent that Multiple Linear Regression fitted on the training set using mlextend
package that is scikit learning friendly returned the highest performance indicated by the highest R-Squared as well as the lowest Error measures compared to the other models.
The features used in the best model are as follows and were used to construct the predictive model to predict the house sale price in Ames, Iowa.
['LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'Fireplaces', 'GarageArea', 'WoodDeckSF', '3SsnPorch', 'ScreenPorch', 'MSZoning_FV', 'MSZoning_RL', 'Street_Pave', 'LandContour_Low', 'LotConfig_CulDSac', 'LotConfig_FR2', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW', 'Neighborhood_StoneBr', 'Condition1_Norm', 'Condition1_RRAn', 'Condition2_Norm', 'Condition2_PosN', 'Condition2_RRNn', 'BldgType_TwnhsE', 'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2Story', 'HouseStyle_SLvl', 'RoofStyle_Mansard', 'RoofMatl_WdShngl', 'Exterior1st_BrkFace', 'Exterior1st_Wd Sdng', 'Exterior1st_WdShing', 'Exterior2nd_Brk Cmn', 'Exterior2nd_CmentBd', 'Exterior2nd_Other', 'Exterior2nd_Stucco', 'Exterior2nd_VinylSd', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA', 'ExterCond_Fa', 'Foundation_PConc', 'Foundation_Wood', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtExposure_Gd', 'BsmtExposure_Mn', 'BsmtExposure_No', 'BsmtFinType1_BLQ', 'BsmtFinType1_GLQ', 'BsmtFinType1_LwQ', 'BsmtFinType2_BLQ', 'BsmtFinType2_GLQ', 'BsmtFinType2_Unf', 'Heating_Grav', 'Heating_Wall', 'HeatingQC_Gd', 'HeatingQC_Po', 'HeatingQC_TA', 'Electrical_FuseF', 'Electrical_Mix', 'KitchenQual_Gd', 'Functional_Min1', 'FireplaceQu_Not Available', 'GarageType_BuiltIn', 'GarageType_CarPort', 'GarageType_Not Available', 'GarageFinish_Not Available', 'Fence_MnWw', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_New', 'SaleCondition_Normal', 'MSSubClass_30', 'MSSubClass_85', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_190', 'OverallQual_6', 'OverallQual_7', 'OverallQual_8', 'OverallQual_9', 'OverallQual_10', 'OverallCond_3', 'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7']