The California School Dataset Exploration repository is a comprehensive analysis of the Caschool dataset from the R Ecdat package, containing information on California public schools in 1998. The dataset covers 420 schools and includes variables on student test scores, class sizes (student-teacher ratios), per-student expenditure, and various school and district characteristics.
The dataset is commonly used in educational research to investigate the factors influencing student achievement and evaluate the effectiveness of policies and interventions. It also serves as a valuable teaching tool for statistics and econometrics courses, providing real-world data to illustrate statistical and modeling techniques.
The primary objective of this repository is to explore the relationship between student test scores in California public schools and several factors using the Multiple Linear Regression method.
Before delving into the California School dataset in the Ecdat library, the relevant libraries need to be loaded first:
library(tidyverse)
library(Ecdat)
library(forecast)
library(reshape2)
library(MASS)
library(visualize)
data <- Caschool
str(data)
## 'data.frame': 420 obs. of 17 variables:
## $ distcod : int 75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
## $ county : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
## $ district: Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 132 270 53 152 383 263 94 ...
## $ grspan : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
## $ enrltot : int 195 240 1550 243 1335 137 195 888 379 2247 ...
## $ teachers: num 10.9 11.1 82.9 14 71.5 ...
## $ calwpct : num 0.51 15.42 55.03 36.48 33.11 ...
## $ mealpct : num 2.04 47.92 76.32 77.05 78.43 ...
## $ computer: int 67 101 169 85 171 25 28 66 35 0 ...
## $ testscr : num 691 661 644 648 641 ...
## $ compstu : num 0.344 0.421 0.109 0.35 0.128 ...
## $ expnstu : num 6385 5099 5502 7102 5236 ...
## $ str : num 17.9 21.5 18.7 17.4 18.7 ...
## $ avginc : num 22.69 9.82 8.98 8.98 9.08 ...
## $ elpct : num 0 4.58 30 0 13.86 ...
## $ readscr : num 692 660 636 652 642 ...
## $ mathscr : num 690 662 651 644 640 ...
Based on an examination of the dataset structure and a careful comparison with the dataset description, the classification of the variables is as follows:
1. Numeric Variables
These variables represent quantitative measurements and include enrltot, teachers, calwpct, mealpct, computer, testscr, compstu, expnstu, str, avginc, elpct, readscr, and mathscr.
2. Categorical Variables
These variables capture qualitative attributes and include distcod, county, district, and grspan.
This classification allows for a clear understanding of the dataset’s structure and facilitates further analysis and exploration.
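A quick way to cross-check this split programmatically (a small sketch; note that storage type alone is not decisive):
# Sketch: inspect each column's storage class as a starting point for the classification above
sapply(data, class)
# Note: distcod is stored as an integer but is a district identifier, so it is
# treated as categorical in the classification above despite its numeric storage type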
For the factor analysis using Multiple Linear Regression to determine which factors contribute to the student test score, the analysis will focus only on the top 16 counties within the dataset.
# Creating list of 16 most common counties
top_county <- data %>%
count(county, sort=TRUE)
top_county <- top_county[1:16,1]
# Filtering the dataset to only include observations from the 16 most common counties
data_filtered <- data %>%
filter(county %in% top_county)
# Checking that the filtered dataset contains only the 16 most common counties
unique(data_filtered$county)
## [1] Fresno Kern Merced Tulare Los Angeles
## [6] San Diego San Bernardino San Mateo Santa Clara Santa Barbara
## [11] Orange Sonoma Humboldt Shasta El Dorado
## [16] Placer
## 45 Levels: Alameda Butte Calaveras Contra Costa El Dorado Fresno ... Yuba
Top 16 counties in the dataset are: Sonoma, Kern, Los Angeles, Tulare, San Diego, Santa Clara, Humboldt, San Mateo, Shasta, Fresno, Merced, Orange, Placer, Santa Barbara, El Dorado, and San Bernardino.
Partitioning the dataset is essential because it helps prevent overfitting, evaluate model performance, provide an unbiased sample, and support reproducibility. Overfitting occurs when a model is trained to fit the noise/random effect in the data rather than the underlying patterns or the real effect. Partitioning the data helps prevent this by allowing us to train the model on one subset of the data and validate it on another. Partitioning the data also ensures that the model is tested on data that it has not seen during training to ensure that it is generalizing well. Moreover, if we do not partition the data and use the entire dataset for analysis, we risk introducing biases into our results. Partitioning the data provides an unbiased sample to work with, ensuring the results are valid and reliable.
# Setting seed for reproducibility
set.seed(250)
# Randomly sampling 60% of the row indices (without replacement) for the training set
train_index <- sample(c(1:nrow(data_filtered)), nrow(data_filtered)*0.6)
# Partition the dataset into training and validation set based on the index sampling
train_df <- data_filtered[train_index, ]
valid_df <- data_filtered[-train_index, ]
# Resetting the index for both train_df and valid_df for ease of subsetting
rownames(train_df) <- NULL
rownames(valid_df) <- NULL
# Ensuring the partitioned dataset has been properly set
paste("number of rows for original dataset:", nrow(data_filtered))
## [1] "number of rows for original dataset: 271"
paste("number of rows for training set:", nrow(train_df))
## [1] "number of rows for training set: 162"
paste("number of rows for validation set:", nrow(valid_df))
## [1] "number of rows for validation set: 109"
It might be interesting to check the relationship between the average student reading score and the percentage of students receiving free/reduced-price meals.
ggplot(train_df, mapping=aes(x=mealpct, y=readscr))+
geom_point(color='steelblue', size=3)+
geom_smooth(method = "lm", se = FALSE, color='yellow2')+
theme_light()+
ggtitle("Scatterplot of Avg Reading Score vs % Reduced/Free Lunch Student") +
xlab("mealpct") +
ylab("readscr")
paste("Correlation between readscr and mealpct:", round(cor(train_df$readscr, train_df$mealpct),3))
## [1] "Correlation between readscr and mealpct: -0.897"
The scatterplot between readscr (average reading score) and mealpct (percentage of students on free and reduced lunch) reveals a negative linear relationship. As the percentage of students on free and reduced lunch increases, the average reading score tends to decrease. The trend line overlaid on the scatterplot provides a visual representation of this relationship, highlighting the overall trend in the data.
Interestingly, my initial expectations were opposite to these findings. I came across a health article discussing how students who have lunch may actually experience an increase in their overall scores. However, upon further investigation, it started to make sense that mealpct could serve as a proxy for poverty. A higher number of students eligible for free/reduced lunch could indicate a greater concentration of families experiencing poverty within that school or district. I intend to conduct additional research to explore this further and test my hypothesis.
It is important to note that correlation does not imply causation. While there appears to be a correlation between mealpct and lower average reading scores, it is crucial to investigate other factors and conduct more comprehensive research to draw conclusive insights.
# Calculating correlation value
paste("Correlation between readscr and mealpct:", round(cor(x = train_df$readscr, y = train_df$mealpct, use = "everything", method = c("pearson", "kendall", "spearman")),3))
## [1] "Correlation between readscr and mealpct: -0.897"
# Checking the correlation significance
cor.test(x=train_df$readscr, y=train_df$mealpct,
alternative = c("two.sided", "less", "greater"),
method = c("pearson", "kendall", "spearman"),
exact = NULL, conf.level = 0.95, continuity = FALSE)
##
## Pearson's product-moment correlation
##
## data: train_df$readscr and train_df$mealpct
## t = -25.719, df = 160, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9237243 -0.8624836
## sample estimates:
## cor
## -0.8973408
The correlation between readscr and mealpct is -0.897, indicating a strong negative relationship between the two variables.
The correlation is also significant at the 0.05 alpha level, as indicated by a p-value well below 0.05. This implies that we can safely reject the null hypothesis that the true correlation is equal to zero.
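As a side check (not part of the original code), the t-statistic reported by cor.test() can be reproduced from the correlation coefficient and the sample size via t = r * sqrt(n - 2) / sqrt(1 - r^2):
# Manual check: reproduce the cor.test() t-statistic from r and n
r <- cor(train_df$readscr, train_df$mealpct)
n <- nrow(train_df)                       # 162 observations, so df = n - 2 = 160
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
round(t_stat, 3)                          # should match the reported t = -25.719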
model_slr <- lm(readscr~mealpct, data=train_df)
summary(model_slr)
##
## Call:
## lm(formula = readscr ~ mealpct, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.444 -5.564 -0.221 5.685 20.250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 683.54413 1.33555 511.81 <2e-16 ***
## mealpct -0.64486 0.02507 -25.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.816 on 160 degrees of freedom
## Multiple R-squared: 0.8052, Adjusted R-squared: 0.804
## F-statistic: 661.4 on 1 and 160 DF, p-value: < 2.2e-16
# Checking which observation has the highest residual
unname(which.max(model_slr$residuals))
## [1] 73
# Checking the model residual for that particular observation
unname(model_slr$residuals[which.max(model_slr$residuals)]) %>% round(2)
## [1] 20.25
The index of the observation with the highest residual is 73; this index is then used to retrieve the corresponding row.
# Checking the average reading score for the observation with the highest residual
paste("The average reading score for the data with the highest residual in the model_slr:", round(train_df[73,"readscr"], 2))
## [1] "The average reading score for the data with the highest residual in the model_slr: 661.1"
# Checking for the fitted value for that particular observation
paste("Fitted value for observation with highest residual:",round(unname(model_slr$fitted.values[73]),2))
## [1] "Fitted value for observation with highest residual: 640.85"
# Calculating the residuals for that particular observation
highest_res <- (train_df[73,"readscr"]) - (model_slr$fitted.values[73])
paste("The highest residual in the model_slr is:", round(unname(highest_res),2))
## [1] "The highest residual in the model_slr is: 20.25"
The observation with the highest residual in model_slr occurred in the Ballico - Cressey Elementary district, Merced county. The code above returns the district's average reading score of about 661.1 and a fitted value of 640.85. The residual is obtained by subtracting the fitted value from the district's average reading score, giving 20.25, which matches the residual reported in the model's residual summary.
# Checking which observation has the lowest residual
unname(which.min(model_slr$residuals))
## [1] 41
# Checking the model residual for that particular observation
unname(model_slr$residuals[which.min(model_slr$residuals)]) %>% round(2)
## [1] -27.44
The index of the observation with the lowest residual is 41; this index is then used to retrieve the corresponding row.
# Checking the average reading score for the observation with the lowest residual
paste("The average reading score for the data with the lowest residual in the model_slr:", round(train_df[41,"readscr"], 2))
## [1] "The average reading score for the data with the lowest residual in the model_slr: 656.1"
# Checking for the fitted value for that particular observation
paste("Fitted value for observation with highest residual:",round(unname(model_slr$fitted.values[41]),2))
## [1] "Fitted value for observation with highest residual: 683.54"
# Calculating the residuals for that particular observation
lowest_res <- (train_df[41,"readscr"]) - (model_slr$fitted.values[41])
paste("The lowest residual in the model_slr is:", round(unname(lowest_res),2))
## [1] "The lowest residual in the model_slr is: -27.44"
The observation with the lowest residual in model_slr occurred in the Montgomery Elementary district, Sonoma county. The code above returns the district's average reading score of about 656.1 and a fitted value of 683.54. The residual is obtained by subtracting the fitted value from the district's average reading score, giving -27.44, which matches the residual reported in the model's residual summary.
mealpct does not perfectly predict the average reading score readscr, as shown by the model residuals: some observations exhibit very large positive or negative residuals, as shown below:
plot(model_slr, which=1)
This might happen because, in some school districts, mealpct may not be representative of the actual income situation of the students; the percentage of students eligible for free or reduced-price lunch may be higher or lower than it should be.
Based on the summary of model_slr, the regression equation would be (rounded up to only two decimal values):
readscr = 683.54 - 0.64*mealpct + error term
The error term represents the difference between the predicted value of the dependent variable and the actual value of the dependent variable.
The equation above indicates that for every one-unit (one percentage point) increase in mealpct, readscr decreases by 0.64 points, holding the other variables constant (a condition that is moot here, since this simple linear regression has only one predictor).
Using the linear regression equation above, I will plug the first row of train_df$mealpct in as the input and interpret the output. The following code predicts the output for a single input using model_slr:
# Assigning sample point for prediction
sample_point <- train_df$mealpct[1]
paste("Sample point:", sample_point %>% round(2))
## [1] "Sample point: 19.83"
# Checking the sample point actual average reading score
paste("Sample point actual average reading score:", train_df$readscr[1] %>% round(2))
## [1] "Sample point actual average reading score: 669.8"
# Checking the sample point residual in the model_slr
paste("Sample point residual/error term:", unname(model_slr$residuals[1]) %>% round(2))
## [1] "Sample point residual/error term: -0.96"
# Plugging sample point into regression equation to get the predicted output
new_value_pred <- predict(model_slr, data.frame(mealpct = sample_point))
paste("Prediction of the average reading score:", new_value_pred %>% round(2))
## [1] "Prediction of the average reading score: 670.76"
Plugging the percentage of students receiving free/reduced-price meals (mealpct) of 19.83 into the regression equation readscr = 683.54 - 0.64*mealpct + ɛ returns a predicted average reading score of 670.76.
Including the error term, the calculation can be written as readscr = 683.54 - 0.64*19.83 - 0.96 ≈ 669.89 (done by hand), which is close to the observed reading score of 669.8; the small remaining difference comes from rounding the coefficients to two decimals.
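To confirm that the gap is purely a rounding artifact, the prediction can be repeated with the unrounded coefficients (a small check, not in the original code):
# Recomputing the prediction with full-precision coefficients
manual_pred <- coef(model_slr)[1] + coef(model_slr)[2] * sample_point
round(unname(manual_pred), 2)                             # matches the predict() output of 670.76
# Adding the residual back recovers the observed reading score (~669.8)
round(unname(manual_pred + model_slr$residuals[1]), 2)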
The accuracy() function from the forecast library is used to evaluate the model's prediction accuracy.
# Calculating the accuracy of the prediction made on the training dataset
train_predicted <- predict(model_slr, train_df)
accuracy_train_predicted <- data.frame(accuracy(train_predicted, train_df$readscr))
accuracy_train_predicted
## ME RMSE MAE MPE MAPE
## Test set -3.768505e-13 8.761614 6.961201 -0.01782601 1.062598
# Calculating the accuracy of the prediction made on the validation dataset
valid_predicted <- predict(model_slr, valid_df)
accuracy_valid_predicted <- data.frame(accuracy(valid_predicted, valid_df$readscr))
accuracy_valid_predicted
## ME RMSE MAE MPE MAPE
## Test set -0.04492851 9.16115 7.394335 -0.03206485 1.132964
# Combining accuracy output from training and validation dataset for better view and comparison
model_accuracy <- rbind(accuracy_train_predicted, accuracy_valid_predicted)
rownames(model_accuracy) <- c("Training Dataset", "Validation Dataset")
model_accuracy
## ME RMSE MAE MPE MAPE
## Training Dataset -3.768505e-13 8.761614 6.961201 -0.01782601 1.062598
## Validation Dataset -4.492851e-02 9.161150 7.394335 -0.03206485 1.132964
To evaluate how well a predictive model performs on new data, it is crucial to assess the model's accuracy on both the training set and the validation set, which can be done by comparing the error measures. From the output above, we can specifically compare the RMSE and MAE of the predictions made on the training and validation datasets. Both RMSE and MAE are higher when the model predicts on the validation dataset than on the training dataset; in other words, the model performs better on the training dataset than on the validation dataset.
The output above emphasizes that when the model is applied to a completely new dataset/points, the error measures will typically increase (although in some cases they can even decrease). This happens because the model was trained on the training dataset and may partly fit the noise or "random effect" specific to the training data rather than the noise found in the new data.
# Calculating readscr training dataset standard deviation
paste("Standard deviation of readscr on training dataset:", sd(train_df$readscr)%>%round(2))
## [1] "Standard deviation of readscr on training dataset: 19.91"
# RMSE of prediction made by the model on training dataset
paste("RMSE of model prediction on training dataset:", accuracy_train_predicted$RMSE %>% round(2))
## [1] "RMSE of model prediction on training dataset: 8.76"
Comparing the model’s RMSE to the standard deviation (SD) of the reading scores in the training set is a useful way to assess the model’s performance. If the RMSE is higher than the SD of the training set, it suggests that the model is not able to capture all the variability in the data and is making larger errors in prediction. This indicates that the model is underfitting the data and is not complex enough to capture the relationship between the predictors and the response variable. On the other hand, if the RMSE is lower than the SD of the training set, it suggests that the model is able to capture most of the variability in the data and is making relatively smaller errors in prediction. This indicates that the model is a good fit for the data and is not overfitting.
However, it is important to keep in mind that the SD of the training set represents the variability of the actual values in the training set, whereas the RMSE represents the average error of the predicted values compared to the actual values. Therefore, the two values are not directly comparable, and it is important to consider other metrics, such as the R-squared value and the residuals plot, when evaluating the model’s performance. While a lower RMSE than the SD of the training set is generally desirable, it is important to strike a balance between model complexity and accuracy. Overfitting the model by increasing its complexity can lead to a lower RMSE on the training set, but this may result in poorer performance on new data. Therefore, it is crucial to use a validation set to test the model’s performance on new data and select a model that strikes a good balance between complexity and accuracy.
In this case, the RMSE is lower than the standard deviation, which means the model is performing reasonably well and shows no clear indication of overfitting.
The two other outcome variables (mathscr and testscr) will be removed so that the dataset contains only one outcome variable, readscr.
# Removing mathscr and testscr from training dataset
train_df_update <- train_df %>%
dplyr::select(-testscr, -mathscr)
# Removing mathscr and testscr from validation dataset
valid_df_update <- valid_df %>%
dplyr::select(-testscr, -mathscr)
I have identified two variables with as many, or nearly as many, unique values as there are records in the dataset: distcod (162 unique values) and district (160 unique values). These two variables can be safely discarded from the updated training and validation datasets.
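The unique-value counts cited above can be verified directly (a quick sketch, not part of the original code):
# Counting distinct values per column to spot identifier-like variables;
# distcod should show 162 unique values and district 160, per the text above
sapply(train_df_update, function(x) length(unique(x)))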
# Removing distcod and district from training dataset
train_df_update <- train_df_update %>%
dplyr::select(-distcod, -district)
# Removing distcod and district from validation dataset
valid_df_update <- valid_df_update %>%
dplyr::select(-distcod, -district)
Next, I check the correlations among the numerical predictor variables in the training dataset and remove variables with strong correlations.
# Creating new dataframe for numerical variable correlation checking
numeric_variable <- train_df_update %>%
dplyr::select(enrltot, teachers, calwpct, mealpct, computer, compstu, expnstu, str, avginc, elpct)
# Building correlation table for numeric variable
corr<-cor(numeric_variable)
corr
## enrltot teachers calwpct mealpct computer
## enrltot 1.0000000 0.996896477 0.11518258 0.10378305 0.92918716
## teachers 0.9968965 1.000000000 0.11839760 0.10310069 0.94298807
## calwpct 0.1151826 0.118397604 1.00000000 0.76946534 0.09681333
## mealpct 0.1037830 0.103100690 0.76946534 1.00000000 0.04753449
## computer 0.9291872 0.942988070 0.09681333 0.04753449 1.00000000
## compstu -0.2105159 -0.195368477 -0.25054384 -0.29616398 -0.03371209
## expnstu -0.1560620 -0.136607509 0.02220817 -0.13546234 -0.11400456
## str 0.3492580 0.312109161 0.11554154 0.17787769 0.24593594
## avginc -0.0183107 -0.007620632 -0.57978886 -0.73236345 0.04480089
## elpct 0.3070726 0.309793407 0.39622510 0.67409718 0.25947693
## compstu expnstu str avginc elpct
## enrltot -0.21051589 -0.15606203 0.3492580 -0.018310705 0.3070726
## teachers -0.19536848 -0.13660751 0.3121092 -0.007620632 0.3097934
## calwpct -0.25054384 0.02220817 0.1155415 -0.579788855 0.3962251
## mealpct -0.29616398 -0.13546234 0.1778777 -0.732363453 0.6740972
## computer -0.03371209 -0.11400456 0.2459359 0.044800885 0.2594769
## compstu 1.00000000 0.30821215 -0.4059263 0.341078979 -0.2128365
## expnstu 0.30821215 1.00000000 -0.6669094 0.395116105 -0.0251872
## str -0.40592630 -0.66690943 1.0000000 -0.325406728 0.1424679
## avginc 0.34107898 0.39511611 -0.3254067 1.000000000 -0.3574894
## elpct -0.21283647 -0.02518720 0.1424679 -0.357489360 1.0000000
# Plot a heat map of the correlation matrix to visualize the correlation
ggplot(data = reshape2::melt(corr)) +
geom_tile(aes(x = Var1, y = Var2, fill = value)) +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0)
# Filtering the correlation table to only keep correlations above 0.8 in absolute value
corr[abs(corr) <= 0.8] <- NA
corr_table <- as.data.frame.table(corr)
corr_table <- na.omit(corr_table)
corr_table %>% arrange(Freq)
## Var1 Var2 Freq
## 1 computer enrltot 0.9291872
## 2 enrltot computer 0.9291872
## 3 computer teachers 0.9429881
## 4 teachers computer 0.9429881
## 5 teachers enrltot 0.9968965
## 6 enrltot teachers 0.9968965
## 7 enrltot enrltot 1.0000000
## 8 teachers teachers 1.0000000
## 9 calwpct calwpct 1.0000000
## 10 mealpct mealpct 1.0000000
## 11 computer computer 1.0000000
## 12 compstu compstu 1.0000000
## 13 expnstu expnstu 1.0000000
## 14 str str 1.0000000
## 15 avginc avginc 1.0000000
## 16 elpct elpct 1.0000000
The result above shows that there are three pairs of variables within train_df_update with a correlation above 0.8 (a proxy for multicollinearity issues): computer & enrltot, computer & teachers, and teachers & enrltot.
After consulting the dataset description, I decided to remove enrltot (total enrollment), which eliminates two of the three pairs of highly correlated variables (computer & enrltot and teachers & enrltot). I believe the dataset already contains other variables that proxy these two pairs:
Variable compstu, the number of computers per student, can proxy the information provided by the computer and enrltot pair.
Variable str, the student-teacher ratio, can proxy the information provided by the teachers and enrltot pair.
The other variable I decided to remove was computer (number of computers). This variable can be proxied by the combination of the compstu, str, and teachers variables. The same approach could have been applied to teachers instead of computer by the same logic, but I decided to keep teachers intact because it is more intuitive for me.
The following code removes the enrltot and computer variables from the training and validation datasets:
train_df_update <- train_df_update %>%
dplyr::select(-enrltot, -computer)
valid_df_update <- valid_df_update %>%
dplyr::select(-enrltot, -computer)
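As an optional cross-check (a sketch, not part of the original workflow), variance inflation factors (VIF) could be computed for the remaining numeric predictors; values well below 5 would suggest the remaining collinearity is mild. The car package provides vif(), but a base-R version is sketched here:
# Manual VIF: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the other numeric predictors
num_preds <- c("teachers", "calwpct", "mealpct", "compstu",
               "expnstu", "str", "avginc", "elpct")
vif_values <- sapply(num_preds, function(v) {
  fml <- reformulate(setdiff(num_preds, v), response = v)
  1 / (1 - summary(lm(fml, data = train_df_update))$r.squared)
})
round(sort(vif_values, decreasing = TRUE), 2)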
Picking Los Angeles County. For this purpose I will use the updated training dataset (the same observations as the training dataset, but with fewer columns).
# b. Finding the mean of average reading score in Los Angeles county
mean_readscr_LA <- train_df_update %>%
dplyr::group_by(county)%>%
dplyr::summarize(avgmean_readscr = mean(readscr) %>% round(2)) %>%
dplyr::filter(county=="Los Angeles")
mean_readscr_LA
## # A tibble: 1 × 2
## county avgmean_readscr
## <fct> <dbl>
## 1 Los Angeles 644.
# c. Building SLR with readscr as the outcome and county as the predictor
county_readscr_model <- lm(readscr~county, data=train_df_update)
summary(county_readscr_model)
##
## Call:
## lm(formula = readscr ~ county, data = train_df_update)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.945 -9.064 0.120 10.583 33.386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 669.240 7.244 92.383 < 2e-16 ***
## countyFresno -32.010 8.872 -3.608 0.000423 ***
## countyHumboldt -5.718 9.035 -0.633 0.527830
## countyKern -33.426 8.439 -3.961 0.000116 ***
## countyLos Angeles -25.340 8.299 -3.053 0.002691 **
## countyMerced -34.700 10.245 -3.387 0.000908 ***
## countyOrange -14.929 9.035 -1.652 0.100619
## countyPlacer -1.028 9.235 -0.111 0.911558
## countySan Bernardino -17.873 9.809 -1.822 0.070473 .
## countySan Diego -11.133 8.439 -1.319 0.189176
## countySan Mateo 2.905 8.737 0.333 0.739948
## countySanta Barbara 5.535 10.866 0.509 0.611259
## countySanta Clara -4.307 9.035 -0.477 0.634319
## countyShasta -14.269 9.485 -1.504 0.134653
## countySonoma -5.051 8.142 -0.620 0.536015
## countyTulare -27.227 8.299 -3.281 0.001295 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.2 on 146 degrees of freedom
## Multiple R-squared: 0.4, Adjusted R-squared: 0.3383
## F-statistic: 6.488 on 15 and 146 DF, p-value: 1.816e-10
# d. Checking the predicted value of readscr for LA county
LA_readscr_predict <- predict(county_readscr_model, data.frame(county="Los Angeles"))
paste("Predicted average reading score for LA:", LA_readscr_predict %>% round(2))
## [1] "Predicted average reading score for LA: 643.9"
The mean of readscr within the training dataset for LA county and the value of readscr predicted by the SLR model for LA county are the same, 643.9. This makes sense since no other numerical predictors are involved in the model. If each categorical level is dummified into a binary indicator (0 or 1), a model with only that categorical predictor simply returns the mean of the outcome variable for the chosen category in the training dataset.
Even though I did not dummify the county variable myself, R did it internally using contrasts (see the contrasts() function). In this model, the baseline level is El Dorado county. The predicted value for each county is calculated by adding the county's coefficient to the intercept; for LA, the manually calculated value is 669.24 - 25.34 = 643.9.
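As a quick verification (a sketch, not in the original code), the same number can be reproduced from the fitted coefficients and compared against the LA group mean:
# Reproducing the LA prediction from the coefficients: intercept (baseline
# El Dorado) plus the Los Angeles coefficient
manual_LA <- coef(county_readscr_model)["(Intercept)"] +
  coef(county_readscr_model)["countyLos Angeles"]
round(unname(manual_LA), 2)
# The group mean of readscr for Los Angeles in the training data is identical
round(mean(train_df_update$readscr[train_df_update$county == "Los Angeles"]), 2)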
To build a backward elimination regression, I will use the AIC criterion to determine which combination of predictors yields the best (lowest) AIC. For this purpose, I will use the stepAIC() function from the MASS package.
# Fitting MLR with all of the predictor variable in the updated training dataset
full_model_mlr <- lm(readscr~., data = train_df_update)
# Performing Backward Elimination Stepwise Regression
step_model <- stepAIC(full_model_mlr, direction = "backward")
## Start: AIC=684.76
## readscr ~ county + grspan + teachers + calwpct + mealpct + compstu +
## expnstu + str + avginc + elpct
##
## Df Sum of Sq RSS AIC
## - county 15 1320.50 9471.4 679.09
## - teachers 1 0.47 8151.4 682.77
## - compstu 1 6.10 8157.0 682.88
## - grspan 1 13.14 8164.0 683.02
## - str 1 16.32 8167.2 683.09
## - calwpct 1 53.45 8204.3 683.82
## <none> 8150.9 684.76
## - expnstu 1 264.74 8415.6 687.94
## - avginc 1 314.48 8465.4 688.90
## - elpct 1 491.36 8642.2 692.25
## - mealpct 1 2756.46 10907.4 729.95
##
## Step: AIC=679.09
## readscr ~ grspan + teachers + calwpct + mealpct + compstu + expnstu +
## str + avginc + elpct
##
## Df Sum of Sq RSS AIC
## - compstu 1 5.6 9477.0 677.18
## - calwpct 1 14.4 9485.8 677.33
## - teachers 1 25.5 9496.9 677.52
## - grspan 1 73.6 9544.9 678.34
## - str 1 88.4 9559.8 678.59
## <none> 9471.4 679.09
## - expnstu 1 259.3 9730.7 681.46
## - avginc 1 318.0 9789.4 682.44
## - elpct 1 845.8 10317.2 690.94
## - mealpct 1 4039.3 13510.7 734.63
##
## Step: AIC=677.18
## readscr ~ grspan + teachers + calwpct + mealpct + expnstu + str +
## avginc + elpct
##
## Df Sum of Sq RSS AIC
## - calwpct 1 13.3 9490.3 675.41
## - teachers 1 24.0 9501.0 675.59
## - grspan 1 79.4 9556.4 676.53
## - str 1 82.9 9559.9 676.59
## <none> 9477.0 677.18
## - expnstu 1 256.2 9733.2 679.50
## - avginc 1 313.0 9790.0 680.45
## - elpct 1 844.2 10321.2 689.01
## - mealpct 1 4034.0 13511.0 732.63
##
## Step: AIC=675.41
## readscr ~ grspan + teachers + mealpct + expnstu + str + avginc +
## elpct
##
## Df Sum of Sq RSS AIC
## - teachers 1 31.8 9522.1 673.95
## - grspan 1 85.4 9575.7 674.86
## - str 1 92.6 9582.9 674.98
## <none> 9490.3 675.41
## - expnstu 1 244.3 9734.6 677.53
## - avginc 1 326.8 9817.1 678.89
## - elpct 1 860.3 10350.6 687.47
## - mealpct 1 6839.9 16330.2 761.33
##
## Step: AIC=673.95
## readscr ~ grspan + mealpct + expnstu + str + avginc + elpct
##
## Df Sum of Sq RSS AIC
## - grspan 1 84.7 9606.8 673.39
## <none> 9522.1 673.95
## - str 1 134.8 9656.9 674.23
## - expnstu 1 240.9 9763.0 676.00
## - avginc 1 308.8 9830.9 677.12
## - elpct 1 1041.7 10563.8 688.77
## - mealpct 1 6810.1 16332.2 759.35
##
## Step: AIC=673.39
## readscr ~ mealpct + expnstu + str + avginc + elpct
##
## Df Sum of Sq RSS AIC
## <none> 9606.8 673.39
## - str 1 134.3 9741.1 673.64
## - expnstu 1 254.9 9861.7 675.63
## - avginc 1 316.2 9923.0 676.63
## - elpct 1 968.1 10574.9 686.94
## - mealpct 1 7338.0 16944.8 763.32
# Checking the summary of the Backward Elimination Stepwise Regression
summary(step_model)
##
## Call:
## lm(formula = readscr ~ mealpct + expnstu + str + avginc + elpct,
## data = train_df_update)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.773 -4.353 0.243 4.558 24.434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 673.256813 14.202517 47.404 < 2e-16 ***
## mealpct -0.480764 0.044042 -10.916 < 2e-16 ***
## expnstu 0.002624 0.001289 2.035 0.043576 *
## str -0.665547 0.450646 -1.477 0.141725
## avginc 0.333342 0.147108 2.266 0.024829 *
## elpct -0.193755 0.048866 -3.965 0.000112 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.847 on 156 degrees of freedom
## Multiple R-squared: 0.8495, Adjusted R-squared: 0.8447
## F-statistic: 176.2 on 5 and 156 DF, p-value: < 2.2e-16
The result above shows that the initial AIC of the full model (county + grspan + teachers + calwpct + mealpct + compstu + expnstu + str + avginc + elpct) is about 684.76. After the backward-elimination stepwise regression, the combination of predictors with the lowest AIC (673.39) is mealpct + expnstu + str + avginc + elpct.
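As a small check (not part of the original code), the AIC of the selected model can be extracted directly; note that stepAIC() reports AIC on the extractAIC() scale, which for lm models differs from AIC() by an additive constant:
# AIC of the final model on the same scale used by stepAIC()
extractAIC(step_model)   # the second element should match the reported 673.39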
# a. Calculating SST
train_df_update$outcome_diff <- train_df_update$readscr - mean(train_df_update$readscr)
train_df_update$squared_diff <- train_df_update$outcome_diff^2
paste("SST model metrics:", sum(train_df_update$squared_diff) %>% round(2))
## [1] "SST model metrics: 63846.93"
# b. Calculating SSR
train_df_update$explained <- step_model$fitted.values - mean(train_df_update$readscr)
train_df_update$squared_explained <- train_df_update$explained^2
paste("SSR model metrics:",sum(train_df_update$squared_explained) %>% round(2))
## [1] "SSR model metrics: 54240.13"
# c. Calculating SSR/SST
ssr_sst <- sum(train_df_update$squared_explained)/sum(train_df_update$squared_diff)
paste("SSR/SST model metrics:", ssr_sst %>% round(4))
## [1] "SSR/SST model metrics: 0.8495"
The SSR/SST value can also be found as the Multiple R-squared in the step_model summary, and the two values match (0.8495).
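This equivalence can be confirmed with a one-line check (not in the original code):
# The model's Multiple R-squared should equal SSR/SST
round(summary(step_model)$r.squared, 4)   # 0.8495, matching ssr_sst above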
Getting from a t-value to a p-value. For this question I will pick the expnstu variable, whose associated t-value is 2.035. The following code visualizes the t-distribution for a statistic of 2.035 with 156 degrees of freedom.
# Visualizing the t-distribution
visualize.t(stat = 2.035, df=156)
The graph above shows that approximately 97.8% of the area under the curve is shaded. The p-value for expnstu from the model was 0.043576 ≈ 0.044, which can be calculated using this formula:
p-value = 2 * pt(abs(t value), residual df, lower.tail = FALSE)
Below code was used to calculate p-value from the t-value
# Calculating p-value from t-value
p_value <- 2*pt(abs(2.035), 156 , lower.tail = FALSE)
paste("p-value of the expnstu variable:", p_value %>% round(3))
## [1] "p-value of the expnstu variable: 0.044"
Alternatively, this value can be derived from the t-distribution graph above by subtracting the shaded proportion from 1 and multiplying the result by 2. Done by hand: p-value = (1 - 0.978) * 2 = 0.044.
The F-statistic for my model is 176.2 on 5 and 156 degrees of freedom. In a regression setting, the F-statistic is the ratio of the variability explained by the model (the mean square due to regression, MSR) to the unexplained variability (the mean square error, MSE). If the calculated F-value is greater than the critical F-value, the null hypothesis that all slope coefficients are zero is rejected, indicating that the model explains a significant portion of the variation in the outcome. If the calculated F-value is not greater than the critical F-value, we fail to reject the null hypothesis, indicating there is not enough evidence that the predictors explain the outcome.
The following code calculates the F-statistic value step by step for the regression model:
# Calculating SST of the model. This has been done previously
SST <- sum(train_df_update$squared_diff)
# Calculating SSR of the model. This has been done previously
SSR <- sum(train_df_update$squared_explained)
# Calculating SSE of the model
SSE <- sum(step_model$residuals^2)
# calculate the degrees of freedom for regression
dfr <- length(step_model$coefficients) - 1
# calculate the degrees of freedom for residuals
dfe <- length(step_model$residuals) - length(step_model$coefficients)
# calculate mean square for the regression model
MSR <- SSR / dfr
# calculate the mean square for residuals
MSE <- SSE / dfe
# calculate the F-statistic
F_stats <- MSR / MSE
paste("F-statistics value from the model:",F_stats %>% round(1))
## [1] "F-statistics value from the model: 176.2"
The manually calculated F-statistic above matches the value from the regression model summary: 176.2 on 5 and 156 degrees of freedom.
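To complete the decision rule described earlier (comparing the calculated F to a critical F), the critical value and the p-value can be obtained from the F distribution (a sketch, not part of the original code):
# Critical F at alpha = 0.05 with 5 and 156 degrees of freedom
qf(0.95, df1 = dfr, df2 = dfe)
# p-value of the model F-statistic
pf(F_stats, df1 = dfr, df2 = dfe, lower.tail = FALSE)
# F_stats (176.2) far exceeds the critical value, so the null hypothesis that
# all slope coefficients are zero is rejected, consistent with the reported
# p-value < 2.2e-16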
Predicting a fictional school district's average reading score from fictitious attributes. The following code was used to achieve this objective:
# Predicting the fictional school average reading score
readscr_predict_fictional <- predict(step_model, data.frame(mealpct = 57.5, expnstu = 5500.5, str=19.2, avginc=11.3, elpct=2.3))
paste("Fictional school average reading score:", readscr_predict_fictional %>% round(2))
## [1] "Fictional school average reading score: 650.59"
The model predicts that a fictional school with attributes mealpct = 57.5, expnstu = 5500.5, str = 19.2, avginc = 11.3, and elpct = 2.3 will have an average reading score of around 650.59.
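A point prediction alone hides the uncertainty around it; predict() can also return a prediction interval for the same fictional district (a sketch, not part of the original analysis):
# 95% prediction interval for the fictional district's average reading score
predict(step_model,
        data.frame(mealpct = 57.5, expnstu = 5500.5, str = 19.2,
                   avginc = 11.3, elpct = 2.3),
        interval = "prediction", level = 0.95)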
The accuracy() function from the forecast library is again used to evaluate the stepwise model's prediction accuracy.
# Calculating the accuracy of the prediction made on the training dataset
train_pred_mlr <- predict(step_model, train_df_update)
acc_train_pred_mlr <- data.frame(accuracy(train_pred_mlr, train_df_update$readscr))
acc_train_pred_mlr
## ME RMSE MAE MPE MAPE
## Test set -2.954457e-13 7.700729 5.840883 -0.01389042 0.8927613
# Calculating the accuracy of the prediction made on the validation dataset
val_pred_mlr <- predict(step_model, valid_df_update)
acc_val_pred_mlr <- data.frame(accuracy(val_pred_mlr, valid_df_update$readscr))
acc_val_pred_mlr
## ME RMSE MAE MPE MAPE
## Test set 0.6643407 8.017223 6.177817 0.08112855 0.9483656
# Combining accuracy output from training and validation dataset for better view and comparison
model_accuracy_mlr <- rbind(acc_train_pred_mlr, acc_val_pred_mlr)
rownames(model_accuracy_mlr) <- c("Training Dataset", "Validation Dataset")
model_accuracy_mlr
## ME RMSE MAE MPE MAPE
## Training Dataset -2.954457e-13 7.700729 5.840883 -0.01389042 0.8927613
## Validation Dataset 6.643407e-01 8.017223 6.177817 0.08112855 0.9483656
The output above shows that both RMSE and MAE are higher when the model predicts on the validation dataset than on the training dataset; in other words, the model performs better on the training dataset than on the validation dataset, the same trend seen with the previous SLR model.
Now, to compare the SLR and MLR performance, I combine the error measures of both models on both the training and validation datasets:
model_accuracy_all <- rbind(acc_train_pred_mlr, accuracy_train_predicted, acc_val_pred_mlr, accuracy_valid_predicted)
rownames(model_accuracy_all) <- c("Training Dataset MLR", "Training Dataset SLR", "Validation Dataset MLR", "Validation Dataset SLR")
model_accuracy_all
## ME RMSE MAE MPE MAPE
## Training Dataset MLR -2.954457e-13 7.700729 5.840883 -0.01389042 0.8927613
## Training Dataset SLR -3.768505e-13 8.761614 6.961201 -0.01782601 1.0625976
## Validation Dataset MLR 6.643407e-01 8.017223 6.177817 0.08112855 0.9483656
## Validation Dataset SLR -4.492851e-02 9.161150 7.394335 -0.03206485 1.1329637
Fundamentally, the more parameters/variables we introduce into a model, the greater the risk of overfitting. Based on that, Multiple Linear Regression carries more overfitting risk than Simple Linear Regression simply because the MLR uses more variables. But since the number of observations is the same for both models and reasonably large (162), I believe the risk of overfitting is relatively low.
In terms of performance measured by the error metrics, RMSE and MAE specifically, the MLR performs slightly better than the SLR, as indicated by the lower RMSE and MAE shown in the table.