Introduction

This GitHub repository presents a project that constructs a K-Nearest-Neighbor (KNN) classifier model using a Spotify dataset in R. The primary objective is to predict the song preferences of a fictional individual named Putra. The model classifies songs as either “like” or “dislike” based on their inherent characteristics.

The Spotify dataset, obtained from the popular music streaming platform, encompasses a diverse range of attributes, including song duration, tempo, danceability, energy, and more. Each song in the dataset has been labeled as either “like” or “dislike” according to Putra’s personal preferences. Two distinct datasets were utilized for this analysis:

  1. spotify.csv: This dataset served as the training set for the KNN model.

  2. spot100.csv: This dataset was utilized as a song pool for prediction purposes, determining whether a selected song would be liked or disliked by Putra.

Importing Relevant Libraries

Before answering the assignment questions, the relevant libraries need to be imported. The following code loads them.

library(tidyverse)
library(naniar)
library(caret)
library(FNN)
library(carData)
library(gridExtra)
library(e1071)

Importing and Manipulating Dataset

1. Importing the spot100.csv dataset and picking one song for prediction purposes

# Importing the dataset into the R environment
data <- read.csv("spot100.csv")

The song titled “Photograph” will be picked for prediction purposes.

data[data$name == "Photograph", ]
##                        id       name duration energy key loudness mode
## 23 4cj6Ti4wOXZ5ZdWlUZxRSP Photograph     4.32  0.379   4   -10.48    1
##    speechiness acousticness instrumentalness liveness valence   tempo
## 23      0.0359        0.607         0.000472   0.0986    0.22 108.033
##    danceability
## 23        0.718
song <- data[data$name == "Photograph", ]

The picked song has the following attributes:

- danceability: 0.718
- energy: 0.379
- loudness: -10.48
- speechiness: 0.0359
- acousticness: 0.607
- instrumentalness: 0.000472
- liveness: 0.0986
- tempo: 108.033
- duration: 4.32 (minutes)
- valence: 0.22

2. Importing and exploring the spotify.csv dataset

spotify <- read.csv("spotify.csv")
str(spotify)
## 'data.frame':    2017 obs. of  17 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ acousticness    : num  0.0102 0.199 0.0344 0.604 0.18 0.00479 0.0145 0.0202 0.0481 0.00208 ...
##  $ danceability    : num  0.833 0.743 0.838 0.494 0.678 0.804 0.739 0.266 0.603 0.836 ...
##  $ duration_ms     : int  204600 326933 185707 199413 392893 251333 241400 349667 202853 226840 ...
##  $ energy          : num  0.434 0.359 0.412 0.338 0.561 0.56 0.472 0.348 0.944 0.603 ...
##  $ instrumentalness: num  2.19e-02 6.11e-03 2.34e-04 5.10e-01 5.12e-01 0.00 7.27e-06 6.64e-01 0.00 0.00 ...
##  $ key             : int  2 1 2 5 5 8 1 10 11 7 ...
##  $ liveness        : num  0.165 0.137 0.159 0.0922 0.439 0.164 0.207 0.16 0.342 0.571 ...
##  $ loudness        : num  -8.79 -10.4 -7.15 -15.24 -11.65 ...
##  $ mode            : int  1 1 1 1 0 1 1 0 0 1 ...
##  $ speechiness     : num  0.431 0.0794 0.289 0.0261 0.0694 0.185 0.156 0.0371 0.347 0.237 ...
##  $ tempo           : num  150.1 160.1 75 86.5 174 ...
##  $ time_signature  : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ valence         : num  0.286 0.588 0.173 0.23 0.904 0.264 0.308 0.393 0.398 0.386 ...
##  $ target          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ song_title      : chr  "Mask Off" "Redbone" "Xanny Family" "Master Of None" ...
##  $ artist          : chr  "Future" "Childish Gambino" "Future" "Beach House" ...

The outcome variable, target, is stored as an int in the original dataset. As per the prompt instructions, it needs to be converted into a factor (categorical) variable before serving as the response variable of the model. The following code performs the conversion.

spotify$target <- as.factor(spotify$target)
str(spotify$target)
##  Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

The target variable has two unique values/levels, 0 and 1, where 0 means Putra dislikes the song and 1 means he likes it. The following code tabulates the occurrences of each value in the dataset.

table(spotify$target)
## 
##    0    1 
##  997 1020

In this dataset, the “0” outcome class has 997 records while the “1” outcome class has 1020 records. No NA values were observed in this variable, and the two classes occur at roughly the same frequency, so the outcome is balanced.
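To see the balance as proportions rather than raw counts, the same table can be passed to prop.table(); a quick sketch:

# Expressing the outcome class counts as proportions of all records
# (997/2017 and 1020/2017, i.e. roughly 49.4% vs 50.6%)
round(prop.table(table(spotify$target)), 3)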

3. Checking for missing values in the spotify.csv dataset

# Tabulating the missing value and its percentage using naniar library
miss_var_summary(spotify)
## # A tibble: 17 × 3
##    variable         n_miss pct_miss
##    <chr>             <int>    <dbl>
##  1 X                     0        0
##  2 acousticness          0        0
##  3 danceability          0        0
##  4 duration_ms           0        0
##  5 energy                0        0
##  6 instrumentalness      0        0
##  7 key                   0        0
##  8 liveness              0        0
##  9 loudness              0        0
## 10 mode                  0        0
## 11 speechiness           0        0
## 12 tempo                 0        0
## 13 time_signature        0        0
## 14 valence               0        0
## 15 target                0        0
## 16 song_title            0        0
## 17 artist                0        0
# Reconfirming with a manual calculation
cat("Count of missing values in the Spotify dataset is", is.na(spotify) %>% sum())
## Count of missing values in the Spotify dataset is 0

Fortunately, the spotify dataset has no missing values, so no imputation is needed.
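Had any values been missing, naniar can also visualize the pattern of missingness; a minimal sketch:

# Heatmap-style overview of missing vs present cells per variable
vis_miss(spotify)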

4. Removing unnecessary variables from the spotify.csv dataset

spotify <- spotify %>%
  select(-X, -key, -mode, -time_signature)
colnames(spotify)
##  [1] "acousticness"     "danceability"     "duration_ms"      "energy"          
##  [5] "instrumentalness" "liveness"         "loudness"         "speechiness"     
##  [9] "tempo"            "valence"          "target"           "song_title"      
## [13] "artist"

Partitioning the spotify.csv dataset into training and validation sets

# Setting seed for reproducibility
set.seed(250)

# Randomly sampling 60% of the row indices without replacement for the training set
train_index <- sample(c(1:nrow(spotify)), nrow(spotify)*0.6) 

# Partition the dataset into training and validation set based on the index sampling
train_df <- spotify[train_index, ]
valid_df <- spotify[-train_index, ]

# Resetting the index for both train_df and valid_df for ease of subsetting
rownames(train_df) <- NULL
rownames(valid_df) <- NULL

# Ensuring the partitioned dataset has been properly set
paste("number of rows for original dataset:", nrow(spotify))
## [1] "number of rows for original dataset: 2017"
paste("number of rows for training set:", nrow(train_df))
## [1] "number of rows for training set: 1210"
paste("number of rows for validation set:", nrow(valid_df))
## [1] "number of rows for validation set: 807"

The spotify dataset has been partitioned into a training set (train_df) and a validation set (valid_df) containing 60% and 40% of the original observations, respectively.
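Since the sampling is random rather than stratified, it is worth a quick check that both partitions preserve the roughly balanced outcome observed earlier; a sketch:

# Outcome class proportions within each partition
round(prop.table(table(train_df$target)), 3)
round(prop.table(table(valid_df$target)), 3)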

Performing two-sample t-tests for each of the numeric variables

Assuming the data follow the normality assumption, t-tests were conducted to check whether specific predictor variables could provide a “meaningful” signal for predicting whether Putra will like a song.

To do this, the training set was split into two subsets, one with outcome 1 and the other with outcome 0. A Student’s t-test was then performed on each numeric predictor, comparing its values between the two subsets.

# Split the dataframe into two groups based on target variable (0 and 1)
spotify_group0 <- train_df[train_df$target == 0, ]
spotify_group1 <- train_df[train_df$target == 1, ]

# Loop through columns and perform t-test for each of the numeric variables between the two groups
for (col in names(train_df)) {
  if (is.numeric(train_df[[col]]) && col != "target") {
    t_test <- t.test(spotify_group1[[col]], spotify_group0[[col]])
    print(paste("Variable:", col))
    print(paste("t-statistic:", t_test$statistic %>% round(2)))
    print(paste("p-value:", t_test$p.value))
  }
}
## [1] "Variable: acousticness"
## [1] "t-statistic: -3.94"
## [1] "p-value: 8.81019895574373e-05"
## [1] "Variable: danceability"
## [1] "t-statistic: 6.13"
## [1] "p-value: 1.19234524346942e-09"
## [1] "Variable: duration_ms"
## [1] "t-statistic: 5.73"
## [1] "p-value: 1.29028893195551e-08"
## [1] "Variable: energy"
## [1] "t-statistic: 0.74"
## [1] "p-value: 0.460580108809561"
## [1] "Variable: instrumentalness"
## [1] "t-statistic: 6.3"
## [1] "p-value: 4.25318350249433e-10"
## [1] "Variable: liveness"
## [1] "t-statistic: 0.08"
## [1] "p-value: 0.936400923035821"
## [1] "Variable: loudness"
## [1] "t-statistic: -3.42"
## [1] "p-value: 0.000648131744104459"
## [1] "Variable: speechiness"
## [1] "t-statistic: 5.14"
## [1] "p-value: 3.17004823925582e-07"
## [1] "Variable: tempo"
## [1] "t-statistic: 1.77"
## [1] "p-value: 0.0763991926342424"
## [1] "Variable: valence"
## [1] "t-statistic: 2.73"
## [1] "p-value: 0.00637579481606816"

To screen for variables that differ significantly between the two groups, a 0.05 significance level was used. Based on this alpha threshold, the following variables are statistically significantly different between the two groups:

Numeric Variables: acousticness, danceability, duration_ms, instrumentalness, loudness, speechiness, valence

The variables whose t-test p-values exceed 0.05 are: energy, liveness, tempo.

The variables that are not statistically significantly different will be removed before the KNN model is developed. Removing them makes sense because, based on the t-test results, they would not provide sufficient “power” to discriminate between the two outcome classes.

Put simply, we cannot tell whether the difference in outcome between 0 and 1 is driven by these variables, because their characteristic, here measured by the mean, appears to be the “same” in both groups. Including them would add model complexity without improving predictive power.

In this case, the songs Putra liked and disliked appear to have more or less the same energy, liveness, and tempo.
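As an alternative to reading the printed loop output, the same screening can be collected into a single table, which makes the 0.05 cut-off easy to apply programmatically; a sketch reusing the two groups defined above:

# Gathering the per-variable t-test p-values into one data frame
num_vars <- names(train_df)[sapply(train_df, is.numeric) & names(train_df) != "target"]
t_results <- data.frame(
  variable = num_vars,
  p_value  = sapply(num_vars, function(v) t.test(spotify_group1[[v]], spotify_group0[[v]])$p.value)
)

# Predictors failing the 0.05 threshold (energy, liveness, and tempo, per the results above)
t_results$variable[t_results$p_value > 0.05]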

The following code removes the energy, liveness, and tempo variables from the datasets.

# Removing variables from original dataset and rearranging outcome column for ease of handling
spotify <- spotify %>%
  select(-energy, -liveness, -tempo) %>%
  select(-target,everything(), target)

# Removing variables from training dataset and rearranging outcome column for ease of handling
train_df <- train_df %>%
  select(-energy, -liveness, -tempo) %>%
  select(-target,everything(), target)

# Removing variables from validation dataset and rearranging outcome column for ease of handling
valid_df <- valid_df %>%
  select(-energy, -liveness, -tempo) %>%
  select(-target,everything(), target)

Preprocessing dataset

To normalize the data, I preprocessed all datasets relevant to this question: the original variable-removed spotify dataset as well as the training and validation sets.

I did not include the song_title and artist variables because, being of character type, they do not need to be normalized. I also do not think these variables are useful as KNN model inputs, since they merely identify the title and artist of each song in Putra’s list; the attributes of these songs are already described by the other numeric predictors. The two variables become relevant later, when finding the n nearest neighbors of the picked song.

For the picked song, I kept only the numeric variables that correspond to the numeric predictors in the training dataset, removing the id and name variables from the picked song dataframe. I then converted the song’s duration to milliseconds to match the unit used in the training data.

The following code normalizes the data.

# Initializing normalized training, validation data, complete dataframe to originals
train_norm_df <- train_df
valid_norm_df <- valid_df
spotify_norm <- spotify

# Using preProcess() from the caret package to normalize the predictor variables
norm_values <- preProcess(train_df[,1:7], method=c("center", "scale"))
train_norm_df[,1:7] <- predict(norm_values, train_df[,1:7])
valid_norm_df[,1:7] <- predict(norm_values, valid_df[,1:7])
spotify_norm[,1:7] <- predict(norm_values, spotify[,1:7])

# Preparing the picked song dataframe for the knn model input; column order must
# match train_norm_df[,1:7] because FNN::knn compares matrices by position
song <- song %>%
  mutate(duration_ms = duration * 60 * 1000) %>%
  select(acousticness, danceability, duration_ms, instrumentalness, loudness, speechiness, valence)

# Normalizing my picked song dataframe
song_norm <- predict(norm_values, song)
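As a sanity check, the normalized training predictors should now have means near 0 and standard deviations near 1, since norm_values was fitted on the training set; a quick sketch:

# Verifying the centering and scaling on the training partition
round(colMeans(train_norm_df[, 1:7]), 3)
round(apply(train_norm_df[, 1:7], 2, sd), 3)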

Building KNN Model and Predicting the Picked Song Preference

# Creating knn model to predict the classification of my picked song
song_nn <- knn(train=train_norm_df[,1:7], test=song_norm, cl=train_norm_df[,10], k=7)

# Checking the summary of the knn model prediction, including the 7 nearest neighbors' indices and distances
attributes(song_nn)
## $levels
## [1] "0"
## 
## $class
## [1] "factor"
## 
## $nn.index
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  583  660  817 1206  234  391  622
## 
## $nn.dist
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
## [1,] 1.001672 1.106059 1.129466 1.209548 1.264968 1.275437 1.280182

Based on the KNN model prediction, the picked song, “Photograph” by Ed Sheeran, will probably not catch Putra’s attention because he is unlikely to like it.

The model returned the outcome class “0” for the picked song, which means Putra is predicted to dislike it.

The song’s 7 nearest neighbors in Putra’s song list, including the artist and respective outcome class, are as follows:

# Checking the 7 nearest neighbors of the picked song in Putra's song list
train_df[(row.names(train_df)[attr(song_nn, "nn.index")]),8:10]
##                               song_title       artist target
## 583         Morning Call - Remix Version        Ibadi      0
## 660  Petals on The Wind (Remake Version)        Wable      0
## 817               Summer Night You and I Standing Egg      0
## 1206               Body Like A Back Road     Sam Hunt      0
## 234                         Just a Phase   Adam Craig      0
## 391                          Mannish Boy Muddy Waters      1
## 622                       1-800-273-8255        Logic      0

From the table above, 6 of the picked song’s 7 nearest neighbors are labeled “0”, i.e. songs Putra did not like. Based on the majority rule, KNN classified the picked song as “0”, so Putra will probably not like it.
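The majority vote behind this classification can also be tallied directly from the model's neighbor indices; a small sketch:

# Tabulating the outcome labels of the 7 nearest neighbors
# (six "0" labels and one "1", per the table above)
table(train_df$target[attr(song_nn, "nn.index")])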

Determining the optimal k value using the validation set

# Initialize a data frame with two columns: k, and accuracy
accuracy_df <- data.frame(k=seq(1,14,1), accuracy=rep(0, 14))

# Compute knn for different k on validation
for(i in 1:14){
  knn.pred <- knn(train_norm_df[,1:7], valid_norm_df[,1:7], 
                  cl = train_norm_df[,10], k=i)
  accuracy_df[i,2] <- confusionMatrix(knn.pred, valid_norm_df[,10])$overall[1] %>% round(3)
}

accuracy_df
##     k accuracy
## 1   1    0.690
## 2   2    0.675
## 3   3    0.714
## 4   4    0.686
## 5   5    0.714
## 6   6    0.699
## 7   7    0.720
## 8   8    0.720
## 9   9    0.726
## 10 10    0.722
## 11 11    0.722
## 12 12    0.714
## 13 13    0.703
## 14 14    0.705

The table of k values and their associated accuracies shows that the model reaches its highest accuracy at k = 9. Therefore, k = 9 will be chosen to develop the updated KNN model.
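Rather than reading the table by eye, the best k can also be extracted programmatically (in case of ties, which.max() returns the smallest such k); a quick sketch:

# Selecting the k with the highest validation accuracy (9, given the table above)
best_k <- accuracy_df$k[which.max(accuracy_df$accuracy)]
best_k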

Plotting k-value vs Model Accuracy

ggplot(accuracy_df, mapping=aes(x=k, y=accuracy))+
  geom_point(color='steelblue', size=2)+
  theme_light()+
  ggtitle("Scatterplot of k Value vs Accuracy") +
  xlab("Number of k") + 
  ylab("Model Accuracy")+
  scale_x_continuous(breaks=accuracy_df$k)
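Optionally, the chosen k can be highlighted on the same plot with a vertical reference line; a sketch reusing best_k from the snippet above:

# Same scatterplot with a dashed line marking the selected k
ggplot(accuracy_df, mapping=aes(x=k, y=accuracy)) +
  geom_point(color='steelblue', size=2) +
  geom_vline(xintercept=best_k, linetype="dashed", color="grey40") +
  theme_light() +
  ggtitle("k Value vs Accuracy with Selected k") +
  xlab("Number of k") +
  ylab("Model Accuracy") +
  scale_x_continuous(breaks=accuracy_df$k)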

Building KNN Model with k=9 and Predicting the Picked Song Preference

# Creating knn model to predict the classification of my picked song with k=9
song_nn2 <- knn(train=train_norm_df[,1:7], test=song_norm, cl=train_norm_df[,10], k=9)

# Checking the summary of the knn model prediction, including the 9 nearest neighbors' indices and distances with k=9
attributes(song_nn2)
## $levels
## [1] "0"
## 
## $class
## [1] "factor"
## 
## $nn.index
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,]  583  660  817 1206  234  391  622    6  130
## 
## $nn.dist
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
## [1,] 1.001672 1.106059 1.129466 1.209548 1.264968 1.275437 1.280182 1.306181
##          [,9]
## [1,] 1.345578

The k = 9 model classified the picked song as “0”, so Putra will probably not like it. This matches the prediction of the earlier k = 7 model, which makes sense given that the two models’ validation accuracies differ by only 0.6 percentage points (0.720 vs. 0.726, about a 0.8% relative difference).

The picked song’s 9 nearest neighbors in Putra’s song list, including the artist and respective outcome class, are as follows:

# Checking the 9 nearest neighbors of the picked song in Putra's song list
train_df[(row.names(train_df)[attr(song_nn2, "nn.index")]),8:10]
##                               song_title       artist target
## 583         Morning Call - Remix Version        Ibadi      0
## 660  Petals on The Wind (Remake Version)        Wable      0
## 817               Summer Night You and I Standing Egg      0
## 1206               Body Like A Back Road     Sam Hunt      0
## 234                         Just a Phase   Adam Craig      0
## 391                          Mannish Boy Muddy Waters      1
## 622                       1-800-273-8255        Logic      0
## 6                                Grandad   Clive Dunn      0
## 130                               For Me        Gyepy      0

From the table above, 8 of the picked song’s 9 nearest neighbors are labeled “0”, i.e. songs Putra did not like. Based on the majority rule, KNN again classified the picked song as “0”, so Putra will probably not like it.