Introduction

This GitHub repository contains a data mining analysis of Airbnb rental properties in Monserrat, Buenos Aires, Argentina. The analysis delivers two key outputs to support decision-making by property owners and prospective tenants.

1. Property Descriptive Analytics:

2. Property Predictive Analytics:

These comprehensive data mining analysis results serve as valuable assets for property owners and prospective tenants in Monserrat. By leveraging data-driven insights, users can make informed decisions and enhance their ability to choose suitable rental properties.

Explore this repository to access the analysis code, datasets, and detailed documentation. Make data-driven choices for your property investments or find the perfect Airbnb rental in Monserrat with confidence.

Importing Relevant Libraries

library(tidyverse)
library(readr)
library(naniar)
library(jsonlite)
library(tidyr)
library(tidytext)
library(wordcloud)
library(leaflet)
library(scales)
library(ggbeeswarm)
library(rpart)  
library(rpart.plot)  
library(corrplot)
library(caret)
library(e1071)
library(forecast)
library(FNN)

Importing Dataset

master_data <- read_csv("buenos.csv")
data <- master_data  # working copy; master_data stays untouched for the later joins

Step I: Data Preparation & Exploration

1. Missing Values (including Data Cleaning and Manipulation)

Please note that each step below explains what we did and why, and a summary follows at the end of the steps.

  1. Filtering to Monserrat Neighborhood
data <- data %>%
  filter(neighbourhood_cleansed=='Monserrat')
  1. Checking for Missing Values
miss_var_summary(data)
## # A tibble: 75 × 3
##    variable                     n_miss pct_miss
##    <chr>                         <int>    <dbl>
##  1 neighbourhood_group_cleansed    902    100  
##  2 bathrooms                       902    100  
##  3 calendar_updated                902    100  
##  4 license                         883     97.9
##  5 host_about                      392     43.5
##  6 neighborhood_overview           375     41.6
##  7 neighbourhood                   375     41.6
##  8 host_neighbourhood              356     39.5
##  9 host_location                   202     22.4
## 10 review_scores_accuracy          190     21.1
## # ℹ 65 more rows
  1. Removing variables with more than 50% missing values: neighbourhood_group_cleansed, bathrooms, calendar_updated, and license
data <- data%>%
  select(-neighbourhood_group_cleansed, -bathrooms, -calendar_updated, -license)
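The column-dropping rule above can also be sketched in base R, without naniar. This is a minimal illustration on a made-up data frame; the toy data and the threshold variable are assumptions for the example and are not part of the original analysis.

```r
# Toy data frame standing in for the listings table (values are made up)
toy <- data.frame(
  id        = 1:4,
  license   = c(NA, NA, NA, "A-1"),
  bedrooms  = c(1, NA, 2, 1),
  bathrooms = c(NA, NA, NA, NA)
)

# Percentage of missing values per column
pct_miss <- colMeans(is.na(toy)) * 100

# Columns whose missing share exceeds the (assumed) 50% threshold
threshold <- 50
drop_cols <- names(pct_miss)[pct_miss > threshold]

# Keep every other column
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_clean)
```

Deriving the column names from the threshold, rather than typing them by hand, keeps the step valid if the dataset is refreshed and the missing-value pattern changes.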
  1. Checking for other variables that might not be useful for this particular analysis
head(data)
## # A tibble: 6 × 71
##       id listing_url             scrape_id last_scraped source name  description
##    <dbl> <chr>                       <dbl> <date>       <chr>  <chr> <chr>      
## 1 143663 https://www.airbnb.com…   2.02e13 2023-03-29   city … Apar… "<b>The sp…
## 2  16695 https://www.airbnb.com…   2.02e13 2023-03-29   city … DUPL… "<b>The sp…
## 3 148284 https://www.airbnb.com…   2.02e13 2023-03-29   city … Sunn… "Sunny apa…
## 4  23798 https://www.airbnb.com…   2.02e13 2023-03-29   city … STUN…  <NA>      
## 5  31514 https://www.airbnb.com…   2.02e13 2023-03-29   city … BEAU… "The Duple…
## 6  42450 https://www.airbnb.com…   2.02e13 2023-03-29   city … Fren… "This refi…
## # ℹ 64 more variables: neighborhood_overview <chr>, picture_url <chr>,
## #   host_id <dbl>, host_url <chr>, host_name <chr>, host_since <date>,
## #   host_location <chr>, host_about <chr>, host_response_time <chr>,
## #   host_response_rate <chr>, host_acceptance_rate <chr>,
## #   host_is_superhost <lgl>, host_thumbnail_url <chr>, host_picture_url <chr>,
## #   host_neighbourhood <chr>, host_listings_count <dbl>,
## #   host_total_listings_count <dbl>, host_verifications <chr>, …
  • Removing the following variables from the dataset:

listing_url, scrape_id, last_scraped, source, description, neighborhood_overview, picture_url, host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, calendar_last_scraped, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms, first_review, and last_review.

  • Collecting the descriptive and URL columns into their own data frame, which can later be LEFT JOINed on the id column.

  • Collecting the host information into its own data frame, which can later be LEFT JOINed on the id column.

# Removing non-essential variable from main dataset
data <- data %>%
  select(-listing_url, -scrape_id, -last_scraped, -source, -description,
         -neighborhood_overview, -picture_url, -host_id, -host_url, -host_name,
         -host_since, -host_location, -host_about, -host_response_time,
         -host_response_rate, -host_acceptance_rate, -host_is_superhost,
         -host_thumbnail_url, -host_picture_url, -host_neighbourhood,
         -host_listings_count, -host_total_listings_count, -host_verifications,
         -host_has_profile_pic, -host_identity_verified, -calendar_last_scraped,
         -calculated_host_listings_count,
         -calculated_host_listings_count_entire_homes,
         -calculated_host_listings_count_private_rooms,
         -calculated_host_listings_count_shared_rooms,
         -first_review, -last_review)

# Creating host related information dataset
host <- master_data %>%
  filter(neighbourhood_cleansed=="Monserrat") %>%
  select(id, host_id, host_url, host_name, host_since, host_location, host_about, 
         host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, 
         host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count,
         host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified,
         calculated_host_listings_count, calculated_host_listings_count_entire_homes,
         calculated_host_listings_count_private_rooms,
         calculated_host_listings_count_shared_rooms)

# Creating Description information dataset
desc <- master_data %>%
  filter(neighbourhood_cleansed=="Monserrat")%>%
  select(id, description, neighborhood_overview, picture_url, listing_url)
  1. Checking for missing values in the main dataset
miss_var_summary(data)
## # A tibble: 39 × 3
##    variable                    n_miss pct_miss
##    <chr>                        <int>    <dbl>
##  1 neighbourhood                  375    41.6 
##  2 review_scores_accuracy         190    21.1 
##  3 review_scores_cleanliness      190    21.1 
##  4 review_scores_checkin          190    21.1 
##  5 review_scores_communication    190    21.1 
##  6 review_scores_location         190    21.1 
##  7 review_scores_value            190    21.1 
##  8 review_scores_rating           188    20.8 
##  9 reviews_per_month              188    20.8 
## 10 bedrooms                        88     9.76
## # ℹ 29 more rows

Based on the output above, the following decisions were made:

  1. neighbourhood: this variable adds nothing to the analysis because it duplicates the information in “neighbourhood_cleansed”. It will be removed.

  2. All review scores: observations with missing review scores will be removed. Reviews are subjective user input, so imputing them would introduce bias into the dataset.

  3. reviews_per_month: observations with missing values will be removed, as imputing them might introduce bias into the dataset.

  4. bedrooms, beds, and bathrooms_text: observations with missing values will be removed. These values are property-specific, and imputing them might misrepresent the characteristics of the property.

# Removing neighbourhood column
data <- data %>%
  select (-neighbourhood)

# Removing missing observations from above-mentioned variables
data <- subset(data, complete.cases(review_scores_accuracy,
                                     review_scores_checkin,
                                     review_scores_cleanliness,
                                     review_scores_communication,
                                     review_scores_location,
                                     review_scores_value,
                                     review_scores_rating,
                                     reviews_per_month,
                                     bedrooms,
                                     beds,
                                     bathrooms_text))
  1. Manipulating bathroom data
# extract the numerical value from the bathrooms_text variable
data$bathrooms <- as.numeric(gsub("[^[:digit:]./]", "", data$bathrooms_text))

# create a new variable to indicate whether the bathroom is shared or not
data$shared_bathroom <- ifelse(grepl("shared", data$bathrooms_text, ignore.case = TRUE), "Yes", "No")

# handle cases where bathrooms_text is "shared bath" or missing
data$bathrooms[grepl("shared", data$bathrooms_text, ignore.case = TRUE) |
               is.na(data$bathrooms_text)] <- NA

# handle cases where bathrooms_text is "0 bath" or "0.5 bath"
data$bathrooms[data$bathrooms == 0] <- 0.5
data$bathrooms[data$bathrooms == 0.5 & grepl("shared", data$bathrooms_text, ignore.case = TRUE)] <- NA

# handle cases where bathrooms_text is "X shared bath" or "X.X shared bath"
# (the same index is used on both sides so the two lengths always match)
shared_idx <- grepl("shared", data$bathrooms_text, ignore.case = TRUE) &
              !grepl("0\\.5", data$bathrooms_text) &
              !is.na(data$bathrooms_text)
data$bathrooms[shared_idx] <- as.numeric(gsub("[^[:digit:]./]", "", data$bathrooms_text[shared_idx]))

# replace missing values with the median number of bathrooms
data$bathrooms[is.na(data$bathrooms)] <- median(data$bathrooms, na.rm = TRUE)
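As a hedged alternative to the step-by-step overwrites above, the count and the shared flag could each be derived in a single pass. The helper name parse_bathrooms and the sample strings are hypothetical, and treating "half-bath" text as 0.5 is an assumption of this sketch rather than something the original code does (the original leaves such rows to median imputation).

```r
# Hypothetical helper: turn bathroom text into a numeric count
parse_bathrooms <- function(txt) {
  # Text such as "Half-bath" carries no digits; this sketch treats it as 0.5
  is_half <- grepl("half", txt, ignore.case = TRUE)
  # Extract the first integer or decimal number found in the text
  num <- suppressWarnings(
    as.numeric(sub("^[^0-9]*([0-9]+\\.?[0-9]*).*$", "\\1", txt))
  )
  num[is_half & is.na(num)] <- 0.5
  num
}

# Made-up examples of the text formats
bath_text <- c("1 bath", "1.5 shared baths", "Half-bath",
               "Shared half-bath", NA, "0 baths")

bathrooms <- parse_bathrooms(bath_text)
shared    <- grepl("shared", bath_text, ignore.case = TRUE)
```

Keeping the number and the shared flag as two independent derivations avoids the intermediate NA assignments, so a later step cannot silently undo an earlier one.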
  1. Manipulating amenities data
# split the string column into a list column
data$amenities_list <- lapply(data$amenities, jsonlite::fromJSON)

# specify the maximum length of the list
max_len <- max(lengths(data$amenities_list))

# pad shorter lists with NA values
data$amenities_list <- lapply(data$amenities_list, `length<-`, max_len)

# convert the list column to wide format
data <- unnest_wider(data, col = amenities_list, names_sep = "_")

# converting all amenities columns into categorical
data <- data %>%
  mutate(across(starts_with("amenities_list_"), as.factor))
  1. Merging the previously split datasets into a new merged dataset
merged_data <- left_join(data,host, by='id')
merged_data <- left_join(merged_data, desc, by='id')
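A left join keeps the row count of the left table only when the join key is unique in the right table. The sketch below illustrates that check with base R's merge(); the two small data frames and their values are made up for the example.

```r
# Made-up stand-ins for the main table and the host table
listings <- data.frame(id = c(1, 2, 3), price = c(100, 200, 300))
hosts    <- data.frame(id = c(1, 2), host_name = c("Ana", "Luis"))

# Guard: duplicated ids in the right table would multiply rows after the join
stopifnot(anyDuplicated(hosts$id) == 0)

# all.x = TRUE keeps every row of listings (a left join)
merged <- merge(listings, hosts, by = "id", all.x = TRUE)
merged
```

Listing 3 has no host record, so its host_name is NA after the join while its row is still present.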
  1. Grouping the amenities for simplification
merged_data$Kitchen <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Kitchen", x)) > 0, 1, 0) })
merged_data$Wifi <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Wifi", x)) > 0, 1, 0) })
merged_data$Air_conditioning <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Air conditioning", x)) > 0, 1, 0) })
merged_data$Elevator <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Elevator", x)) > 0, 1, 0) })
merged_data$Dishes_and_silverware <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Dishes and silverware", x)) > 0, 1, 0) })
merged_data$Washer <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Washer", x)) > 0, 1, 0) })
merged_data$Body_soap <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Body soap", x)) > 0, 1, 0) })
merged_data$Microwave <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Microwave", x)) > 0, 1, 0) })
merged_data$Paid_parking_off_premises <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Paid parking off premises", x)) > 0, 1, 0) })
merged_data$TV <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("TV", x)) > 0, 1, 0) })
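The indicator columns above depend on fixed column positions (41:107), which shift whenever a column is added or removed upstream. As a hedged alternative, the same kind of flags could be computed directly from the raw amenities string. The toy strings and the targets vector below are assumptions made for illustration.

```r
# Made-up amenities strings in the JSON-array style of the listings file
amenities <- c(
  '["Wifi", "Kitchen", "HDTV with Netflix"]',
  '["Air conditioning", "Dishwasher"]',
  '["Washer", "Wifi"]'
)

# Amenities to flag; the names become the new column names
targets <- c(Kitchen = "Kitchen", Wifi = "Wifi", Washer = "Washer", TV = "TV")

# One 0/1 column per target, matched as a literal, case-sensitive substring
flags <- as.data.frame(
  sapply(targets, function(p) as.integer(grepl(p, amenities, fixed = TRUE)))
)

# Number of flagged amenities per listing
flags$total_amenities <- unname(rowSums(flags))
flags
```

Like grepl() in the original code, this is substring matching: "TV" also matches "HDTV with Netflix", while the capitalized "Washer" does not match "Dishwasher".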
  1. Grouping Availability information for simplification

The availability columns were binned into two binary variables:

  • short term: the mean of availability_30, availability_60, and availability_90. If this value is at or above the mean of the short-term column across all listings, the flag is 1; otherwise it is 0.

  • long term: availability_365. If the value is at or above the mean of the availability_365 column, the flag is 1; otherwise it is 0.

A value of 1 means the property has at least average availability for that horizon, while 0 means it does not.

Two more columns were created in this step to aid the analysis:

  • years: the number of years elapsed between the host's host_since date and the reference date 2023-05-05.

  • total_amenities: the number of grouped amenities each listing offers. It is the sum of the nine flags from Kitchen through Paid_parking_off_premises; the TV flag is not included in the sum.

merged_data <- merged_data %>%
  mutate(mean_short = (availability_30+availability_60+availability_90)/3) %>%
  mutate(short_term_availability = ifelse(mean_short<mean(mean_short), 0, 1)) %>%
  mutate(long_term_availability = ifelse(availability_365 < mean(availability_365), 0,1)) %>%
  mutate(start_date = as.Date(host_since)) %>%
  mutate(end_date = as.Date("2023-05-05")) %>%
  mutate(years = as.numeric(difftime(end_date, start_date, units = "days")))%>%
  mutate(years = years/365) %>%
  mutate(total_amenities = Kitchen+Wifi+Air_conditioning+Elevator+Dishes_and_silverware+Washer
         +Body_soap+Microwave+Paid_parking_off_premises)
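The binning and the years calculation can be checked on a few made-up rows. This base R sketch follows the same rules (a flag of 1 when the value is at or above the column mean, and days divided by 365); the toy values are assumptions for illustration.

```r
# Made-up availability values and host start dates
toy <- data.frame(
  availability_30  = c(0, 10, 30),
  availability_60  = c(0, 20, 60),
  availability_90  = c(0, 30, 90),
  availability_365 = c(100, 50, 365),
  host_since       = as.Date(c("2013-05-05", "2020-05-05", "2022-05-05"))
)

# Short-term score: row-wise mean of the three short horizons
mean_short <- rowMeans(toy[, c("availability_30", "availability_60", "availability_90")])

# 1 if at or above the column mean, otherwise 0
toy$short_term_availability <- as.integer(mean_short >= mean(mean_short))
toy$long_term_availability  <- as.integer(toy$availability_365 >= mean(toy$availability_365))

# Years since host_since, with the unit stated explicitly
ref_date  <- as.Date("2023-05-05")
toy$years <- as.numeric(difftime(ref_date, toy$host_since, units = "days")) / 365
toy
```

Stating units = "days" makes the division by 365 safe regardless of how difftime() would otherwise choose its unit. Dividing by 365 slightly overstates long spans that include leap days.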
  1. Housekeeping on host data, price data, and property/room type data
# Remove "N/A" value in the host data
merged_data <- subset(merged_data, host_response_time != "N/A")
merged_data <- subset(merged_data, host_response_rate != "N/A")
merged_data <- subset(merged_data, host_acceptance_rate != "N/A")

# Converting host response rate and acceptance rate into numeric
merged_data$host_response_rate <- as.numeric(gsub("%", "", merged_data$host_response_rate))/100
merged_data$host_acceptance_rate <- as.numeric(gsub("%", "", merged_data$host_acceptance_rate))/100

# Preparing price data
merged_data$price <- gsub("\\$|,", "", merged_data$price)
merged_data$price <- as.numeric(merged_data$price)

# Converting room and property type data into categorical
merged_data$property_type <- as.factor(merged_data$property_type)
merged_data$room_type <- as.factor(merged_data$room_type)

The “N/A” values in host_response_time, host_response_rate, and host_acceptance_rate were removed because they make up a small proportion of the dataset, and imputing them might introduce bias.
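The string-to-number conversions in this step can be tried on a few made-up values. The sample strings below are assumptions that mimic the formats handled above ("$" and "," in prices, "%" and "N/A" in rates).

```r
# Made-up raw values in the formats handled above
price_raw <- c("$8,076.00", "$262,857.00", "$175.00")
rate_raw  <- c("100%", "87%", "N/A")

# Strip the currency symbol and thousands separators, then convert
price <- as.numeric(gsub("[$,]", "", price_raw))

# Drop the literal "N/A" strings first, then convert percentages to proportions
keep <- rate_raw != "N/A"
rate <- as.numeric(sub("%", "", rate_raw[keep], fixed = TRUE)) / 100

price
rate
```

Filtering out "N/A" before calling as.numeric() avoids the coercion warning and the silent NA that the conversion would otherwise produce.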

  1. Converting variables into categorical
merged_data <- merged_data %>%
  mutate(across(c(room_type, instant_bookable, shared_bathroom,
                  host_response_time, host_is_superhost, host_identity_verified,
                  Kitchen, Wifi, Air_conditioning, Elevator,
                  Dishes_and_silverware, Washer, Body_soap, Microwave,
                  Paid_parking_off_premises, short_term_availability,
                  long_term_availability), as.factor))
  1. Selecting columns that we want to focus on
data_new <- merged_data %>%
  select(id, name, latitude, longitude, property_type, room_type, price,
         accommodates, bedrooms, beds,bathrooms,shared_bathroom, minimum_nights, 
         maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, 
         review_scores_cleanliness, review_scores_checkin, 
         review_scores_communication, review_scores_location, review_scores_value,
         instant_bookable, Kitchen, Wifi, Air_conditioning, Elevator, Dishes_and_silverware,
         Washer, Body_soap, Microwave, Paid_parking_off_premises,
         total_amenities, short_term_availability, 
         long_term_availability, years, host_id, host_response_time,
         host_response_rate, host_acceptance_rate, host_is_superhost, host_identity_verified)
  1. Exporting the data to CSV format so it can be shared with the rest of the team
write.csv(data_new, file = "data_new.csv", row.names = FALSE)

TLDR:

We removed several variables that we believed would not add much value to the analysis we are focusing on. We also removed observations with “N/A” or missing values because we believed the data could not be imputed without introducing significant bias. Finally, we did some feature engineering on several variables to simplify the modeling and analysis.

2. Summary Statistics

For the Airbnb data in the Monserrat neighborhood, the following variables are summarized by room type:

  • Price
  • Bedrooms
  • Bathrooms: Private and Shared
  • Accommodates
  • Overall Review Scores
  • Total Amenities

Below are the summary statistics for each variable.

  1. Summary Statistics: Price
price_stats <- data_new %>%
  group_by(room_type) %>%
  summarise(
    observation = n(),            
    mean_price = mean(price, na.rm = TRUE),  
    sd_price = sd(price, na.rm = TRUE),      
    median_price = median(price, na.rm = TRUE), 
    min_price = min(price, na.rm = TRUE),     
    max_price = max(price, na.rm = TRUE))
price_stats
## # A tibble: 4 × 7
##   room_type     observation mean_price sd_price median_price min_price max_price
##   <fct>               <int>      <dbl>    <dbl>        <dbl>     <dbl>     <dbl>
## 1 Entire home/…         439     10242.   16226.        8076       1861    262857
## 2 Hotel room              7     12340.    3660.       13017       6390     16920
## 3 Private room          103      4610.    2213.        4142       2018     16642
## 4 Shared room            14      3317.    2611.        2682.       175     11596
  1. Summary Statistics: Bathrooms Private
bathrooms_private_stats <- data_new %>%
  filter(shared_bathroom=="No") %>%
  group_by(room_type) %>%
  summarise(
    observation = n(),            
    mean_bathrooms = mean(bathrooms, na.rm = TRUE), 
    sd_bathrooms = sd(bathrooms, na.rm = TRUE),      
    median_bathrooms = median(bathrooms, na.rm = TRUE), 
    min_bathrooms = min(bathrooms, na.rm = TRUE), 
    max_bathrooms = max(bathrooms, na.rm = TRUE))
bathrooms_private_stats
## # A tibble: 3 × 7
##   room_type       observation mean_bathrooms sd_bathrooms median_bathrooms
##   <fct>                 <int>          <dbl>        <dbl>            <dbl>
## 1 Entire home/apt         439           1.15        0.460                1
## 2 Hotel room                4           1           0                    1
## 3 Private room             20           1.42        0.766                1
## # ℹ 2 more variables: min_bathrooms <dbl>, max_bathrooms <dbl>
  1. Summary Statistics: Bathrooms Shared
bathrooms_shared_stats <- data_new %>%
  filter(shared_bathroom=="Yes") %>%
  group_by(room_type) %>%
  summarise(
    observation = n(),            
    mean_bathrooms = mean(bathrooms, na.rm = TRUE), 
    sd_bathrooms = sd(bathrooms, na.rm = TRUE),      
    median_bathrooms = median(bathrooms, na.rm = TRUE), 
    min_bathrooms = min(bathrooms, na.rm = TRUE), 
    max_bathrooms = max(bathrooms, na.rm = TRUE))
bathrooms_shared_stats
## # A tibble: 3 × 7
##   room_type    observation mean_bathrooms sd_bathrooms median_bathrooms
##   <fct>              <int>          <dbl>        <dbl>            <dbl>
## 1 Hotel room             3           2.67         1.15                2
## 2 Private room          83           2.08         2.05                1
## 3 Shared room           14           2.21         1.31                3
## # ℹ 2 more variables: min_bathrooms <dbl>, max_bathrooms <dbl>
  1. Summary Statistics: Bedrooms
bedrooms_stats <- data_new %>%
  group_by(room_type) %>%
  summarise(
    observation = n(),            
    mean_bedrooms = mean(bedrooms, na.rm = TRUE), 
    sd_bedrooms = sd(bedrooms, na.rm = TRUE),      
    median_bedrooms = median(bedrooms, na.rm = TRUE), 
    min_bedrooms = min(bedrooms, na.rm = TRUE), 
    max_bedrooms = max(bedrooms, na.rm = TRUE))
bedrooms_stats
## # A tibble: 4 × 7
##   room_type   observation mean_bedrooms sd_bedrooms median_bedrooms min_bedrooms
##   <fct>             <int>         <dbl>       <dbl>           <dbl>        <dbl>
## 1 Entire hom…         439          1.33       0.647               1            1
## 2 Hotel room            7          1          0                   1            1
## 3 Private ro…         103          2.21       2.72                1            1
## 4 Shared room          14          1          0                   1            1
## # ℹ 1 more variable: max_bedrooms <dbl>
  1. Summary Statistics: Accommodates
accommodates_stats <- data_new %>%
  group_by(room_type) %>%
  summarise(
    observation = n(),            
    mean_accommodates = mean(accommodates, na.rm = TRUE), 
    sd_accommodates = sd(accommodates, na.rm = TRUE),      
    median_accommodates = median(accommodates, na.rm = TRUE), 
    min_accommodates = min(accommodates, na.rm = TRUE), 
    max_accommodates = max(accommodates, na.rm = TRUE))
accommodates_stats
## # A tibble: 4 × 7
##   room_type    observation mean_accommodates sd_accommodates median_accommodates
##   <fct>              <int>             <dbl>           <dbl>               <dbl>
## 1 Entire home…         439              3.11           1.35                  3  
## 2 Hotel room             7              2              0.577                 2  
## 3 Private room         103              2.15           2.32                  2  
## 4 Shared room           14              2.36           1.86                  1.5
## # ℹ 2 more variables: min_accommodates <dbl>, max_accommodates <dbl>
  1. Summary Statistics: Overall Review Scores
review_stats <- data_new %>%
  group_by(room_type) %>%
  summarise(
    observation = n(),            
    mean_review = mean(review_scores_rating, na.rm = TRUE), 
    sd_review= sd(review_scores_rating, na.rm = TRUE),      
    median_review = median(review_scores_rating, na.rm = TRUE), 
    min_review = min(review_scores_rating, na.rm = TRUE), 
    max_review = max(review_scores_rating, na.rm = TRUE))
review_stats
## # A tibble: 4 × 7
##   room_type       observation mean_review sd_review median_review min_review
##   <fct>                 <int>       <dbl>     <dbl>         <dbl>      <dbl>
## 1 Entire home/apt         439        4.71     0.434          4.83          1
## 2 Hotel room                7        4.27     0.694          4.38          3
## 3 Private room            103        4.62     0.479          4.8           3
## 4 Shared room              14        4.74     0.422          5             4
## # ℹ 1 more variable: max_review <dbl>
  1. Summary Statistics: Total Amenities
amenities_stats <- data_new %>%
  group_by(room_type) %>%
  summarise(
    observation = n(),            
    mean_amenities = mean(total_amenities, na.rm = TRUE), 
    sd_amenities= sd(total_amenities, na.rm = TRUE),      
    median_amenities = median(total_amenities, na.rm = TRUE), 
    min_amenities = min(total_amenities, na.rm = TRUE), 
    max_amenities = max(total_amenities, na.rm = TRUE))
amenities_stats
## # A tibble: 4 × 7
##   room_type       observation mean_amenities sd_amenities median_amenities
##   <fct>                 <int>          <dbl>        <dbl>            <dbl>
## 1 Entire home/apt         439           5.44        1.44                 5
## 2 Hotel room                7           5.86        0.378                6
## 3 Private room            103           4.37        1.28                 4
## 4 Shared room              14           4.36        1.34                 5
## # ℹ 2 more variables: min_amenities <dbl>, max_amenities <dbl>
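The six summaries above repeat the same pattern with a different column each time. One way to avoid the repetition is a small helper, sketched here in base R. The function name summarise_by_group and the toy data are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: count, mean, sd, median, min, and max of one column per group
summarise_by_group <- function(df, value_col, group_col) {
  # Six statistics for one numeric vector, ignoring missing values
  six_stats <- function(x) {
    c(observation = length(x),
      mean   = mean(x, na.rm = TRUE),
      sd     = sd(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      min    = min(x, na.rm = TRUE),
      max    = max(x, na.rm = TRUE))
  }
  groups <- split(df[[value_col]], df[[group_col]])
  out <- as.data.frame(do.call(rbind, lapply(groups, six_stats)))
  out <- cbind(group = rownames(out), out)
  rownames(out) <- NULL
  out
}

# Made-up prices for two room types
toy <- data.frame(
  room_type = c("Private room", "Private room",
                "Shared room", "Shared room", "Shared room"),
  price     = c(4000, 5000, 1000, 3000, 2000)
)

price_stats <- summarise_by_group(toy, "price", "room_type")
price_stats
```

With such a helper, each table in this section would reduce to one call, for example with "bedrooms" or "accommodates" as the value column.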

Summary

Monserrat offers travelers a range of Airbnb properties. Entire homes or apartments are by far the most common option (439 listings), far outnumbering private rooms (103), shared rooms (14), and hotel rooms (7). Hotel rooms have the highest mean price (about 12,340), with entire homes or apartments second (about 10,242), although the hotel-room figure rests on only seven listings.

For those who value their privacy, a listing that specifies a private bathroom is essential. Listings with private bathrooms average roughly one bathroom across room types (1.00 to 1.42), while those with shared bathrooms average roughly two (2.08 to 2.67). Private rooms have the highest mean number of bedrooms, at about two (2.21). Entire homes or apartments accommodate the most guests on average, at about three people (3.11).

For amenities, hotel rooms have the highest mean total (5.86), followed by entire homes or apartments (5.44). Despite these differences, mean review ratings are similar across room types (4.27 to 4.74), which suggests that guest satisfaction is fairly consistent across the board.

With these options to choose from, Monserrat has something for most types of travelers.

3. Data Visualization

For the Airbnb data in the Monserrat neighborhood, the following are visualized by room type:

  • Population
  • Price
  • Overall Review
  • Amenities Count
  • Price Trends on Different Accommodates

Below are the plots for each of these.

  1. Room Type Population
ggplot(data_new, aes(x = room_type, y = after_stat(count), fill = room_type)) +
  geom_bar(alpha = 0.7, width = 0.5) +
  labs(x = "Room Type", y = "Count", fill = "Room Type") +
  scale_fill_manual(values = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
        legend.position = "bottom",
        axis.title = element_text(face = "bold", size = 14),
        axis.text = element_text(size = 12)) +
  ggtitle("Number of Property Based on Room Type")

  1. Room Type Price

Note: there are two outlier price points for the “Entire home/apt” room type (262,857 USD and 216,521 USD). These two outliers were removed to make the visualization easier to read.

# Remove two maximum values of price for entire home/apt
data_new_clean <- data_new %>%
  filter(!(room_type == "Entire home/apt" & price %in% tail(sort(price), 2)))

ggplot(data_new_clean, aes(x = room_type, y = price, fill = room_type)) +
  geom_boxplot(alpha = 0.7, width = 0.5) +
  labs(x = "Room Type", y = "Price", fill = "Room Type") +
  scale_fill_manual(values = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")) +
  scale_y_continuous(labels = dollar_format(prefix = "$")) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
        legend.position = "bottom",
        axis.title = element_text(face = "bold", size = 14),
        axis.text = element_text(size = 12)) +
  ggtitle("Price Distribution by Room Type")

  1. Overall Review Score per Room Type
ggplot(data_new, aes(x = room_type, y = review_scores_rating, fill = room_type)) +
  geom_violin(scale = "width", alpha = 0.7) +
  labs(x = "Room Type", y = "Review Scores Rating", fill = "Room Type") +
  scale_fill_manual(values = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
        legend.position = "bottom",
        axis.title = element_text(face = "bold", size = 14),
        axis.text = element_text(size = 12)) +
  ggtitle("Review Scores Rating Distribution by Room Type")

  1. Distribution of Amenities Number per Room Type
# Calculate the sum of each amenity by room type
amenities_sum_by_roomtype <- data_new %>%
  select(room_type, Kitchen, Wifi, Air_conditioning, Elevator, Dishes_and_silverware, Washer, Body_soap, Microwave) %>%
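  # the flags are factors, so convert via character to recover 0/1 rather than the level codes
  mutate(across(Kitchen:Microwave, ~ as.numeric(as.character(.x)))) %>%
  group_by(room_type) %>%
  summarise(across(everything(), sum))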

# Reshape data to long format for plotting
amenities_sum_by_roomtype_long <- amenities_sum_by_roomtype %>%
  pivot_longer(cols = -room_type, names_to = "amenity", values_to = "count") %>%
  arrange(room_type, desc(count))

# Create stacked bar plot
ggplot(amenities_sum_by_roomtype_long, aes(x = amenity, y = count, fill = room_type)) +
  geom_col() +
  scale_fill_manual(values = c("#F8766D", "#00BA38", "#619CFF", "#DA3B3A")) +
  labs(x = "Amenities", y = "Number of Listings", fill = "Room Type") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
        legend.position = "bottom",
        axis.title = element_text(face = "bold", size = 14),
        axis.text = element_text(size = 12),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Amenities by Room Type")
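Because the amenity flags were stored as factors during data preparation, turning them back into numbers for summing needs care. The short sketch below, on a made-up flag vector, shows why as.numeric() alone is not enough.

```r
# A made-up 0/1 flag stored as a factor, as in the prepared dataset
flag <- factor(c(0, 1, 1, 0))

# as.numeric() on a factor returns the internal level codes (1 and 2 here)
codes <- as.numeric(flag)

# Going through character first recovers the original 0/1 values
values <- as.numeric(as.character(flag))

sum(codes)   # counts every listing at least once
sum(values)  # counts only listings that have the amenity
```

Summing the level codes adds one for every row, so the totals per room type would be inflated by the number of listings in each group.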

  1. Price Trend on Different Accommodates Capacity per Room Type
my_colors <- c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")

ggplot(data_new, aes(x = accommodates, y = price, color = room_type)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_manual(values = my_colors) +
  scale_y_continuous(labels = dollar_format(prefix = "$")) +
  labs(x = "Accommodates", y = "Price", color = "Room Type") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
        legend.position = "bottom",
        axis.title = element_text(face = "bold", size = 14),
        axis.text = element_text(size = 12)) +
  ggtitle("Scatterplot of Price and Accommodates by Room Type")

Summary

Vacationers in Monserrat have an array of Airbnb properties to choose from. Entire homes or apartments dominate the market with 439 listings, followed by private rooms with 103 listings. In contrast, hotel rooms (7) and shared rooms (14) account for very few listings.

On price, hotel rooms are the most expensive option on average, followed by entire homes or apartments, and shared rooms are the cheapest. Entire homes or apartments show the broadest range of prices of all room types, so budget-conscious travelers can still find lower-priced listings in this category.

The review ratings for all room types are relatively consistent, with no large differences among them. However, entire homes or apartments have the broadest range of review ratings, with a mean of 4.71 but a minimum of 1. This highlights the importance of reading reviews thoroughly before booking.

In the stacked bar chart, entire homes or apartments account for most of the amenity counts, largely because they are the most numerous room type. On a per-listing basis, hotel rooms have a slightly higher mean number of amenities (5.86 versus 5.44), so travelers for whom amenities are essential should compare individual listings.

The scatterplot shows no clear relationship between the number of guests a listing accommodates and its price for any room type. Larger groups may therefore be able to find a budget-friendly stay without paying more for the extra capacity.

All in all, Monserrat is an excellent location for vacationers, with Airbnb properties offering something for everyone.

4. Mapping

m <- leaflet() %>%
  addTiles() %>%
  addCircles(data = data_new, lng = ~longitude, lat = ~latitude) %>%
  addProviderTiles(providers$JusticeMap.income)
m

Description:

The Monserrat neighborhood is adjacent to the nature reserve and Laguna de los Patos. Beyond nature, Monserrat has notable landmarks such as the Casa Rosada and Plaza de Mayo. The Casa Rosada is the presidential palace of Argentina and serves as the executive office of the President; Plaza de Mayo is a historic public square that has been the site of many important political events in Argentina’s history.

5. Word Cloud

# Split neighborhood_overview column into words and create a new dataframe
words <- master_data %>%
  select(neighborhood_overview) %>%
  unnest_tokens(word, neighborhood_overview)

# Create a custom list of stop words
custom_stopwords <- c(stop_words$word, "de", "la")

# Remove stop words and create a word frequency table
word_freq <- words %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(data.frame(word = custom_stopwords), by = "word") %>%
  count(word, sort = TRUE)
# Set the size of the graphics device
options(repr.plot.width = 8, repr.plot.height = 8)

# Generate a word cloud
wordcloud(words = word_freq$word, freq = word_freq$n, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.5, 
          colors = brewer.pal(8, "Dark2"))

The words in the word cloud are all related to the neighborhoods and landmarks in Buenos Aires, and their prominence in the word cloud can provide insights into the most frequent and important words in the neighborhood overview column of the Buenos Aires Airbnb dataset.

“Br” most likely comes from HTML line-break tags (<br />) embedded in the listing text rather than from a Spanish word: the tokenizer strips the angle brackets and slash and leaves “br” as a token. Its prominence therefore reflects how the text is formatted, not what it says. “San” means “Saint” in Spanish and is common in place names, so its appearance suggests that the neighborhood overview column refers to streets, districts, or landmarks such as San Telmo.

“Telmo” refers to the San Telmo neighborhood in Buenos Aires, which is known for its historic architecture, tango culture, and antique markets. Its appearance in the word cloud suggests that the neighborhood overview column may include descriptions of this neighborhood and its characteristics.

“Buenos” and “Aires” refer to the city of Buenos Aires, which is the capital of Argentina and one of the largest cities in South America. The appearance of these terms in the word cloud suggests that the neighborhood overview column may include descriptions of different neighborhoods and landmarks within the city.

“Mayo” refers to the Plaza de Mayo, which is a public square in the heart of Buenos Aires that is known for its historical and political significance. Its appearance in the word cloud suggests that the neighborhood overview column may include descriptions of this landmark and its role in the city’s history.

“Plaza” refers to public squares and plazas, which are common features in many neighborhoods in Buenos Aires. Its appearance in the word cloud suggests that the neighborhood overview column may include descriptions of different plazas and their characteristics.
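As a rough illustration of how token counts like these arise, the sketch below counts words in three made-up overview strings using base R only. The `overview` strings, the short stop word list, and the `count_words()` helper are hypothetical and are not taken from the dataset. The sketch shows that an embedded `<br />` tag is counted as the word "br" unless tags are stripped before tokenizing.

```r
# Hypothetical overview texts (not from the dataset)
overview <- c(
  "Close to Plaza de Mayo.<br />Walk to San Telmo.",
  "Historic area near Plaza de Mayo<br /><br />Great cafes.",
  "San Telmo market is nearby."
)

# Lowercase, split on anything that is not a letter, and count tokens
count_words <- function(text, stopwords = character(0)) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[nchar(tokens) > 0 & !(tokens %in% stopwords)]
  sort(table(tokens), decreasing = TRUE)
}

# Hypothetical stop word list for this toy example
stopwords <- c("de", "la", "to", "is", "the")

# Without cleaning, the HTML tag shows up as the token "br"
raw_counts <- count_words(overview, stopwords)

# Stripping tags first removes that artifact
clean_text <- gsub("<[^>]+>", " ", overview)
clean_counts <- count_words(clean_text, stopwords)

print(raw_counts["br"])
print("br" %in% names(clean_counts))
```

In the analysis itself, the same effect could be achieved by removing tags from neighborhood_overview before tokenizing, or by adding "br" to the custom stop word list.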

Step II: Prediction

The multiple regression model was constructed in the following steps.

  1. Defining data
MLR <- data_new
  1. Convert the price variable using a log transformation
MLR$price <- log(MLR$price)
  1. Checking the uniqueness of the categorical variables.

By combining the length() and unique() functions, we counted the number of unique values in the identifier, location, and categorical variables. Based on the results, we removed id, host_id, name, latitude, and longitude from the dataset used for multiple linear regression (MLR): they are identifiers or coordinates with hundreds of distinct values, so they are not useful as regression predictors.

Furthermore, the property_type variable is a subtype of the room_type variable, and as such, it will also be removed.

# Looking for number of unique value
length(unique(MLR$id))
## [1] 563
length(unique(MLR$host_id))
## [1] 352
length(unique(MLR$name))
## [1] 536
length(unique(MLR$latitude))
## [1] 428
length(unique(MLR$longitude))
## [1] 464
length(unique(MLR$property_type))
## [1] 22
length(unique(MLR$room_type))
## [1] 4
length(unique(MLR$host_response_time))
## [1] 4
# Removing the id, host_id, name, property_type, latitude, and longitude variable
MLR_clean <- subset(MLR, select=c(-id, -host_id, -name, -property_type, -latitude, -longitude))
  1. Check the numeric variables’ correlation.

According to the results below, three pairs of variables have a correlation coefficient of at least 0.80: review_scores_rating & review_scores_accuracy (0.844), review_scores_rating & review_scores_value (0.834), and review_scores_accuracy & review_scores_value (0.804). To reduce multicollinearity, the review_scores_rating and review_scores_value variables will be removed from the dataset, leaving review_scores_accuracy.

library(corrplot)

# Calculating correlation between numeric variables
Corr <- cor(MLR_clean %>% 
              select(c(accommodates, bedrooms, beds, 
                       minimum_nights, maximum_nights,number_of_reviews,
                       review_scores_rating, review_scores_accuracy, 
                       review_scores_cleanliness,review_scores_checkin,
                       review_scores_communication, review_scores_location,
                       review_scores_value, bathrooms, host_response_rate, 
                       host_acceptance_rate, years)))
print(Corr)
##                             accommodates    bedrooms         beds
## accommodates                  1.00000000  0.39163873  0.776193337
## bedrooms                      0.39163873  1.00000000  0.531683944
## beds                          0.77619334  0.53168394  1.000000000
## minimum_nights               -0.13467965  0.17777513 -0.027669107
## maximum_nights                0.10563102  0.04177133  0.103623324
## number_of_reviews             0.04056428 -0.06457375  0.011063160
## review_scores_rating          0.02338411 -0.03305670 -0.025240309
## review_scores_accuracy        0.04624737 -0.03039344 -0.016314203
## review_scores_cleanliness     0.03412964 -0.06843784 -0.038610731
## review_scores_checkin         0.07016071  0.03541220  0.032680840
## review_scores_communication   0.07481748  0.02091280  0.046527560
## review_scores_location        0.01134373  0.02137466  0.031129775
## review_scores_value           0.05199863 -0.06058039  0.004536392
## bathrooms                     0.35909931  0.53828307  0.529161985
## host_response_rate            0.02777254 -0.02327931 -0.005885384
## host_acceptance_rate         -0.05074998 -0.05396403 -0.118515085
## years                         0.06097251  0.02816177  0.008185991
##                             minimum_nights maximum_nights number_of_reviews
## accommodates                  -0.134679647    0.105631016        0.04056428
## bedrooms                       0.177775127    0.041771334       -0.06457375
## beds                          -0.027669107    0.103623324        0.01106316
## minimum_nights                 1.000000000    0.103833378       -0.11052285
## maximum_nights                 0.103833378    1.000000000        0.03490829
## number_of_reviews             -0.110522847    0.034908288        1.00000000
## review_scores_rating          -0.003509136    0.022874314        0.06583003
## review_scores_accuracy        -0.001719427    0.030143104        0.10748644
## review_scores_cleanliness     -0.106790913    0.010408507        0.10563770
## review_scores_checkin         -0.019165431   -0.020121012        0.08515405
## review_scores_communication    0.016305678    0.071009596        0.07786747
## review_scores_location         0.055976142    0.004447184        0.07514351
## review_scores_value           -0.058117135    0.061768776        0.10479946
## bathrooms                      0.145787850    0.134600981       -0.04965522
## host_response_rate            -0.040275899    0.043639074        0.09637063
## host_acceptance_rate          -0.072882698   -0.066730768        0.11985212
## years                          0.114052990    0.030606622        0.18649473
##                             review_scores_rating review_scores_accuracy
## accommodates                         0.023384108            0.046247367
## bedrooms                            -0.033056704           -0.030393437
## beds                                -0.025240309           -0.016314203
## minimum_nights                      -0.003509136           -0.001719427
## maximum_nights                       0.022874314            0.030143104
## number_of_reviews                    0.065830033            0.107486439
## review_scores_rating                 1.000000000            0.844042444
## review_scores_accuracy               0.844042444            1.000000000
## review_scores_cleanliness            0.761956529            0.752591986
## review_scores_checkin                0.626726336            0.624085514
## review_scores_communication          0.618736136            0.557996063
## review_scores_location               0.484105027            0.471564816
## review_scores_value                  0.834415398            0.803564620
## bathrooms                           -0.080196253           -0.105078662
## host_response_rate                   0.086576097            0.111045472
## host_acceptance_rate                 0.077495718            0.116594755
## years                                0.036039269            0.035634714
##                             review_scores_cleanliness review_scores_checkin
## accommodates                               0.03412964            0.07016071
## bedrooms                                  -0.06843784            0.03541220
## beds                                      -0.03861073            0.03268084
## minimum_nights                            -0.10679091           -0.01916543
## maximum_nights                             0.01040851           -0.02012101
## number_of_reviews                          0.10563770            0.08515405
## review_scores_rating                       0.76195653            0.62672634
## review_scores_accuracy                     0.75259199            0.62408551
## review_scores_cleanliness                  1.00000000            0.52823467
## review_scores_checkin                      0.52823467            1.00000000
## review_scores_communication                0.40101096            0.65093460
## review_scores_location                     0.35933422            0.45327165
## review_scores_value                        0.69753638            0.58032383
## bathrooms                                 -0.13364166           -0.08655867
## host_response_rate                         0.05542183            0.09059675
## host_acceptance_rate                       0.16030786            0.09636437
## years                                      0.04380719            0.05063207
##                             review_scores_communication review_scores_location
## accommodates                                 0.07481748            0.011343732
## bedrooms                                     0.02091280            0.021374658
## beds                                         0.04652756            0.031129775
## minimum_nights                               0.01630568            0.055976142
## maximum_nights                               0.07100960            0.004447184
## number_of_reviews                            0.07786747            0.075143511
## review_scores_rating                         0.61873614            0.484105027
## review_scores_accuracy                       0.55799606            0.471564816
## review_scores_cleanliness                    0.40101096            0.359334218
## review_scores_checkin                        0.65093460            0.453271650
## review_scores_communication                  1.00000000            0.422406115
## review_scores_location                       0.42240611            1.000000000
## review_scores_value                          0.52970001            0.514722162
## bathrooms                                   -0.02987541           -0.034024904
## host_response_rate                           0.04161985            0.048193586
## host_acceptance_rate                        -0.01617061            0.003187999
## years                                        0.10942055            0.052420460
##                             review_scores_value   bathrooms host_response_rate
## accommodates                        0.051998634  0.35909931        0.027772542
## bedrooms                           -0.060580389  0.53828307       -0.023279309
## beds                                0.004536392  0.52916199       -0.005885384
## minimum_nights                     -0.058117135  0.14578785       -0.040275899
## maximum_nights                      0.061768776  0.13460098        0.043639074
## number_of_reviews                   0.104799458 -0.04965522        0.096370626
## review_scores_rating                0.834415398 -0.08019625        0.086576097
## review_scores_accuracy              0.803564620 -0.10507866        0.111045472
## review_scores_cleanliness           0.697536376 -0.13364166        0.055421835
## review_scores_checkin               0.580323826 -0.08655867        0.090596748
## review_scores_communication         0.529700012 -0.02987541        0.041619848
## review_scores_location              0.514722162 -0.03402490        0.048193586
## review_scores_value                 1.000000000 -0.09202214        0.097950024
## bathrooms                          -0.092022136  1.00000000       -0.090084227
## host_response_rate                  0.097950024 -0.09008423        1.000000000
## host_acceptance_rate                0.075671278 -0.14541633        0.439467106
## years                              -0.009963601  0.02339579        0.004264235
##                             host_acceptance_rate        years
## accommodates                        -0.050749977  0.060972512
## bedrooms                            -0.053964029  0.028161773
## beds                                -0.118515085  0.008185991
## minimum_nights                      -0.072882698  0.114052990
## maximum_nights                      -0.066730768  0.030606622
## number_of_reviews                    0.119852122  0.186494732
## review_scores_rating                 0.077495718  0.036039269
## review_scores_accuracy               0.116594755  0.035634714
## review_scores_cleanliness            0.160307863  0.043807187
## review_scores_checkin                0.096364374  0.050632067
## review_scores_communication         -0.016170611  0.109420548
## review_scores_location               0.003187999  0.052420460
## review_scores_value                  0.075671278 -0.009963601
## bathrooms                           -0.145416330  0.023395789
## host_response_rate                   0.439467106  0.004264235
## host_acceptance_rate                 1.000000000 -0.009850326
## years                               -0.009850326  1.000000000
# Plotting the correlation
corrplot(Corr, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

# Removing the review_scores_rating and review_scores_value variable
MLR_fix <- subset(MLR_clean, select=c(-review_scores_rating, -review_scores_value))
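Scanning a large correlation matrix by eye is error-prone, so the pairs above the threshold can also be listed programmatically. The sketch below is a minimal base R illustration on toy data. The `toy` data frame and its columns are hypothetical, and the 0.80 cutoff simply mirrors the threshold used in this step.

```r
# Toy data: x2 is nearly a copy of x1, x3 is independent noise
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),
  x3 = rnorm(200)
)

corr <- cor(toy)

# Keep each pair once by looking only at the upper triangle
high <- which(abs(corr) >= 0.80 & upper.tri(corr), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
print(high_pairs)
```

Applied to the Corr matrix, the same pattern would list each highly correlated pair once, after which one variable from each pair can be dropped.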
  1. Data Partitioning.

Using the sample() function, 60% of the rows of the MLR_fix data frame were randomly assigned to train.df, and the remaining 40% were assigned to valid.df.

set.seed(62)
train.index <- sample(c(1:nrow(MLR_fix)), nrow(MLR_fix)*0.6)
train.df <- MLR_fix[train.index, ]
valid.df <- MLR_fix[-train.index, ]
  1. Creating multiple regression model with all variables in training dataset
MLR_all <- lm(price~ ., data=train.df)
summary(MLR_all)
## 
## Call:
## lm(formula = price ~ ., data = train.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08776 -0.25897 -0.02813  0.24625  3.08743 
## 
## Coefficients: (1 not defined because of singularities)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           8.540e+00  5.946e-01  14.363  < 2e-16 ***
## room_typeHotel room                   2.969e-01  3.442e-01   0.862  0.38918    
## room_typePrivate room                -5.178e-01  1.566e-01  -3.306  0.00106 ** 
## room_typeShared room                 -1.132e+00  2.502e-01  -4.526 8.68e-06 ***
## accommodates                          1.340e-01  2.550e-02   5.256 2.80e-07 ***
## bedrooms                             -2.155e-02  2.917e-02  -0.739  0.46059    
## beds                                 -2.800e-02  3.012e-02  -0.930  0.35322    
## bathrooms                             2.223e-02  4.235e-02   0.525  0.60006    
## shared_bathroomYes                   -1.484e-01  1.566e-01  -0.948  0.34395    
## minimum_nights                       -4.218e-03  3.925e-03  -1.075  0.28344    
## maximum_nights                        9.415e-05  6.035e-05   1.560  0.11978    
## number_of_reviews                    -3.743e-04  5.712e-04  -0.655  0.51277    
## review_scores_accuracy                1.052e-01  1.021e-01   1.030  0.30366    
## review_scores_cleanliness            -2.355e-02  8.180e-02  -0.288  0.77365    
## review_scores_checkin                -2.507e-02  1.207e-01  -0.208  0.83564    
## review_scores_communication           2.807e-02  1.029e-01   0.273  0.78518    
## review_scores_location               -5.063e-02  1.062e-01  -0.477  0.63386    
## instant_bookableTRUE                  5.675e-02  6.121e-02   0.927  0.35455    
## Kitchen1                             -7.343e-02  1.563e-01  -0.470  0.63889    
## Wifi1                                 1.685e-01  1.066e-01   1.581  0.11485    
## Air_conditioning1                    -5.882e-02  6.108e-02  -0.963  0.33638    
## Elevator1                            -1.466e-01  6.251e-02  -2.346  0.01965 *  
## Dishes_and_silverware1                1.229e-01  9.498e-02   1.294  0.19671    
## Washer1                               1.049e-01  7.284e-02   1.440  0.15099    
## Body_soap1                           -3.179e-02  6.229e-02  -0.510  0.61016    
## Microwave1                            3.363e-02  6.928e-02   0.485  0.62771    
## Paid_parking_off_premises1            4.244e-02  6.298e-02   0.674  0.50097    
## total_amenities                              NA         NA      NA       NA    
## short_term_availability1              1.260e-01  5.772e-02   2.183  0.02983 *  
## long_term_availability1               9.220e-03  5.513e-02   0.167  0.86731    
## years                                -7.816e-03  8.547e-03  -0.914  0.36119    
## host_response_timewithin a day        1.443e-01  4.051e-01   0.356  0.72198    
## host_response_timewithin a few hours  5.390e-01  4.565e-01   1.181  0.23865    
## host_response_timewithin an hour      5.254e-01  4.642e-01   1.132  0.25861    
## host_response_rate                   -2.829e-01  3.978e-01  -0.711  0.47748    
## host_acceptance_rate                 -4.690e-01  1.729e-01  -2.713  0.00705 ** 
## host_is_superhostTRUE                 8.149e-02  6.476e-02   1.258  0.20924    
## host_identity_verifiedTRUE           -1.882e-02  9.441e-02  -0.199  0.84210    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4635 on 300 degrees of freedom
## Multiple R-squared:  0.5071, Adjusted R-squared:  0.4479 
## F-statistic: 8.573 on 36 and 300 DF,  p-value: < 2.2e-16
  1. Performing stepwise regression.
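The code that creates the `MLR.step` object used in the next step does not appear here. As a hedged sketch only, stepwise selection in base R is commonly done with `step()`, which adds or drops predictors based on AIC. The example below assumes backward elimination and uses the built-in mtcars data as a stand-in for train.df; both choices are assumptions, not the original procedure.

```r
# Stand-in data: mtcars ships with R; the analysis itself would use train.df
full_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination: drop one predictor at a time while AIC improves
step_model <- step(full_model, direction = "backward", trace = 0)

# Predictors kept by the procedure
print(formula(step_model))

# The reduced model does not have a higher AIC than the full model
print(c(full = AIC(full_model), stepwise = AIC(step_model)))
```

Under the same assumption, the analogous call on the regression above would take MLR_all as its starting model.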

  2. Assess the accuracy of the model against both the training set and the validation set

# Accuracy against training dataset
pred_tm <- predict(MLR.step, train.df)
accuracy(pred_tm, train.df$price)
##                   ME      RMSE       MAE        MPE     MAPE
## Test set 9.85429e-15 0.4452499 0.3171445 -0.2505931 3.573375
# Accuracy against validation dataset
pred_vm <- predict(MLR.step, valid.df)
accuracy(pred_vm, valid.df$price)
##                   ME      RMSE       MAE        MPE     MAPE
## Test set -0.01310809 0.3997252 0.2954494 -0.3040757 3.337518
# Relative RMSE gap, (training - validation) / validation, using the RMSE values reported above
RMSE_gap <- (0.4452499-0.3997252)/0.3997252
print(RMSE_gap)
## [1] 0.11389
# Relative MAE gap, (training - validation) / validation, using the MAE values reported above
MAE_gap <- (0.3171445-0.2954494)/0.2954494
print(MAE_gap)
## [1] 0.07343085
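Because price was log-transformed before modeling, the error measures above are in log units. The sketch below shows, on made-up numbers, how RMSE and MAE are computed, how a relative gap between two error values can be expressed, and how `exp()` returns predictions to the original price scale. The vectors and the `relative_gap()` helper are hypothetical.

```r
# Hypothetical actual and predicted prices on the log scale
actual_log <- log(c(100, 200, 400))
pred_log   <- log(c(110, 180, 400))

errors <- actual_log - pred_log

rmse <- sqrt(mean(errors^2))   # root mean squared error
mae  <- mean(abs(errors))      # mean absolute error

# Relative gap between two error values, expressed as a share of the second
relative_gap <- function(err_a, err_b) (err_a - err_b) / err_b

# exp() returns predictions to the original price scale
pred_price <- exp(pred_log)

print(c(RMSE = rmse, MAE = mae))
print(pred_price)
```

A small relative gap between training and validation error suggests the model is not badly overfitted to the training data.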

Step III: Classification

Classification Part I: K Nearest Neighbors

The KNN predictive model was constructed using the following steps to predict whether a rental property in Monserrat has Kitchen amenities.

  1. Picking the third observation of rental property in Monserrat Neighborhood and removing its amenities information for test observation.
rental <- data_new[3, ] %>%
  select(price, accommodates, bedrooms, beds, bathrooms, minimum_nights,
         maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy,
         review_scores_cleanliness, review_scores_checkin, review_scores_communication,
         review_scores_location, review_scores_value, years,
         host_response_rate, host_acceptance_rate)
  1. Building a new numeric dataframe for the KNN model. The numeric predictors chosen were price, accommodates, bedrooms, beds, bathrooms, minimum_nights, maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, years, host_response_rate, and host_acceptance_rate, in line with the variables the team selected previously. Only numeric predictors were used because KNN relies on distances between observations.
knn_var <- data_new %>%
  select(price, accommodates, bedrooms, beds, bathrooms, minimum_nights,
         maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy,
         review_scores_cleanliness, review_scores_checkin, review_scores_communication,
         review_scores_location, review_scores_value, years,
         host_response_rate, host_acceptance_rate, Kitchen, id)
  1. Partitioning Dataset into Training and Validation set
# Setting seed for reproducibility
set.seed(250)

# Random sampling the dataset index without replacement with 60% for training set
train_index_knn <- sample(c(1:nrow(knn_var)), nrow(knn_var)*0.6) 

# Partition the dataset into training and validation set based on the index sampling
train_df_knn <- knn_var[train_index_knn, ]
valid_df_knn <- knn_var[-train_index_knn, ]
  1. Normalizing the Dataset

Normalization was done due to the different scale for each predictor variable

# Initializing normalized training, validation data, complete dataframe to originals
train_norm_df_knn <- train_df_knn
valid_norm_df_knn <- valid_df_knn
knn_var_norm<- knn_var

# Using preProcess () from the caret package to normalize predictor variables
norm_values_knn <- preProcess(train_df_knn[,1:18], method=c("center", "scale"))
train_norm_df_knn[,1:18] <- predict(norm_values_knn, train_df_knn[,1:18])
valid_norm_df_knn[,1:18] <- predict(norm_values_knn, valid_df_knn[,1:18])
knn_var_norm[,1:18] <- predict(norm_values_knn, knn_var[,1:18])

# Normalizing rental dataframe
rental_norm <- predict(norm_values_knn, rental)
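The key point of this step is that the centering and scaling values are learned from the training set only and then reused for every other data frame. The sketch below reproduces that idea in base R on toy data. The `train` and `valid` data frames are hypothetical, and `scale()` is used here in place of caret's preProcess() purely for illustration.

```r
# Toy predictors on very different scales
train <- data.frame(price = c(100, 200, 300, 400), beds = c(1, 2, 2, 3))
valid <- data.frame(price = c(250, 500), beds = c(2, 4))

# Learn centering and scaling values from the training rows only
mu    <- sapply(train, mean)
sigma <- sapply(train, sd)

# Apply the same training-based values to both sets
train_z <- scale(train, center = mu, scale = sigma)
valid_z <- scale(valid, center = mu, scale = sigma)

print(round(colMeans(train_z), 10))  # training columns are centered at 0
print(valid_z)                       # validation columns generally are not
```

Reusing the training values keeps information from the validation rows out of the model and puts a new observation, such as the single rental above, on the same scale as the training data.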
  1. Building KNN Predictive Model with arbitrary k=7
# Creating knn model to predict whether rental has kitchen amenities
rental_nn <- knn(train=train_norm_df_knn[,1:18], test=rental_norm, cl=train_norm_df_knn$Kitchen, k=7)

# Checking the summary of the knn model prediction, including the 7 nearest neighbors index and distance
attributes(rental_nn)
## $levels
## [1] "1"
## 
## $class
## [1] "factor"
## 
## $nn.index
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  174   18  113  270  104   32    4
## 
## $nn.dist
##      [,1]     [,2]     [,3]    [,4]     [,5]     [,6]     [,7]
## [1,]    0 1.519194 1.690486 1.69414 1.707484 1.881108 1.909966
  1. Determining optimal value for k
# Initialize a data frame with two columns: k, and accuracy
accuracy_df_knn <- data.frame(k=seq(1,14,1), accuracy=rep(0,14))

# Compute knn for different k on validation
for(i in 1:14){
  knn.pred <- knn(train_norm_df_knn[,1:18], valid_norm_df_knn[,1:18], 
                  cl = train_norm_df_knn$Kitchen, k=i)
  accuracy_df_knn[i,2] <- confusionMatrix(knn.pred, valid_norm_df_knn$Kitchen)$overall[1] %>% round(3)
}
accuracy_df_knn
##     k accuracy
## 1   1    0.951
## 2   2    0.938
## 3   3    0.960
## 4   4    0.965
## 5   5    0.951
## 6   6    0.951
## 7   7    0.947
## 8   8    0.947
## 9   9    0.947
## 10 10    0.947
## 11 11    0.947
## 12 12    0.947
## 13 13    0.947
## 14 14    0.947
  1. Building KNN Predictive Model with optimum k=4

The optimum k=4 was chosen because it gave the highest accuracy (0.965) when the model was tested against the validation set.

# Creating knn model to predict whether rental has kitchen amenities
rental_nn <- knn(train=train_norm_df_knn[,1:18], test=rental_norm, cl=train_norm_df_knn$Kitchen, k=4)

# Checking the summary of the knn model prediction, including the 4 nearest neighbors index and distance
attributes(rental_nn)
## $levels
## [1] "1"
## 
## $class
## [1] "factor"
## 
## $nn.index
##      [,1] [,2] [,3] [,4]
## [1,]  174   18  113  270
## 
## $nn.dist
##      [,1]     [,2]     [,3]    [,4]
## [1,]    0 1.519194 1.690486 1.69414
  1. Checking with actual data
data_new[3,24]
## # A tibble: 1 × 1
##   Kitchen
##   <fct>  
## 1 1

Summary

To predict whether a rental property in the Monserrat neighborhood has kitchen amenities, a KNN predictive model was constructed in a series of steps. First, the third observation in the Monserrat data was selected, and its amenities information was removed to create a test observation. Next, a numeric dataframe was built for the KNN model using price, accommodates, bedrooms, beds, bathrooms, minimum_nights, maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, years, host_response_rate, and host_acceptance_rate as predictors. Only numeric predictors were chosen because KNN relies on distances between observations.

The dataset was then partitioned into training and validation sets, and the predictors were normalized to account for their different scales. An initial KNN model was built with an arbitrary k value of 7.

To refine the model, accuracy on the validation set was computed for k from 1 to 14, and the value with the highest accuracy, k=4, was selected. The final model with k=4 predicted that the third observation has Kitchen amenities, which matches the actual value in the data. Note that the nearest neighbor is at distance 0, which suggests that the test observation itself (or an identical listing) is part of the training set, so this single check is an optimistic one.
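The procedure summarized above rests on two operations: measuring the distance from a new observation to every training row, and taking a majority vote among the k closest rows. The sketch below implements both in base R on toy data. The `knn_predict()` helper and the toy matrices are hypothetical and only illustrate the idea; the analysis itself uses the knn() function.

```r
# Toy, already-normalized training data with a binary label
train_x <- matrix(c(0.0, 0.1,
                    0.2, 0.0,
                    0.1, 0.3,
                    2.0, 2.1,
                    2.2, 1.9),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("1", "1", "1", "0", "0"))

knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[seq_len(k)]
  # Majority vote among the k nearest labels
  votes <- table(train_y[nearest])
  list(class = names(votes)[which.max(votes)],
       index = nearest,
       dist  = dists[nearest])
}

result <- knn_predict(train_x, train_y, new_x = c(0.1, 0.1), k = 3)
print(result)
```

The returned index and dist elements play the same role as the nn.index and nn.dist attributes shown in the output above.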

Classification Part II: Naive Bayes

The Naive Bayes model was built through the following steps.

  1. Create a new dataset with variable of focus for Naive Bayes modeling
# Importing data
merged2 <- data_new

# Create a vector of column names to keep 
keep_vars <- c("property_type", "room_type", "accommodates", "bedrooms", "beds", "price", "minimum_nights", "maximum_nights", "number_of_reviews", "instant_bookable", "bathrooms", "shared_bathroom", "Kitchen", "Wifi", "Air_conditioning", "Elevator", "Dishes_and_silverware", "Washer", "Body_soap", "Microwave", "Paid_parking_off_premises", "host_response_rate", "host_identity_verified", "short_term_availability", "long_term_availability","review_scores_rating")

# Subset the merged2 dataframe to keep only the selected columns
merged2 <- subset(merged2, select = keep_vars)
  1. Binning the numerical variables into categorical variables using the cut() function, with equal-frequency (quantile) bins where possible
# Binning 'accommodates'
quantiles <- quantile(merged2$accommodates, probs = c(0.5))
breaks <- c(0, quantiles, Inf)
labels <- c("Small", "Large")
merged2$accommodates <- cut(merged2$accommodates, breaks = breaks, labels = labels)
table(merged2$accommodates)
## 
## Small Large 
##   304   259
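# Binning 'bedrooms' (the breaks give the intervals (0,1], (1,2], and (2,Inf],
# so the three labels act as ordered bin names rather than literal counts)
merged2$bedrooms <- cut(merged2$bedrooms, breaks = c(0, 1, 2, Inf), labels = c("1-2", "3-4", "5+"))

# Binning 'beds' (same intervals and bin names as 'bedrooms')
merged2$beds <- cut(merged2$beds, breaks = c(0, 1, 2, Inf), labels = c("1-2", "3-4", "5+"))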

# Binning 'bathrooms'
merged2$bathrooms <- cut(merged2$bathrooms, breaks = c(0, 1, 2, 3, Inf), labels = c("1", "2", "3", "4+"))

# Binning 'minimum_nights'
# A small amount of noise is added once, so the same values are used for the breaks and the cut
min_nights_noisy <- merged2$minimum_nights + runif(nrow(merged2), -0.0001, 0.0001)
merged2$minimum_nights <- cut(min_nights_noisy, 
                              breaks = quantile(min_nights_noisy, probs = seq(0, 1, 0.25)), 
                              labels = c("1", "2", "3", "4+"),
                              include.lowest = TRUE)

# Add small amount of noise to 'maximum_nights'
merged2$maximum_nights <- merged2$maximum_nights + runif(nrow(merged2), -0.0001, 0.0001)
# Binning 'maximum_nights' (include.lowest keeps the minimum value in the first bin)
merged2$maximum_nights <- cut(merged2$maximum_nights, breaks = quantile(merged2$maximum_nights, probs = seq(0, 1, 0.25)), labels = c("1-3", "4-7", "8-14", "15+"), include.lowest = TRUE)
# Binning 'number_of_reviews'
merged2$number_of_reviews <- cut(merged2$number_of_reviews, breaks = quantile(merged2$number_of_reviews, probs = seq(0, 1, 0.25)), labels = c("1-7", "8-23", "24-56", "57+"), include.lowest = TRUE)
# Binning 'host_response_rate'
# jitter() is applied once, so the same values are used for the breaks and the cut
host_response_noisy <- jitter(merged2$host_response_rate)
merged2$host_response_rate <- cut(host_response_noisy, 
                                  breaks = quantile(host_response_noisy, 
                                                    probs = seq(0, 1, 0.25), 
                                                    na.rm = TRUE), 
                                  labels = c("<75%", "75-94%", "95-99%", "100%"),
                                  include.lowest = TRUE)
# Binning 'review_scores_rating'
# Add jitter to the data
merged2$review_scores_rating<- jitter(merged2$review_scores_rating, amount = 0.001)
quantiles <- quantile(merged2$review_scores_rating, probs = seq(0, 1, 0.25), na.rm = TRUE)

if (length(unique(quantiles)) == length(quantiles)) {
  # Bin the data
  merged2$review_scores_rating <- cut(merged2$review_scores_rating,
                                      breaks = quantiles,
                                      labels = c("<80", "80-90", "90-95", "95+"),
                                      include.lowest = TRUE)
} else {
  cat("Quantiles are not unique. Please consider using different probabilities or jitter amount.")
}

# Binning 'Price'
# Calculate the quantiles for equal frequency binning
quantiles <- quantile(merged2$price, probs = seq(0, 1, length.out = 3 + 1), na.rm = TRUE, type = 5)
# Generate labels for the bins
bin_labels <- c("Low", "Medium", "High")
# Bin the data
merged2$price <- cut(merged2$price, breaks = quantiles, labels = bin_labels, include.lowest = TRUE)
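Equal-frequency binning places cut points at quantiles so that each bin holds roughly the same number of observations. The sketch below shows the pattern on simulated values and highlights why `include.lowest = TRUE` matters. The vector `x` and the bin names are hypothetical.

```r
# Hypothetical skewed values standing in for a numeric listing attribute
set.seed(7)
x <- rexp(300, rate = 1 / 50)

# Quartile cut points give four bins with roughly equal counts
breaks <- quantile(x, probs = seq(0, 1, 0.25))

# include.lowest = TRUE keeps the minimum value in the first bin
binned <- cut(x, breaks = breaks, labels = c("Q1", "Q2", "Q3", "Q4"),
              include.lowest = TRUE)
print(table(binned))

# Without include.lowest, the minimum lies on the open left edge and becomes NA
binned_default <- cut(x, breaks = breaks, labels = c("Q1", "Q2", "Q3", "Q4"))
print(sum(is.na(binned_default)))
```

When many values are tied, as with ratings or response rates, quantiles can coincide and cut() will fail with non-unique breaks, which is why a small jitter is applied to some variables in this step.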
  1. Creating Proportional Barplot for Feature Selection to be loaded into Naive Bayes Model
# Select the categorical variables
variables <- c("property_type", "room_type", "accommodates", "bedrooms", "beds", "price", "minimum_nights", "maximum_nights", "number_of_reviews", "bathrooms", "shared_bathroom", "Kitchen", "Wifi", "Air_conditioning", "Elevator", "Dishes_and_silverware", "Washer", "Body_soap", "Microwave", "Paid_parking_off_premises", "host_response_rate", "host_identity_verified", "short_term_availability", "long_term_availability", "review_scores_rating")

# Reshape the dataset
merged2_long <- merged2 %>%
  select(one_of(variables), instant_bookable) %>%
  gather(key = "variable", value = "value", -instant_bookable)
## Warning: attributes are not identical across measure variables; they will be
## dropped
# Create the faceted barplot
p <- ggplot(merged2_long, aes(x = value, fill = instant_bookable)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  facet_wrap(~variable, scales = "free_x", ncol = 5) +
  xlab("Value") +
  ylab("Count") +
  scale_fill_discrete(name = "Instant Bookable")

print(p)

Based on the barplot, the long_term_availability, short_term_availability, Air_conditioning, number_of_reviews, price, minimum_nights, and review_scores_rating variables appear to have little predictive power in a Naive Bayes model, because their distributions are relatively similar for instant-bookable and non-instant-bookable listings. These variables are therefore removed.

  1. Removing Variables with Weak Predictive Power
# List of variables to remove
variables_to_remove <- c("long_term_availability", "short_term_availability", "Air_conditioning", "number_of_reviews", "price", "minimum_nights", "review_scores_rating")

# Remove the variables
merged2 <- merged2 %>%
  select(-one_of(variables_to_remove))
  1. Building the Naive Bayes Prediction Model
# Set the seed for reproducibility
set.seed(42)

# Create a 60-40 split for training and testing sets
train_index <- createDataPartition(merged2$instant_bookable, p = 0.6, list = FALSE)
train_set <- merged2[train_index, ]
test_set <- merged2[-train_index, ]

# Build the Naive Bayes model using naiveBayes() function
nb_model <- naiveBayes(instant_bookable ~ ., data = train_set)

# Summary of the model
print(nb_model)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     FALSE      TRUE 
## 0.5739645 0.4260355 
## 
## Conditional probabilities:
##        property_type
## Y       Entire condo Entire home Entire loft Entire rental unit
##   FALSE  0.139175258 0.005154639 0.041237113        0.551546392
##   TRUE   0.111111111 0.006944444 0.048611111        0.423611111
##        property_type
## Y       Entire serviced apartment Entire vacation home
##   FALSE               0.056701031          0.020618557
##   TRUE                0.083333333          0.013888889
##        property_type
## Y       Private room in bed and breakfast Private room in casa particular
##   FALSE                       0.030927835                     0.010309278
##   TRUE                        0.000000000                     0.000000000
##        property_type
## Y       Private room in condo Private room in guesthouse Private room in home
##   FALSE           0.015463918                0.000000000          0.020618557
##   TRUE            0.055555556                0.000000000          0.000000000
##        property_type
## Y       Private room in hostel Private room in pension
##   FALSE            0.000000000             0.005154639
##   TRUE             0.006944444             0.000000000
##        property_type
## Y       Private room in rental unit Private room in serviced apartment
##   FALSE                 0.077319588                        0.015463918
##   TRUE                  0.104166667                        0.006944444
##        property_type
## Y       Room in hostel Room in hotel Shared room in condo Shared room in hostel
##   FALSE    0.000000000   0.000000000          0.000000000           0.000000000
##   TRUE     0.034722222   0.055555556          0.000000000           0.013888889
##        property_type
## Y       Shared room in hotel Shared room in rental unit
##   FALSE          0.005154639                0.005154639
##   TRUE           0.000000000                0.027777778
##        property_type
## Y       Shared room in serviced apartment
##   FALSE                       0.000000000
##   TRUE                        0.006944444
## 
##        room_type
## Y       Entire home/apt Hotel room Private room Shared room
##   FALSE      0.81443299 0.00000000   0.17525773  0.01030928
##   TRUE       0.68750000 0.03472222   0.22916667  0.04861111
## 
##        accommodates
## Y           Small     Large
##   FALSE 0.5463918 0.4536082
##   TRUE  0.6041667 0.3958333
## 
##        bedrooms
## Y              1-2        3-4         5+
##   FALSE 0.76804124 0.17010309 0.06185567
##   TRUE  0.77083333 0.13888889 0.09027778
## 
##        beds
## Y             1-2       3-4        5+
##   FALSE 0.5051546 0.2371134 0.2577320
##   TRUE  0.5208333 0.2152778 0.2638889
## 
##        maximum_nights
## Y             1-3       4-7      8-14       15+
##   FALSE 0.1917098 0.2746114 0.2383420 0.2953368
##   TRUE  0.3055556 0.2569444 0.2708333 0.1666667
## 
##        bathrooms
## Y                1          2          3         4+
##   FALSE 0.78350515 0.13402062 0.02061856 0.06185567
##   TRUE  0.86111111 0.09027778 0.02777778 0.02083333
## 
##        shared_bathroom
## Y              No       Yes
##   FALSE 0.8402062 0.1597938
##   TRUE  0.7638889 0.2361111
## 
##        Kitchen
## Y                 0           1
##   FALSE 0.005154639 0.994845361
##   TRUE  0.076388889 0.923611111
## 
##        Wifi
## Y                0          1
##   FALSE 0.09793814 0.90206186
##   TRUE  0.08333333 0.91666667
## 
##        Elevator
## Y               0         1
##   FALSE 0.3762887 0.6237113
##   TRUE  0.5277778 0.4722222
## 
##        Dishes_and_silverware
## Y                0          1
##   FALSE 0.08762887 0.91237113
##   TRUE  0.25000000 0.75000000
## 
##        Washer
## Y               0         1
##   FALSE 0.7164948 0.2835052
##   TRUE  0.7222222 0.2777778
## 
##        Body_soap
## Y               0         1
##   FALSE 0.7010309 0.2989691
##   TRUE  0.7638889 0.2361111
## 
##        Microwave
## Y               0         1
##   FALSE 0.2938144 0.7061856
##   TRUE  0.4375000 0.5625000
## 
##        Paid_parking_off_premises
## Y               0         1
##   FALSE 0.6907216 0.3092784
##   TRUE  0.6875000 0.3125000
## 
##        host_response_rate
## Y            <75%    75-94%    95-99%      100%
##   FALSE 0.1907216 0.2938144 0.3350515 0.1804124
##   TRUE  0.3496503 0.2377622 0.2097902 0.2027972
## 
##        host_identity_verified
## Y            FALSE       TRUE
##   FALSE 0.08247423 0.91752577
##   TRUE  0.06944444 0.93055556
# Generate predictions for the test set
predictions <- predict(nb_model, test_set)

# Convert predictions and test_set$instant_bookable to factors
predictions_factor <- factor(predictions, levels = c("FALSE", "TRUE"))
test_set_factor <- factor(test_set$instant_bookable, levels = c("FALSE", "TRUE"))

# Create the confusion matrix
cm <- confusionMatrix(predictions_factor, test_set_factor)

# Print the confusion matrix
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    96   60
##      TRUE     33   36
##                                           
##                Accuracy : 0.5867          
##                  95% CI : (0.5193, 0.6517)
##     No Information Rate : 0.5733          
##     P-Value [Acc > NIR] : 0.369214        
##                                           
##                   Kappa : 0.1236          
##                                           
##  Mcnemar's Test P-Value : 0.007016        
##                                           
##             Sensitivity : 0.7442          
##             Specificity : 0.3750          
##          Pos Pred Value : 0.6154          
##          Neg Pred Value : 0.5217          
##              Prevalence : 0.5733          
##          Detection Rate : 0.4267          
##    Detection Prevalence : 0.6933          
##       Balanced Accuracy : 0.5596          
##                                           
##        'Positive' Class : FALSE           
## 
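
The headline statistics in this output can be recomputed from the four cell counts, which makes it easier to see how close the accuracy is to the no-information rate. The sketch below copies the counts from the confusion matrix above; the object name `cm_counts` is chosen here for illustration only.

```r
# Cell counts copied from the confusion matrix printed above
# (rows = prediction, columns = reference). matrix() fills column by column.
cm_counts <- matrix(c(96, 33, 60, 36), nrow = 2,
                    dimnames = list(Prediction = c("FALSE", "TRUE"),
                                    Reference  = c("FALSE", "TRUE")))

total <- sum(cm_counts)

# Share of correct predictions (diagonal cells).
accuracy <- sum(diag(cm_counts)) / total

# No-information rate: share of the most common reference class.
nir <- max(colSums(cm_counts)) / total

# 'Positive' class is FALSE, so sensitivity is computed on the FALSE column.
sensitivity <- cm_counts["FALSE", "FALSE"] / sum(cm_counts[, "FALSE"])
specificity <- cm_counts["TRUE", "TRUE"] / sum(cm_counts[, "TRUE"])

round(c(accuracy = accuracy, nir = nir,
        sensitivity = sensitivity, specificity = specificity), 4)
```

The accuracy exceeds the no-information rate by only about one percentage point, which is consistent with the reported P-Value [Acc > NIR] of 0.369.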
  1. Predicting for a Fictional Apartment
# Create a data frame for the fictional apartment
kalibataCity <- data.frame(
  property_type = "Entire rental unit",
  room_type = "Entire home/apt",
  accommodates = "Small",
  bedrooms = "1-2",
  beds = "1-2",
  maximum_nights = "4-7",
  bathrooms = "1",
  shared_bathroom = "No",
  Kitchen = "1",
  Wifi = "1",
  Elevator = "0",
  Dishes_and_silverware = "1",
  Washer = "0",
  Body_soap = "1",
  Microwave = "1",
  Paid_parking_off_premises = "1",
  host_response_rate = "95-99%",
  host_identity_verified = "TRUE"
)

# Make the prediction
prediction <- predict(nb_model, kalibataCity)

# Print the prediction result
print(prediction)
## [1] FALSE
## Levels: FALSE TRUE
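
When scoring a hand-built row like this one, it is worth making sure that each column is a factor carrying the same levels as the corresponding training column, so that category values are matched by label rather than by position. A minimal base-R sketch follows; `align_levels` and the toy data frames are hypothetical names used only for illustration, and the sketch assumes the training predictors are stored as factors.

```r
# Hypothetical helper: coerce each shared column of 'newdata' to a factor
# that uses the level set of the matching factor column in 'train'.
align_levels <- function(newdata, train) {
  for (col in intersect(names(newdata), names(train))) {
    if (is.factor(train[[col]])) {
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(train[[col]]))
    }
  }
  newdata
}

# Toy stand-ins for train_set and the fictional apartment.
toy_train <- data.frame(
  bedrooms = factor(c("1-2", "3-4", "5+")),
  Elevator = factor(c("0", "1", "1"))
)
toy_new <- data.frame(bedrooms = "3-4", Elevator = "1")

toy_new_aligned <- align_levels(toy_new, toy_train)
str(toy_new_aligned)

# Values not seen in training become NA and should be reviewed.
anyNA(toy_new_aligned)
```

With the objects from this analysis, the equivalent call would be `predict(nb_model, align_levels(kalibataCity, train_set))`, again assuming the predictors in `train_set` are factors.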

Summary

To build a predictive model, the first step involved data preprocessing and cleaning, where we transformed certain variables into numeric variables and binned numerical variables using equal frequency. Additionally, we converted several variables into factor data types to make them suitable for input in the Naive Bayes model. We also removed some index variables, including names, as they would not be meaningful in the model. Once the data was prepared, we proceeded to the feature selection stage, where we created bar plots for all the remaining variables to evaluate their distribution. If the distribution of a variable was relatively similar, we considered it to have low predictive power and removed it from the model.

After feature selection, we partitioned the data into 60% for training and 40% for testing. The Naive Bayes model was trained on the training set and evaluated on the test set. It achieved an accuracy of 0.5867, only slightly above the no-information rate of 0.5733 (p = 0.369), so the model offers limited improvement over always predicting the majority class. Beyond this evaluation, we created a fictional apartment named “Kalibata City” to try the model in a practical scenario. This apartment had specific attributes such as property type, room type, guest capacity, number of bedrooms and beds, maximum nights, number of bathrooms, shared bathroom status, and various amenities. We passed these details to the trained Naive Bayes model to predict whether the apartment would be instant bookable (TRUE) or not (FALSE).

The model returned a prediction of “FALSE,” indicating that, based on the given features, this specific apartment may not qualify as an instant bookable property.
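
To see where such a prediction comes from, the naive Bayes score can be worked out by hand: for each class, multiply the prior by the conditional probability of every attribute value, then normalize. The sketch below copies the relevant probabilities from the `print(nb_model)` output for the attribute values of “Kalibata City”; the object names are chosen here for illustration, and the calculation is a by-hand check rather than a reproduction of the package internals.

```r
# Prior probabilities copied from the model output.
prior <- c("FALSE" = 0.5739645, "TRUE" = 0.4260355)

# Conditional probabilities P(value | class) for the apartment's values,
# copied from the model output. Columns: class FALSE, class TRUE.
cond <- rbind(
  property_type             = c(0.551546392, 0.423611111),  # Entire rental unit
  room_type                 = c(0.81443299,  0.68750000),   # Entire home/apt
  accommodates              = c(0.5463918,   0.6041667),    # Small
  bedrooms                  = c(0.76804124,  0.77083333),   # 1-2
  beds                      = c(0.5051546,   0.5208333),    # 1-2
  maximum_nights            = c(0.2746114,   0.2569444),    # 4-7
  bathrooms                 = c(0.78350515,  0.86111111),   # 1
  shared_bathroom           = c(0.8402062,   0.7638889),    # No
  Kitchen                   = c(0.994845361, 0.923611111),  # 1
  Wifi                      = c(0.90206186,  0.91666667),   # 1
  Elevator                  = c(0.3762887,   0.5277778),    # 0
  Dishes_and_silverware     = c(0.91237113,  0.75000000),   # 1
  Washer                    = c(0.7164948,   0.7222222),    # 0
  Body_soap                 = c(0.2989691,   0.2361111),    # 1
  Microwave                 = c(0.7061856,   0.5625000),    # 1
  Paid_parking_off_premises = c(0.3092784,   0.3125000),    # 1
  host_response_rate        = c(0.3350515,   0.2097902),    # 95-99%
  host_identity_verified    = c(0.91752577,  0.93055556)    # TRUE
)
colnames(cond) <- c("FALSE", "TRUE")

# Naive Bayes score per class: log prior + sum of log conditionals.
log_score <- log(prior) + colSums(log(cond))

# Normalize to posterior probabilities (subtract the max for stability).
posterior <- exp(log_score - max(log_score))
posterior <- posterior / sum(posterior)
print(round(posterior, 3))
```

Working in logs avoids numerical underflow when many small probabilities are multiplied together.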

Classification Part III: Classification Tree

The classification tree predictive model was built through the following steps:

  1. Preparing the data for Classification Tree Model
# binning rating into three
merged <- data_new %>%
  mutate(rating_bin = ntile(review_scores_rating, 3))
merged$rating_bin <- factor(merged$rating_bin, labels = c("low","medium","high"))
table(merged$rating_bin)
## 
##    low medium   high 
##    188    188    187
# Remove id, name, latitude, longitude, and host_id because index-like columns are irrelevant, and drop review_scores_rating now that it is binned. Prepare the other variables for the tree model input
merged <- select(merged, -c(id, name, latitude, longitude, host_id,review_scores_rating))
merged$host_acceptance_rate[merged$host_acceptance_rate == "N/A"] <- 0
merged$host_acceptance_rate <- as.numeric(gsub("%", "", merged$host_acceptance_rate))
merged$host_response_rate[merged$host_response_rate == "N/A"] <- 0
merged$host_response_rate <- as.numeric(gsub("%", "", merged$host_response_rate))

# Bin property type because it contains many categories; it is grouped into Entire Home, Private Room, and Other
merged <- merged %>%
  mutate(property_type_bin = case_when(
    property_type %in% c("Entire home", "Entire apartment", "Entire condo", "Entire serviced apartment", "Entire villa", "Entire townhouse") ~ "Entire Home",
    property_type %in% c("Private room in rental unit", "Private room in condo", "Private room in home", "Private room in serviced apartment", "Private room in villa", "Private room in townhouse") ~ "Private Room",
    TRUE ~ "Other"
  ))
merged <- select(merged, -property_type)

# Remove the remaining review score columns because they are redundant with review_scores_rating
merged <- subset(merged, select = -c(review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_location, review_scores_value,review_scores_communication))
  1. Building the Classification Tree
# Split the data into training and testing sets
set.seed(123)
train_idx <- sample(nrow(merged), 0.6*nrow(merged))
train_data <- merged[train_idx, ]
test_data <- merged[-train_idx, ]

# Define the control parameters for tree building
ctrl <- rpart.control(minsplit = 20, xval = 10)

# Build the tree with cross-validation
tree_fit <- rpart(rating_bin ~ ., data = train_data, method = "class", control = ctrl)
printcp(tree_fit)
## 
## Classification tree:
## rpart(formula = rating_bin ~ ., data = train_data, method = "class", 
##     control = ctrl)
## 
## Variables actually used in tree construction:
##  [1] bedrooms                  host_acceptance_rate     
##  [3] host_is_superhost         Microwave                
##  [5] number_of_reviews         Paid_parking_off_premises
##  [7] price                     room_type                
##  [9] short_term_availability   years                    
## 
## Root node error: 223/337 = 0.66172
## 
## n= 337 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.295964      0   1.00000 1.10762 0.036421
## 2 0.071749      1   0.70404 0.71749 0.041108
## 3 0.017937      2   0.63229 0.68161 0.040963
## 4 0.015695      5   0.57848 0.72197 0.041120
## 5 0.013453      7   0.54709 0.72197 0.041120
## 6 0.011211     10   0.50673 0.73094 0.041139
## 7 0.010000     12   0.48430 0.73094 0.041139
# Determine the optimal CP value
optimal_cp <- tree_fit$cptable[which.min(tree_fit$cptable[,"xerror"]),"CP"]
optimal_cp
## [1] 0.01793722
# Prune the tree with the optimal CP value
pruned_tree_fit <- prune(tree_fit, cp = optimal_cp)


# Plot the pruned tree
rpart.plot(pruned_tree_fit, box.palette = "Greens")

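# Predict on test data and build the confusion matrix.
# Note: confusionMatrix() expects (predicted, actual). The arguments here are
# passed as (actual, predicted), so the rows labelled "Prediction" below hold
# the actual classes. Overall accuracy is unaffected, but per-class statistics
# such as sensitivity and positive predictive value swap roles.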
test_pred <- predict(pruned_tree_fit, test_data, type = "class")
confusionMatrix(test_data$rating_bin, test_pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low medium high
##     low     42      2   31
##     medium  33     44    1
##     high    12      8   53
## 
## Overall Statistics
##                                           
##                Accuracy : 0.615           
##                  95% CI : (0.5482, 0.6788)
##     No Information Rate : 0.385           
##     P-Value [Acc > NIR] : 2.319e-12       
##                                           
##                   Kappa : 0.424           
##                                           
##  Mcnemar's Test P-Value : 5.656e-09       
## 
## Statistics by Class:
## 
##                      Class: low Class: medium Class: high
## Sensitivity              0.4828        0.8148      0.6235
## Specificity              0.7626        0.8023      0.8582
## Pos Pred Value           0.5600        0.5641      0.7260
## Neg Pred Value           0.7020        0.9324      0.7908
## Prevalence               0.3850        0.2389      0.3761
## Detection Rate           0.1858        0.1947      0.2345
## Detection Prevalence     0.3319        0.3451      0.3230
## Balanced Accuracy        0.6227        0.8086      0.7408
table(test_data$rating_bin, test_pred)
##         test_pred
##          low medium high
##   low     42      2   31
##   medium  33     44    1
##   high    12      8   53

Summary

In developing a classification tree model to predict Airbnb listing ratings, various features were evaluated for their potential influence on the ratings. The dataset contained attributes such as host acceptance rate, host response rate, and property type, among others. These features were considered relevant because they could affect guests’ experiences and, in turn, their ratings. To facilitate model building, cleaning and preprocessing steps were carried out, including converting percentages to numeric values, removing index variables, and grouping property types into broader categories.

During the exploration of different models, an interesting observation was the trade-off between the number of bins and model accuracy. Increasing the number of bins could reduce accuracy due to overfitting and data imbalance. To address this issue, the ratings were divided into three equal-frequency bins: low, medium, and high. The resulting bins are nearly but not exactly balanced (188, 188, and 187 listings); even a slight imbalance can affect how well the model generalizes to unseen data, so this choice may have influenced performance.
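
Equal-frequency binning, as used for the rating bins, assigns roughly the same number of listings to each bin. The base-R sketch below illustrates the idea on simulated ratings. The vector `toy_rating` is made up, and the rank-based rule is an approximation of what `ntile()` does rather than its exact implementation, so the bin that receives the leftover observation may differ.

```r
set.seed(1)

# Simulated review scores for 563 listings (made-up values).
toy_rating <- round(runif(563, min = 3, max = 5), 2)

# Rank the values, breaking ties by position, then split the ranks
# into three consecutive groups of near-equal size.
ranks <- rank(toy_rating, ties.method = "first")
bin_id <- ceiling(3 * ranks / length(toy_rating))
rating_bin <- factor(bin_id, levels = 1:3,
                     labels = c("low", "medium", "high"))

# Bin sizes differ by at most one.
table(rating_bin)
```

Because ties are broken by position, listings with identical ratings can land in adjacent bins, which is the price paid for equal bin sizes.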

The final model was determined through a systematic process involving data splitting, tree building with cross-validation, and pruning based on the optimal CP value. The optimal CP value was 0.01793722, the value with the lowest cross-validated error, which guided pruning toward a balance between tree complexity and classification error. The model’s performance was evaluated using a confusion matrix; the overall accuracy was 0.615, compared with a no-information rate of 0.385, indicating reasonable performance for a classification problem with three categories.
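
The pruning choice and the reported accuracy can both be checked from the printed output. The sketch below copies the CP table from `printcp(tree_fit)` and the diagonal of the test confusion matrix; the object names `cp_table` and `test_accuracy` are chosen here for illustration.

```r
# CP table copied from the printcp(tree_fit) output.
cp_table <- data.frame(
  CP     = c(0.295964, 0.071749, 0.017937, 0.015695,
             0.013453, 0.011211, 0.010000),
  nsplit = c(0, 1, 2, 5, 7, 10, 12),
  xerror = c(1.10762, 0.71749, 0.68161, 0.72197,
             0.72197, 0.73094, 0.73094)
)

# Pick the row with the lowest cross-validated error.
best_row <- which.min(cp_table$xerror)
cp_table[best_row, ]

# Test accuracy: correctly classified listings (diagonal cells)
# divided by the total number of test listings.
test_total <- 42 + 2 + 31 + 33 + 44 + 1 + 12 + 8 + 53
test_accuracy <- (42 + 44 + 53) / test_total
round(test_accuracy, 4)
```

Selecting the minimum `xerror` is the rule used in this analysis; other selection rules exist and could yield a smaller tree.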

Step IV: Clustering

  1. Process for Variable Selection & Model Building

First, k-means clustering was chosen over hierarchical clustering because of its computational efficiency on a dataset of 563 observations and 41 variables.

Second, because k-means operates on numeric data, only numeric variables are passed to the model; fields such as name, latitude and longitude, and host_response_time are dropped. Values that can be expressed numerically, such as host_acceptance_rate and host_response_rate, are converted to numeric type after data manipulation.

Third, an elbow chart is created to show how the total within-cluster sum of squares changes with the number of clusters. Because the chart shows no clear kink, the cluster centers for several values of k were inspected manually. In that inspection, solutions with k of 4 or more did not yield clusters that were easy to tell apart or interpret. Hence, k=3 was chosen as the number of clusters.

  1. Preparing data for clustering analysis
cluster <- as.data.frame(data_new)
row.names(cluster) <- cluster[,1]
cluster <- cluster[,-1]

#Select numeric variables only
num_var <- cluster %>% select(price, accommodates, bedrooms, beds, bathrooms, minimum_nights, maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, years, host_response_rate, host_acceptance_rate)

#Change from string type to numeric type by stripping the percent sign
num_var <- num_var %>% 
  mutate(host_response_rate = as.numeric(gsub("%", "", host_response_rate)),
         host_acceptance_rate = as.numeric(gsub("%", "", host_acceptance_rate)))

#Normalize the data
num_var.norm <- sapply(num_var, scale)
row.names(num_var.norm) <- row.names(num_var)
  1. Building Elbow Plot for Initial Analysis on Number of Clusters
#Create an elbow chart 
set.seed(699)
kmax <- 30
wss <- sapply(1:kmax, 
              function(k){kmeans(num_var.norm, k, nstart=50, iter.max = 30)$tot.withinss})
wss
##  [1] 10116.000  8429.449  7616.664  7076.054  6628.485  6125.637  5818.640
##  [8]  5372.863  5075.328  4832.822  4670.326  4472.784  4334.985  4206.636
## [15]  4079.128  3973.216  3835.034  3743.977  3649.898  3579.884  3472.252
## [22]  3375.344  3297.592  3244.817  3169.741  3123.603  3011.214  2981.595
## [29]  2919.982  2855.981
plot(1:kmax, wss, type = "b", pch = 20, frame = FALSE, xlab = "Number of Clusters K", ylab = "Total Within-Clusters Sum of Squares")

Based on the elbow plot, it is reasonable to start iterating from k=3 to k=7. After running those iterations in a separate script, we found that k=3 splits the data best and gives the most easily understood output. We therefore chose k=3 for this cluster model.
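
A quick sanity check on the elbow values: when every column has been standardized with `scale()`, the total within-cluster sum of squares at k = 1 equals (n - 1) × p, where n is the number of rows and p the number of columns. With 563 listings and 18 variables this gives 562 × 18 = 10116, which matches the first value of `wss` above. The sketch below verifies the identity on simulated data; the matrix `toy` is made up for illustration.

```r
set.seed(699)

n <- 200
p <- 5
toy <- matrix(rnorm(n * p), nrow = n)  # made-up data
toy_norm <- scale(toy)                 # mean 0, sd 1 per column

# With a single cluster, the centroid is the column mean (zero here),
# so tot.withinss is the sum of all squared standardized values.
wss_k1 <- kmeans(toy_norm, centers = 1)$tot.withinss
c(wss_k1 = wss_k1, expected = (n - 1) * p)

# The same arithmetic for the dimensions used in this analysis.
(563 - 1) * 18
```

A first `wss` value that departs from (n - 1) × p would suggest that some columns were not standardized or that missing values were present.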

  1. Building k-means clustering model
#Kmeans clustering with k=3
km3 <- kmeans(num_var.norm, 3, nstart=50)
km3$centers
##         price accommodates    bedrooms        beds  bathrooms minimum_nights
## 1 -0.05216086   -0.0579185 -0.07405650 -0.07866686 -0.1192584    -0.01767926
## 2  3.43436992    4.0524905  3.89702259  4.12784817  4.3102872    -0.01684691
## 3 -0.13807876   -0.1906681 -0.04307181 -0.04391231  0.2377260     0.13771294
##   maximum_nights number_of_reviews review_scores_rating review_scores_accuracy
## 1   -0.005842408        0.05900992            0.2703180              0.2762133
## 2    0.780449770       -0.14880293           -0.2114711             -0.1003965
## 3   -0.077305625       -0.42762224           -2.0323565             -2.0947554
##   review_scores_cleanliness review_scores_checkin review_scores_communication
## 1                 0.2411751            0.21885475                   0.2046853
## 2                -0.1389053            0.02615551                   0.2839105
## 3                -1.8210242           -1.67627386                  -1.6082843
##   review_scores_location review_scores_value       years host_response_rate
## 1              0.1691286          0.24683558  0.02136485         0.03503417
## 2              0.2449943          0.08942202  0.23236188         0.09035211
## 3             -1.3305285         -1.89995029 -0.19954732        -0.28180045
##   host_acceptance_rate
## 1           0.06116284
## 2          -0.91962408
## 3          -0.32363103
cluster3 <- km3$cluster

# Checking the cluster distance
dist(km3$centers)
##           1         2
## 2  9.186936          
## 3  5.443384 10.239104
# Binding the cluster label back to original data
num_var.norm <- cbind(num_var.norm, cluster3)
cluster <- cbind(cluster, cluster3)
head(cluster)
##                                                      name  latitude longitude
## 16695                           DUPLEX LOFT 2 - SAN TELMO -34.61439 -58.37611
## 148284                Sunny Terrace Apart. in Downtown BA -34.61331 -58.38491
## 23798   STUNNING-LIGHT-SPACIOUS LOFT STYLE APT- SAN TELMO -34.61266 -58.37479
## 31514                     BEAUTY DUPLEX LOFT #4 SAN TELMO -34.61494 -58.37517
## 42450  French Classic in San Telmo: Balcony over Defensa! -34.61578 -58.37175
## 362556                   Sunny Terrace Apart in Center BA -34.61554 -58.38486
##             property_type       room_type price accommodates bedrooms beds
## 16695         Entire loft Entire home/apt 10354            4        1    1
## 148284 Entire rental unit Entire home/apt  6833            4        1    1
## 23798        Entire condo Entire home/apt 14501            3        2    3
## 31514         Entire loft Entire home/apt  9347            5        1    4
## 42450        Entire condo Entire home/apt 16152            4        2    2
## 362556 Entire rental unit Entire home/apt 10354            4        1    2
##        bathrooms shared_bathroom minimum_nights maximum_nights
## 16695          1              No              2           1125
## 148284         1              No              2           1125
## 23798          1              No              3            180
## 31514          1              No              2            365
## 42450          2              No              4           1125
## 362556         1              No              2            365
##        number_of_reviews review_scores_rating review_scores_accuracy
## 16695                 46                 4.28                   4.59
## 148284               273                 4.72                   4.68
## 23798                 58                 4.89                   4.90
## 31514                 35                 4.26                   4.44
## 42450                151                 4.81                   4.90
## 362556               145                 4.79                   4.79
##        review_scores_cleanliness review_scores_checkin
## 16695                       4.29                  4.83
## 148284                      4.59                  4.86
## 23798                       4.88                  4.91
## 31514                       4.03                  4.76
## 42450                       4.73                  4.92
## 362556                      4.76                  4.96
##        review_scores_communication review_scores_location review_scores_value
## 16695                         4.80                   4.39                4.41
## 148284                        4.83                   4.62                4.72
## 23798                         5.00                   4.88                4.86
## 31514                         4.67                   4.33                4.39
## 42450                         4.96                   4.94                4.83
## 362556                        4.89                   4.73                4.66
##        instant_bookable Kitchen Wifi Air_conditioning Elevator
## 16695              TRUE       1    1                1        0
## 148284             TRUE       1    1                0        1
## 23798             FALSE       1    1                0        0
## 31514              TRUE       1    1                1        0
## 42450             FALSE       1    1                1        1
## 362556             TRUE       1    1                0        1
##        Dishes_and_silverware Washer Body_soap Microwave
## 16695                      1      0         0         1
## 148284                     1      0         0         1
## 23798                      1      0         1         1
## 31514                      1      0         0         0
## 42450                      1      1         0         1
## 362556                     1      0         0         1
##        Paid_parking_off_premises total_amenities short_term_availability
## 16695                          1               6                       0
## 148284                         1               6                       0
## 23798                          1               6                       0
## 31514                          1               5                       0
## 42450                          1               8                       0
## 362556                         1               6                       1
##        long_term_availability    years host_id host_response_time
## 16695                       0 13.37808   64880     within an hour
## 148284                      0 12.20000  407702       within a day
## 23798                       0 12.20000  408551     within an hour
## 31514                       1 13.37808   64880     within an hour
## 42450                       0 12.77260  185437     within an hour
## 362556                      1 12.20000  407702       within a day
##        host_response_rate host_acceptance_rate host_is_superhost
## 16695                 1.0                 1.00             FALSE
## 148284                0.9                 0.83             FALSE
## 23798                 1.0                 1.00              TRUE
## 31514                 1.0                 1.00             FALSE
## 42450                 1.0                 1.00             FALSE
## 362556                0.9                 0.83             FALSE
##        host_identity_verified cluster3
## 16695                    TRUE        1
## 148284                   TRUE        1
## 23798                    TRUE        1
## 31514                    TRUE        1
## 42450                    TRUE        1
## 362556                   TRUE        1
  1. Naming the Clusters

In Cluster 1, the number of reviews and the review scores are generally the highest among the three clusters, suggesting that a larger volume of reviews goes together with listings whose quality matches their description.

In Cluster 2, the price and the numbers of guests accommodated, bedrooms, beds, and bathrooms are the highest. This indicates that the listings in Cluster 2 may be whole homes designed for a group of friends or a family trip.

In Cluster 3, the price is the lowest, and the number of reviews and the review scores across the board are the worst.
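
The reading of the centroids above can be reproduced mechanically by asking, for each variable, which cluster has the highest and the lowest standardized center. The sketch below copies four columns from the `km3$centers` output; the object names are chosen here for illustration.

```r
# A few standardized centroid values copied from km3$centers
# (rows = clusters 1 to 3).
centers <- matrix(
  c(-0.05216086, -0.07405650,  0.05900992,  0.2703180,
     3.43436992,  3.89702259, -0.14880293, -0.2114711,
    -0.13807876, -0.04307181, -0.42762224, -2.0323565),
  nrow = 3, byrow = TRUE,
  dimnames = list(cluster = c("1", "2", "3"),
                  variable = c("price", "bedrooms",
                               "number_of_reviews",
                               "review_scores_rating"))
)

# For each variable, the cluster with the highest / lowest centroid.
highest <- apply(centers, 2, which.max)
lowest  <- apply(centers, 2, which.min)
rbind(highest, lowest)
```

Because the values are standardized, a center near zero means the cluster sits close to the overall average for that variable.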

  1. Visualizing the Cluster: Line Plot
dev.new(width = 12, height = 50)

# Plot the data with x-axis labels
plot(c(0), xaxt = 'n', ylab = "", type = "l", xlab = "", main = "Profile Plot of Centroids",
     ylim = c(min(km3$centers), max(km3$centers)), xlim = c(0,18))

axis(1, at = c(1:18), labels = names(num_var), las = 2, cex.axis = 0.6)

lines(km3$centers[1,], lty = 1, lwd = 2, col = "red")
lines(km3$centers[2,], lty = 2, lwd = 2, col = "blue")
lines(km3$centers[3,], lty = 3, lwd = 2, col = "green")

clusters <- c("Well-Reviewed & Steady", "Big Vacay", "Cheap & Shady")
# One x position per label (three labels, three y positions)
text(x = rep(0.5, 3)+1.8, y = c(km3$centers[1,1]+0.5, km3$centers[2,1], km3$centers[3,1]-0.3), 
     labels = clusters)
mtext("Index", side = 1, line = 10, cex = 0.8)

Description:

The line plot above shows the cluster centroids across each variable. In line with the previous analysis, Cluster “Big Vacay” has distinctly higher price, guest capacity, bedrooms, beds, and bathrooms; Cluster “Cheap & Shady” has the lowest review counts and scores across the board; and Cluster “Well-Reviewed & Steady” stays close to zero on every variable, reflecting its consistent, near-average profile.

  1. Visualizing Cluster: Scatter Plot
dev.new(width = 15, height = 50)
cluster$cluster_label <- ifelse(cluster$cluster3 == 1, clusters[1],
                                  ifelse(cluster$cluster3 == 2, clusters[2], clusters[3]))

cluster$cluster_label <- cluster$cluster_label %>% as.factor()

discretionary <- cluster %>% group_by(cluster_label) %>%
  summarize(mean_price = mean(price), 
            mean_review_scores_rating = mean(review_scores_rating))

ggplot(data = discretionary, aes(x = mean_price, y = mean_review_scores_rating, color = factor(cluster_label))) + 
  geom_point(size = 4) +
  scale_color_manual(values = c("purple", "orange", "green")) +
  theme_classic() +
  labs(x = "Average Price", y = "Average Review Scores Rating", color = "Clusters", title = "Comparison between Average Price and Average Review Scores Rating") +
  geom_text(aes(label = cluster_label),
            hjust = 0.1, vjust = 2, size = 3) +
  scale_y_continuous(limits = c(0, max(discretionary$mean_review_scores_rating) + 1))

Description:

The scatter plot above shows the average price against the average review scores rating for each cluster. The cluster averages show neither a clearly positive nor a clearly negative relationship between the two: Cheap & Shady and Well-Reviewed & Steady have very similar average prices but contrasting average ratings, and although Big Vacay has a much higher average price, its average rating is not correspondingly higher.

  1. Visualizing Cluster: Countplot
ggplot(cluster, aes(x = room_type, fill = cluster_label)) + geom_bar(position = "dodge") +
  labs(x = "Room Type", y = "Count", fill = "Cluster", title = "Countplot of Cluster Per Room Type") +theme(plot.title = element_text(hjust = 0.5))

Description:

The count plot above shows the number of listings in each cluster per room type. Listings in Cluster “Well-Reviewed & Steady” are predominantly of the entire home/apartment type, and Cluster “Big Vacay” has no listings in the hotel room and shared room types.

Step V: Conclusions

The data mining analysis output is a valuable asset for both property owners and prospective tenants in Monserrat. It provides both groups with data-driven insights to make informed decisions about renting and owning property.

For property owners, the data mining analysis output can help improve the service they offer by identifying the features that prospective tenants value the most. By analyzing historical data, the analysis can identify the most sought-after features in a rental property such as location, amenities, and condition. Property owners can use this information to improve their rental offerings and attract more tenants. Additionally, the analysis can provide insights into rental prices and help property owners set prices that match the market and prospective tenants’ expectations.

For prospective tenants, the data mining analysis output can help them easily choose rental properties that match their needs. The clustering model can help tenants identify properties that meet their specific requirements based on location, size, amenities, and other factors. This can save tenants time and effort by narrowing down the available options and selecting only the most suitable properties. Additionally, the analysis can help tenants negotiate better prices by providing insights into the market value of specific rental properties. Finally, by analyzing the features of rental properties, prospective tenants can predict the level of service they can expect from their landlords and make informed decisions about which properties to rent.

In conclusion, the data mining analysis output is an invaluable asset for both property owners and prospective tenants in Monserrat. By providing insights into rental prices, rental features, and service levels, the analysis can help both parties achieve their goals and make data-driven decisions. Ultimately, this can lead to a more efficient and effective rental market that benefits everyone involved.