This GitHub repository contains a data mining analysis of Airbnb rental properties in Monserrat, Buenos Aires, Argentina. The analysis delivers two key outputs to support decision-making for property owners and prospective tenants:
1. Property Descriptive Analytics:
2. Property Predictive Analytics:
These comprehensive data mining analysis results serve as valuable assets for property owners and prospective tenants in Monserrat. By leveraging data-driven insights, users can make informed decisions and enhance their ability to choose suitable rental properties.
Explore this repository to access the analysis code, datasets, and detailed documentation. Make data-driven choices for your property investments or find the perfect Airbnb rental in Monserrat with confidence.
library(tidyverse)
library(readr)
library(naniar)
library(jsonlite)
library(tidyr)
library(tidytext)
library(wordcloud)
library(leaflet)
library(scales)
library(ggbeeswarm)
library(rpart)
library(rpart.plot)
library(corrplot)
library(caret)
library(e1071)
library(forecast)
library(FNN)
# Read the listings file once and keep an untouched copy in master_data
master_data <- read_csv("buenos.csv")
data <- master_data
Please note that explanations of what we did and why are provided at every step below, with a summary at the end of the steps.
data <- data %>%
filter(neighbourhood_cleansed=='Monserrat')
miss_var_summary(data)
## # A tibble: 75 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 neighbourhood_group_cleansed 902 100
## 2 bathrooms 902 100
## 3 calendar_updated 902 100
## 4 license 883 97.9
## 5 host_about 392 43.5
## 6 neighborhood_overview 375 41.6
## 7 neighbourhood 375 41.6
## 8 host_neighbourhood 356 39.5
## 9 host_location 202 22.4
## 10 review_scores_accuracy 190 21.1
## # ℹ 65 more rows
data <- data%>%
select(-neighbourhood_group_cleansed, -bathrooms, -calendar_updated, -license)
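The column removal above follows from the missing-value summary. As a minimal sketch of the same idea in base R, the example below computes the share of missing values per column on a made-up data frame and drops columns above a cut-off. The `toy` data and the 70% `threshold` are assumptions for illustration only; the analysis itself dropped the four columns by name.

```r
# Toy data frame standing in for the listings table (values are made up).
toy <- data.frame(
  id        = 1:5,
  license   = c(NA, NA, NA, NA, "X1"),
  bedrooms  = c(1, NA, 2, 1, 1),
  bathrooms = rep(NA_real_, 5),
  stringsAsFactors = FALSE
)

# Share of missing values per column, in percent.
pct_miss <- colMeans(is.na(toy)) * 100

# Hypothetical cut-off: drop any column that is more than 70% missing.
threshold <- 70
keep <- names(pct_miss)[pct_miss <= threshold]
toy_clean <- toy[, keep, drop = FALSE]

print(round(pct_miss, 1))
print(names(toy_clean))
```

A threshold keeps the rule explicit and repeatable, but choosing it remains a judgment call that depends on how much each variable matters to the analysis.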
head(data)
## # A tibble: 6 × 71
## id listing_url scrape_id last_scraped source name description
## <dbl> <chr> <dbl> <date> <chr> <chr> <chr>
## 1 143663 https://www.airbnb.com… 2.02e13 2023-03-29 city … Apar… "<b>The sp…
## 2 16695 https://www.airbnb.com… 2.02e13 2023-03-29 city … DUPL… "<b>The sp…
## 3 148284 https://www.airbnb.com… 2.02e13 2023-03-29 city … Sunn… "Sunny apa…
## 4 23798 https://www.airbnb.com… 2.02e13 2023-03-29 city … STUN… <NA>
## 5 31514 https://www.airbnb.com… 2.02e13 2023-03-29 city … BEAU… "The Duple…
## 6 42450 https://www.airbnb.com… 2.02e13 2023-03-29 city … Fren… "This refi…
## # ℹ 64 more variables: neighborhood_overview <chr>, picture_url <chr>,
## # host_id <dbl>, host_url <chr>, host_name <chr>, host_since <date>,
## # host_location <chr>, host_about <chr>, host_response_time <chr>,
## # host_response_rate <chr>, host_acceptance_rate <chr>,
## # host_is_superhost <lgl>, host_thumbnail_url <chr>, host_picture_url <chr>,
## # host_neighbourhood <chr>, host_listings_count <dbl>,
## # host_total_listings_count <dbl>, host_verifications <chr>, …
The following variables were removed from the main dataset: listing_url, scrape_id, last_scraped, source, description, neighborhood_overview, picture_url, host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, calendar_last_scraped, first_review, and last_review, along with the four calculated_host_listings_count variables.
The descriptive variables and the URL columns were combined into their own data frame, which can later be rejoined with a LEFT JOIN on the id column.
The host information was combined into its own data frame, which can later be rejoined with a LEFT JOIN on the id column.
# Removing non-essential variables from the main dataset
data <- data %>%
select(-c(listing_url, scrape_id, last_scraped, source, description,
neighborhood_overview, picture_url, host_id, host_url, host_name,
host_since, host_location, host_about, host_response_time,
host_response_rate, host_acceptance_rate, host_is_superhost,
host_thumbnail_url, host_picture_url, host_neighbourhood,
host_listings_count, host_total_listings_count, host_verifications,
host_has_profile_pic, host_identity_verified, calendar_last_scraped,
calculated_host_listings_count, calculated_host_listings_count_entire_homes,
calculated_host_listings_count_private_rooms,
calculated_host_listings_count_shared_rooms, first_review, last_review))
# Creating host related information dataset
host <- master_data %>%
filter(neighbourhood_cleansed=="Monserrat") %>%
select(id, host_id, host_url, host_name, host_since, host_location, host_about,
host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost,
host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count,
host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified,
calculated_host_listings_count, calculated_host_listings_count_entire_homes,
calculated_host_listings_count_private_rooms,
calculated_host_listings_count_shared_rooms)
# Creating Description information dataset
desc <- master_data %>%
filter(neighbourhood_cleansed=="Monserrat")%>%
select(id, description, neighborhood_overview, picture_url, listing_url)
miss_var_summary(data)
## # A tibble: 39 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 neighbourhood 375 41.6
## 2 review_scores_accuracy 190 21.1
## 3 review_scores_cleanliness 190 21.1
## 4 review_scores_checkin 190 21.1
## 5 review_scores_communication 190 21.1
## 6 review_scores_location 190 21.1
## 7 review_scores_value 190 21.1
## 8 review_scores_rating 188 20.8
## 9 reviews_per_month 188 20.8
## 10 bedrooms 88 9.76
## # ℹ 29 more rows
Based on the information above, the following inferences and decisions were made:
neighbourhood: this variable adds no value to the analysis because it duplicates the information in “neighbourhood_cleansed”. It will be removed.
all review scores: observations with missing review scores will be removed. Reviews are subjective user input, so imputing them would introduce bias into the dataset.
reviews_per_month: observations with missing values will be removed, as imputing them might introduce bias into the dataset.
bedrooms, beds, and bathrooms_text: observations with missing values will be removed. These values are property specific, and imputing them might misrepresent the characteristics of the property.
# Removing neighbourhood column
data <- data %>%
select (-neighbourhood)
# Removing missing observations from above-mentioned variables
data <- subset(data, complete.cases(review_scores_accuracy,
review_scores_checkin,
review_scores_cleanliness,
review_scores_communication,
review_scores_location,
review_scores_value,
review_scores_rating,
reviews_per_month,
bedrooms,
beds,
bathrooms_text))
# extract the numerical value from the bathrooms_text variable
data$bathrooms <- as.numeric(gsub("[^[:digit:]./]", "", data$bathrooms_text))
# create a new variable to indicate whether the bathroom is shared or not
data$shared_bathroom <- ifelse(grepl("shared", data$bathrooms_text, ignore.case = TRUE), "Yes", "No")
# handle cases where bathrooms_text is "shared bath" or missing
data$bathrooms[grepl("shared", data$bathrooms_text, ignore.case = TRUE) |
is.na(data$bathrooms_text)] <- NA
# handle cases where bathrooms_text is "0 bath" or "0.5 bath"
data$bathrooms[data$bathrooms == 0] <- 0.5
data$bathrooms[data$bathrooms == 0.5 & grepl("shared", data$bathrooms_text, ignore.case = TRUE)] <- NA
# handle cases where bathrooms_text is "X shared bath" or "X.X shared bath"
# (the same index is used on both sides so the lengths always match)
shared_idx <- grepl("shared", data$bathrooms_text, ignore.case = TRUE) &
!grepl("0\\.5", data$bathrooms_text) &
!is.na(data$bathrooms_text)
data$bathrooms[shared_idx] <- as.numeric(gsub("[^[:digit:]./]", "", data$bathrooms_text[shared_idx]))
# replace missing values with the median number of bathrooms
data$bathrooms[is.na(data$bathrooms)] <- median(data$bathrooms, na.rm = TRUE)
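The bathroom parsing above can be hard to follow because the count and the shared flag are handled in several passes. Below is a minimal alternative sketch, not the method used in this analysis, that extracts both in one helper. The function name `parse_bathrooms`, the example strings, and the rule that a "half" bath without a number counts as 0.5 are all assumptions for illustration.

```r
# Hypothetical helper: split strings such as "1.5 shared baths" into a
# numeric count and a Yes/No shared flag.
parse_bathrooms <- function(txt) {
  # Leading number, e.g. 1.5 from "1.5 shared baths"; NA when there is none.
  count <- suppressWarnings(as.numeric(sub("^([0-9.]+).*$", "\\1", txt)))
  # Assumption: text mentioning "half" with no leading number means 0.5.
  half <- is.na(count) & grepl("half", txt, ignore.case = TRUE)
  count[half] <- 0.5
  data.frame(
    bathrooms = count,
    shared_bathroom = ifelse(grepl("shared", txt, ignore.case = TRUE), "Yes", "No"),
    stringsAsFactors = FALSE
  )
}

examples <- c("1 bath", "1.5 shared baths", "Shared half-bath", "2 baths")
res <- parse_bathrooms(examples)
print(res)
```

Unlike the code above, this sketch neither recodes 0 to 0.5 nor imputes the median, so any remaining NA values would still need a separate decision.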
# split the string column into a list column
data$amenities_list <- lapply(data$amenities, jsonlite::fromJSON)
# specify the maximum length of the list
max_len <- max(lengths(data$amenities_list))
# pad shorter lists with NA values
data$amenities_list <- lapply(data$amenities_list, `length<-`, max_len)
# convert the list column to wide format
data <- unnest_wider(data, col = amenities_list, names_sep = "_")
# converting the first 10 amenities columns into categorical (the loop below covers amenities_list_1 to amenities_list_10 only)
for (i in 1:10) {
col_name <- paste0("amenities_list_", i)
data[[col_name]] <- as.factor(data[[col_name]])}
merged_data <- left_join(data,host, by='id')
merged_data <- left_join(merged_data, desc, by='id')
merged_data$Kitchen <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Kitchen", x)) > 0, 1, 0) })
merged_data$Wifi <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Wifi", x)) > 0, 1, 0) })
merged_data$Air_conditioning <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Air conditioning", x)) > 0, 1, 0) })
merged_data$Elevator <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Elevator", x)) > 0, 1, 0) })
merged_data$Dishes_and_silverware <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Dishes and silverware", x)) > 0, 1, 0) })
merged_data$Washer <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Washer", x)) > 0, 1, 0) })
merged_data$Body_soap <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Body soap", x)) > 0, 1, 0) })
merged_data$Microwave <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Microwave", x)) > 0, 1, 0) })
merged_data$Paid_parking_off_premises <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Paid parking off premises", x)) > 0, 1, 0) })
merged_data$TV <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("TV", x)) > 0, 1, 0) })
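The flags above depend on fixed column positions (41:107), which would shift if columns were added or removed upstream. As a hedged sketch of a position-independent alternative, the example below tests the raw amenities string directly. The helper `has_amenity` and the sample strings are hypothetical, and the sketch assumes amenities are stored as JSON-style arrays of quoted names.

```r
# Toy amenities strings in a JSON-array format (values are made up).
amenities <- c(
  '["Wifi", "Kitchen", "Elevator"]',
  '["TV", "Air conditioning"]',
  '["Paid parking off premises", "Wifi"]'
)

# Hypothetical helper: 1 if the exact quoted amenity name occurs, else 0.
has_amenity <- function(x, name) {
  as.integer(grepl(paste0('"', name, '"'), x, fixed = TRUE))
}

wanted <- c("Kitchen", "Wifi", "Air conditioning", "Elevator")
# One column per amenity, one row per listing.
flags <- sapply(wanted, function(nm) has_amenity(amenities, nm))
print(flags)
```

Note that this matches exact names only, whereas the code above matches substrings, so a pattern like "TV" there would also match longer amenity names that contain it. Which behavior is preferable depends on how the amenity labels are written.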
Availability was binned into two variables:
short term: (availability_30 + availability_60 + availability_90)/3. If the value is at least the mean of this short-term column, the flag is 1; otherwise it is 0.
long term: availability_365. If the value is at least the mean of the availability_365 column, the flag is 1; otherwise it is 0.
A value of 1 means the property has availability at or above the average for that short or long term; 0 means it does not.
Two further columns were created to aid the analysis:
years: the number of years elapsed since the host joined Airbnb (host_since), measured up to 2023-05-05.
total_amenities: how many of the nine tracked amenities (Kitchen through Paid_parking_off_premises) each listing offers.
merged_data <- merged_data %>%
mutate(mean_short = (availability_30+availability_60+availability_90)/3) %>%
mutate(short_term_availability = ifelse(mean_short<mean(mean_short), 0, 1)) %>%
mutate(long_term_availability = ifelse(availability_365 < mean(availability_365), 0,1)) %>%
mutate(start_date = as.Date(host_since)) %>%
mutate(end_date = as.Date("2023-05-05")) %>%
mutate(years = as.numeric(difftime(end_date, start_date)))%>%
mutate(years = years/365) %>%
mutate(total_amenities = Kitchen+Wifi+Air_conditioning+Elevator+Dishes_and_silverware+Washer
+Body_soap+Microwave+Paid_parking_off_premises)
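To make the binning rule concrete, here is a small worked example in base R on three made-up listings. The `toy` values are assumptions; the formulas mirror the ones above, with the `units = "days"` argument spelled out so the division by 365 does not rely on the unit that `difftime()` picks automatically.

```r
# Three made-up listings.
toy <- data.frame(
  availability_30  = c(0, 10, 30),
  availability_60  = c(0, 20, 60),
  availability_90  = c(0, 30, 90),
  availability_365 = c(50, 100, 300),
  host_since       = as.Date(c("2015-05-05", "2019-05-05", "2022-05-05"))
)

# Short-term score: average of the 30-, 60- and 90-day availability.
toy$mean_short <- (toy$availability_30 + toy$availability_60 + toy$availability_90) / 3

# Flag is 1 when the value is at least the column mean, otherwise 0.
toy$short_term_availability <- ifelse(toy$mean_short < mean(toy$mean_short), 0, 1)
toy$long_term_availability  <- ifelse(toy$availability_365 < mean(toy$availability_365), 0, 1)

# Years between host_since and the reference date.
end_date <- as.Date("2023-05-05")
toy$years <- as.numeric(difftime(end_date, toy$host_since, units = "days")) / 365

print(toy)
```

Because the cut-off is the sample mean, each flag is relative to the other listings in the data rather than an absolute measure of availability.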
# Remove "N/A" value in the host data
merged_data <- subset(merged_data, host_response_time != "N/A")
merged_data <- subset(merged_data, host_response_rate != "N/A")
merged_data <- subset(merged_data, host_acceptance_rate != "N/A")
# Converting host response rate and acceptance rate into numeric
merged_data$host_response_rate <- as.numeric(gsub("%", "", merged_data$host_response_rate))/100
merged_data$host_acceptance_rate <- as.numeric(gsub("%", "", merged_data$host_acceptance_rate))/100
# Preparing price data
merged_data$price <- gsub("\\$|,", "", merged_data$price)
merged_data$price <- as.numeric(merged_data$price)
# Converting room and property type data into categorical
merged_data$property_type <- as.factor(merged_data$property_type)
merged_data$room_type <- as.factor(merged_data$room_type)
Observations with “N/A” in host_response_time, host_response_rate, or host_acceptance_rate were removed because they make up a small proportion of the dataset and imputing them might introduce bias.
merged_data$room_type <- as.factor(merged_data$room_type)
merged_data$instant_bookable <- as.factor(merged_data$instant_bookable)
merged_data$shared_bathroom <- as.factor(merged_data$shared_bathroom)
merged_data$host_response_time <- as.factor(merged_data$host_response_time)
merged_data$host_is_superhost <- as.factor(merged_data$host_is_superhost)
merged_data$host_identity_verified <- as.factor(merged_data$host_identity_verified)
merged_data$Kitchen <- as.factor(merged_data$Kitchen)
merged_data$Wifi <- as.factor(merged_data$Wifi)
merged_data$Air_conditioning <- as.factor(merged_data$Air_conditioning)
merged_data$Elevator <- as.factor(merged_data$Elevator)
merged_data$Dishes_and_silverware <- as.factor(merged_data$Dishes_and_silverware)
merged_data$Washer <- as.factor(merged_data$Washer)
merged_data$Body_soap <- as.factor(merged_data$Body_soap)
merged_data$Microwave <- as.factor(merged_data$Microwave)
merged_data$Paid_parking_off_premises <- as.factor(merged_data$Paid_parking_off_premises)
merged_data$short_term_availability <- as.factor(merged_data$short_term_availability)
merged_data$long_term_availability <- as.factor(merged_data$long_term_availability)
data_new <- merged_data %>%
select(id, name, latitude, longitude, property_type, room_type, price,
accommodates, bedrooms, beds,bathrooms,shared_bathroom, minimum_nights,
maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy,
review_scores_cleanliness, review_scores_checkin,
review_scores_communication, review_scores_location, review_scores_value,
instant_bookable, Kitchen, Wifi, Air_conditioning, Elevator, Dishes_and_silverware,
Washer, Body_soap, Microwave, Paid_parking_off_premises,
total_amenities, short_term_availability,
long_term_availability, years, host_id, host_response_time,
host_response_rate, host_acceptance_rate, host_is_superhost, host_identity_verified)
write.csv(data_new, file = "data_new.csv", row.names = FALSE)
TLDR:
We removed several variables that we believed would add little value to the analysis. We also removed observations with “N/A” or missing values because we believed the data could not be imputed without introducing significant bias. Finally, we engineered several features to simplify the modeling and analysis.
Looking at the Airbnb data for the Monserrat neighborhood, it is interesting to know the price, number of bathrooms, number of bedrooms, guests accommodated, review rating, and total amenities for each room type. Below are the summary statistics for each of these variables.
price_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_price = mean(price, na.rm = TRUE),
sd_price = sd(price, na.rm = TRUE),
median_price = median(price, na.rm = TRUE),
min_price = min(price, na.rm = TRUE),
max_price = max(price, na.rm = TRUE))
price_stats
## # A tibble: 4 × 7
## room_type observation mean_price sd_price median_price min_price max_price
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Entire home/… 439 10242. 16226. 8076 1861 262857
## 2 Hotel room 7 12340. 3660. 13017 6390 16920
## 3 Private room 103 4610. 2213. 4142 2018 16642
## 4 Shared room 14 3317. 2611. 2682. 175 11596
bathrooms_private_stats <- data_new %>%
filter(shared_bathroom=="No") %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_bathrooms = mean(bathrooms, na.rm = TRUE),
sd_bathrooms = sd(bathrooms, na.rm = TRUE),
median_bathrooms = median(bathrooms, na.rm = TRUE),
min_bathrooms = min(bathrooms, na.rm = TRUE),
max_bathrooms = max(bathrooms, na.rm = TRUE))
bathrooms_private_stats
## # A tibble: 3 × 7
## room_type observation mean_bathrooms sd_bathrooms median_bathrooms
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Entire home/apt 439 1.15 0.460 1
## 2 Hotel room 4 1 0 1
## 3 Private room 20 1.42 0.766 1
## # ℹ 2 more variables: min_bathrooms <dbl>, max_bathrooms <dbl>
bathrooms_shared_stats <- data_new %>%
filter(shared_bathroom=="Yes") %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_bathrooms = mean(bathrooms, na.rm = TRUE),
sd_bathrooms = sd(bathrooms, na.rm = TRUE),
median_bathrooms = median(bathrooms, na.rm = TRUE),
min_bathrooms = min(bathrooms, na.rm = TRUE),
max_bathrooms = max(bathrooms, na.rm = TRUE))
bathrooms_shared_stats
## # A tibble: 3 × 7
## room_type observation mean_bathrooms sd_bathrooms median_bathrooms
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Hotel room 3 2.67 1.15 2
## 2 Private room 83 2.08 2.05 1
## 3 Shared room 14 2.21 1.31 3
## # ℹ 2 more variables: min_bathrooms <dbl>, max_bathrooms <dbl>
bedrooms_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_bedrooms = mean(bedrooms, na.rm = TRUE),
sd_bedrooms = sd(bedrooms, na.rm = TRUE),
median_bedrooms = median(bedrooms, na.rm = TRUE),
min_bedrooms = min(bedrooms, na.rm = TRUE),
max_bedrooms = max(bedrooms, na.rm = TRUE))
bedrooms_stats
## # A tibble: 4 × 7
## room_type observation mean_bedrooms sd_bedrooms median_bedrooms min_bedrooms
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Entire hom… 439 1.33 0.647 1 1
## 2 Hotel room 7 1 0 1 1
## 3 Private ro… 103 2.21 2.72 1 1
## 4 Shared room 14 1 0 1 1
## # ℹ 1 more variable: max_bedrooms <dbl>
accommodates_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_accommodates = mean(accommodates, na.rm = TRUE),
sd_accommodates = sd(accommodates, na.rm = TRUE),
median_accommodates = median(accommodates, na.rm = TRUE),
min_accommodates = min(accommodates, na.rm = TRUE),
max_accommodates = max(accommodates, na.rm = TRUE))
accommodates_stats
## # A tibble: 4 × 7
## room_type observation mean_accommodates sd_accommodates median_accommodates
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Entire home… 439 3.11 1.35 3
## 2 Hotel room 7 2 0.577 2
## 3 Private room 103 2.15 2.32 2
## 4 Shared room 14 2.36 1.86 1.5
## # ℹ 2 more variables: min_accommodates <dbl>, max_accommodates <dbl>
review_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_review = mean(review_scores_rating, na.rm = TRUE),
sd_review= sd(review_scores_rating, na.rm = TRUE),
median_review = median(review_scores_rating, na.rm = TRUE),
min_review = min(review_scores_rating, na.rm = TRUE),
max_review = max(review_scores_rating, na.rm = TRUE))
review_stats
## # A tibble: 4 × 7
## room_type observation mean_review sd_review median_review min_review
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Entire home/apt 439 4.71 0.434 4.83 1
## 2 Hotel room 7 4.27 0.694 4.38 3
## 3 Private room 103 4.62 0.479 4.8 3
## 4 Shared room 14 4.74 0.422 5 4
## # ℹ 1 more variable: max_review <dbl>
amenities_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_amenities = mean(total_amenities, na.rm = TRUE),
sd_amenities= sd(total_amenities, na.rm = TRUE),
median_amenities = median(total_amenities, na.rm = TRUE),
min_amenities = min(total_amenities, na.rm = TRUE),
max_amenities = max(total_amenities, na.rm = TRUE))
amenities_stats
## # A tibble: 4 × 7
## room_type observation mean_amenities sd_amenities median_amenities
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Entire home/apt 439 5.44 1.44 5
## 2 Hotel room 7 5.86 0.378 6
## 3 Private room 103 4.37 1.28 4
## 4 Shared room 14 4.36 1.34 5
## # ℹ 2 more variables: min_amenities <dbl>, max_amenities <dbl>
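The summary blocks above repeat the same six statistics for different columns. A minimal sketch of how that repetition could be folded into one helper is shown below, written in base R so it runs without extra packages. The function name `summarise_by_group` and the toy data are assumptions, not part of the analysis.

```r
# Hypothetical helper: the same six statistics for any numeric column,
# computed per group (room_type by default).
summarise_by_group <- function(df, value, group = "room_type") {
  stats <- function(x) {
    c(observation = length(x),
      mean   = mean(x, na.rm = TRUE),
      sd     = sd(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      min    = min(x, na.rm = TRUE),
      max    = max(x, na.rm = TRUE))
  }
  # One row per group, one column per statistic.
  out <- t(sapply(split(df[[value]], df[[group]]), stats))
  data.frame(group = rownames(out), out, row.names = NULL,
             stringsAsFactors = FALSE)
}

# Made-up prices for two room types.
toy <- data.frame(
  room_type = c("Private room", "Private room", "Shared room", "Shared room"),
  price     = c(4000, 5000, 2000, 3000),
  stringsAsFactors = FALSE
)
res <- summarise_by_group(toy, "price")
print(res)
```

As with `n()` in the blocks above, the observation count here includes rows whose value is missing, while the other statistics ignore them.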
Summary
Monserrat, a charming location for vacationers, offered an array of Airbnb properties for travelers. Among the options available, entire homes or apartments proved to be the most popular, far outnumbering private rooms and shared spaces. Surprisingly, hotel rooms came out as the most expensive option, while entire homes or apartments ranked a close second.
For those who value their privacy, a property that specifies a private bathroom is essential. Interestingly, properties with private bathrooms averaged about one bathroom across room types, while those with shared bathrooms averaged about two. Private rooms had the highest average number of bedrooms, at around two per listing. Entire homes or apartments, on the other hand, accommodated the most guests on average, typically around three people.
When it came to amenities, hotel rooms triumphed with the highest mean of total amenities, closely followed by entire homes or apartments. Despite the differences in amenities, all room types shared a relatively similar mean review rating, indicating that the quality of the listings was consistent across the board.
With all these options to choose from, Monserrat promises an unforgettable experience for all types of travelers.
Looking at the Airbnb data for the Monserrat neighborhood, it is interesting to see visually how the number of listings, price, review rating, amenities, and the relationship between price and guests accommodated vary by room type. Below are the plots for each of these.
ggplot(data_new, aes(x = room_type, y = after_stat(count), fill = room_type)) +
geom_bar(alpha = 0.7, width = 0.5) +
labs(x = "Room Type", y = "Count", fill = "Room Type") +
scale_fill_manual(values = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12)) +
ggtitle("Number of Property Based on Room Type")
Note: there are two outlier price points for the “Entire home/apt” room type (262,857 USD and 216,521 USD). These two outliers were removed to improve the visualization.
# Remove two maximum values of price for entire home/apt
data_new_clean <- data_new %>%
filter(!(room_type == "Entire home/apt" & price %in% tail(sort(price), 2)))
ggplot(data_new_clean, aes(x = room_type, y = price, fill = room_type)) +
geom_boxplot(alpha = 0.7, width = 0.5) +
labs(x = "Room Type", y = "Price", fill = "Room Type") +
scale_fill_manual(values = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")) +
scale_y_continuous(labels = dollar_format(prefix = "$")) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12)) +
ggtitle("Price Distribution by Room Type")
ggplot(data_new, aes(x = room_type, y = review_scores_rating, fill = room_type)) +
geom_violin(scale = "width", alpha = 0.7) +
labs(x = "Room Type", y = "Review Scores Rating", fill = "Room Type") +
scale_fill_manual(values = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12)) +
ggtitle("Review Scores Rating Distribution by Room Type")
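# Calculate the sum of each amenity by room type
# The amenity flags are factors, so convert via character to recover 0/1
# (as.numeric() on a factor would return the level codes instead)
amenities_sum_by_roomtype <- data_new %>%
select(room_type, Kitchen, Wifi, Air_conditioning, Elevator, Dishes_and_silverware, Washer, Body_soap, Microwave) %>%
mutate(across(Kitchen:Microwave, ~ as.numeric(as.character(.x)))) %>%
group_by(room_type) %>%
summarize_all(sum)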
# Reshape data to long format for plotting
amenities_sum_by_roomtype_long <- amenities_sum_by_roomtype %>%
pivot_longer(cols = -room_type, names_to = "amenity", values_to = "count") %>%
arrange(room_type, desc(count))
# Create stacked bar plot
ggplot(amenities_sum_by_roomtype_long, aes(x = amenity, y = count, fill = room_type)) +
geom_col() +
scale_fill_manual(values = c("#F8766D", "#00BA38", "#619CFF", "#DA3B3A")) +
labs(x = "Amenities", y = "Number of Listings", fill = "Room Type") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Amenities by Room Type")
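The amenity flags were converted to factors during data preparation, so turning them back into numbers for counting needs some care. The short sketch below, on a made-up `flag` vector, shows how `as.numeric()` applied directly to a factor returns the internal level codes rather than the original 0/1 labels.

```r
# A made-up 0/1 amenity flag stored as a factor.
flag <- factor(c(0, 1, 1, 0, 1))

# as.numeric() on a factor returns the level codes (1 and 2 here),
# not the original labels.
codes <- as.numeric(flag)

# Going through character first recovers the original 0/1 values.
values <- as.numeric(as.character(flag))

print(sum(codes))   # 8: inflated by one per listing
print(sum(values))  # 3: the number of listings with the amenity
```

Summing the level codes overstates every count by the number of listings in the group, which changes the heights of the bars without changing their ordering within a room type.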
my_colors <- c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")
ggplot(data_new, aes(x = accommodates, y = price, color = room_type)) +
geom_point(alpha = 0.7, size = 3) +
scale_color_manual(values = my_colors) +
scale_y_continuous(labels = dollar_format(prefix = "$")) +
labs(x = "Accommodates", y = "Price", color = "Room Type") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12)) +
ggtitle("Scatterplot of Price and Accommodates by Room Type")
Summary
Nestled in the stunning location of Monserrat, vacationers have an array of Airbnb properties to choose from. Dominating the market with 439 listings, entire homes or apartments were the most common option, followed by private rooms with 103 listings. In contrast, the number of listings for hotel rooms (7) and shared rooms (14) was low.
When it comes to price, hotel rooms reign supreme as the most expensive option, followed by entire homes or apartments. Surprisingly, shared rooms were found to be the cheapest option. Entire homes or apartments boasted the broadest range of prices compared to the rest of the property room types, making them an attractive option for budget-conscious travelers.
The review ratings for all room types in Monserrat were relatively consistent, with no large differences among them. However, entire homes or apartments had the broadest range of review ratings, with a mean of about 4.7 but scores as low as 1. This highlights the importance of reading through reviews thoroughly before making a booking.
If amenities are essential, then entire homes or apartments would be the go-to option in Monserrat. They offer the highest number of amenities compared to the other room types. From free Wi-Fi to essential kitchen supplies, these properties cater to the needs of all types of travelers.
Interestingly, the number of accommodates does not seem to affect rental prices for all room types in Monserrat. This opens up an opportunity for larger groups to enjoy a budget-friendly stay without having to worry about spending more for the same property.
All in all, Monserrat is an excellent location for vacationers, with Airbnb properties offering something for everyone.
m <- leaflet() %>% addTiles() %>% addCircles(data = data_new, lng= ~longitude , lat= ~latitude)%>% addProviderTiles(providers$JusticeMap.income)
m
Description:
The Monserrat neighborhood is adjacent to the natural reservoir and Laguna de los Patos. Besides nature, Monserrat has notable landmarks such as the Casa Rosada and Plaza de Mayo. The former is the presidential palace of Argentina and serves as the executive office of the President; the latter is a historic public square that has been the site of many important political events in Argentina’s history.
# Split neighborhood_overview column into words and create a new dataframe
words <- master_data %>%
select(neighborhood_overview) %>%
unnest_tokens(word, neighborhood_overview)
# Create a custom list of stop words
custom_stopwords <- c(stop_words$word, "de", "la")
# Remove stop words and create a word frequency table
word_freq <- words %>%
anti_join(stop_words, by = "word") %>%
anti_join(data.frame(word = custom_stopwords), by = "word") %>%
count(word, sort = TRUE)
# Set the size of the graphics device
options(repr.plot.width = 8, repr.plot.height = 8)
# Generate a word cloud
wordcloud(words = word_freq$word, freq = word_freq$n, min.freq = 1,
max.words = 200, random.order = FALSE, rot.per = 0.5,
colors = brewer.pal(8, "Dark2"))
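Listing text in this dataset can contain HTML markup (the description column shown earlier starts with `<b>`), so some tokens in a word count may come from tags rather than from words. The sketch below is one possible way to strip tags before tokenizing. The helper `strip_html`, the sample strings, and the simple regular expression are assumptions, not part of the analysis, and the tokenizer is a base R stand-in for `unnest_tokens()`.

```r
# Made-up overview text containing HTML tags.
overview <- c(
  "Close to Plaza de Mayo.<br /><br />Near San Telmo.",
  "<b>The space</b><br />Quiet street."
)

# Replace anything that looks like an HTML tag with a space.
# This is a simple regex sketch, not a full HTML parser.
strip_html <- function(x) gsub("<[^>]+>", " ", x)

clean <- strip_html(overview)

# Simplified tokenizer: lower-case, split on anything that is not a-z.
# Accented letters would be split too, so this is for illustration only.
tokens <- unlist(strsplit(tolower(clean), "[^a-z]+"))
tokens <- tokens[tokens != ""]

print(table(tokens))
```

Applying a step like this before tokenizing would keep tag names such as "br" and "b" out of the frequency table.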
The words in the word cloud are all related to the neighborhoods and landmarks in Buenos Aires, and their prominence in the word cloud can provide insights into the most frequent and important words in the neighborhood overview column of the Buenos Aires Airbnb dataset.
“Br” most likely comes from HTML line-break tags (<br />) embedded in the listing text rather than from a Spanish word; the description column shown earlier contains similar markup. “San” means “Saint” in Spanish and is common in place names, so its appearance in the word cloud suggests that the neighborhood overview column includes references to streets, districts, or landmarks with this title.
“Telmo” refers to the San Telmo neighborhood in Buenos Aires, which is known for its historic architecture, tango culture, and antique markets. Its appearance in the word cloud suggests that the neighborhood overview column may include descriptions of this neighborhood and its characteristics.
“Buenos” and “Aires” refer to the city of Buenos Aires, which is the capital of Argentina and one of the largest cities in South America. The appearance of these terms in the word cloud suggests that the neighborhood overview column may include descriptions of different neighborhoods and landmarks within the city.
“Mayo” refers to the Plaza de Mayo, which is a public square in the heart of Buenos Aires that is known for its historical and political significance. Its appearance in the word cloud suggests that the neighborhood overview column may include descriptions of this landmark and its role in the city’s history.
“Plaza” refers to public squares and plazas, which are common features in many neighborhoods in Buenos Aires. Its appearance in the word cloud suggests that the neighborhood overview column may include descriptions of different plazas and their characteristics.
The multiple regression model was constructed in the following steps.
MLR <- data_new
MLR$price <- log(MLR$price)
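The source does not state why price is log-transformed; a common reason is strong right skew, which the summary table suggests here (the maximum entire-home price is far above the median). The sketch below uses made-up prices and a hypothetical coefficient of 0.10 to show the transformation, the back-transformation, and how a coefficient on log price can be read as a percentage change.

```r
# Made-up, right-skewed prices.
price <- c(2000, 4000, 8000, 16000, 260000)

log_price <- log(price)      # natural log, as applied to MLR$price
back      <- exp(log_price)  # back-transform to the original scale

print(round(log_price, 2))
print(all.equal(back, price))

# Hypothetical coefficient of 0.10 on log(price): a one-unit increase in
# the predictor multiplies price by exp(0.10), i.e. about a 10.5% increase.
pct_change <- (exp(0.10) - 1) * 100
print(round(pct_change, 2))
```

Predictions from a model fitted on log price are on the log scale, so they need `exp()` before being compared with actual prices.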
Using the length() and unique() functions, we counted the unique values of several variables in the dataset. Based on the results, we removed id, host_id, name, latitude, and longitude from the dataset used for multiple linear regression (MLR) because they are irrelevant to the model.
Furthermore, property_type is a finer-grained subtype of room_type, so it was also removed.
# Looking for number of unique value
length(unique(MLR$id))
## [1] 563
length(unique(MLR$host_id))
## [1] 352
length(unique(MLR$name))
## [1] 536
length(unique(MLR$latitude))
## [1] 428
length(unique(MLR$longitude))
## [1] 464
length(unique(MLR$property_type))
## [1] 22
length(unique(MLR$room_type))
## [1] 4
length(unique(MLR$host_response_time))
## [1] 4
# Removing the id, host_id, name, property_type, latitude, and longitude variable
MLR_clean <- subset(MLR, select=c(-id, -host_id, -name, -property_type, -latitude, -longitude))
According to the results below, several pairs of variables have a correlation coefficient >= 0.80: review_scores_rating & review_scores_accuracy (0.84), review_scores_rating & review_scores_value (0.83), and review_scores_accuracy & review_scores_value (0.80). Therefore, the review_scores_rating and review_scores_value variables will be removed from the dataset.
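Reading highly correlated pairs off a large printed matrix is error-prone. The sketch below shows one way to list them programmatically in base R. The simulated `toy` data and the name `high_pairs` are assumptions; only the 0.80 cut-off comes from the text.

```r
# Simulated data: b is nearly a copy of a, c is unrelated.
set.seed(1)
x <- rnorm(100)
toy <- data.frame(
  a = x,
  b = x + rnorm(100, sd = 0.1),
  c = rnorm(100)
)

corr <- cor(toy)

# Look only at the upper triangle so each pair is reported once
# and the diagonal (always 1) is skipped.
idx <- which(abs(corr) >= 0.80 & upper.tri(corr), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx],
  stringsAsFactors = FALSE
)
print(high_pairs)
```

The caret package, which is already loaded in this analysis, provides `findCorrelation()` for a similar purpose; the base R version is shown here only to make the logic visible.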
library(corrplot)
# Calculating correlation between numeric variables
Corr <- cor(MLR_clean %>%
select(c(accommodates, bedrooms, beds,
minimum_nights, maximum_nights,number_of_reviews,
review_scores_rating, review_scores_accuracy,
review_scores_cleanliness,review_scores_checkin,
review_scores_communication, review_scores_location,
review_scores_value, bathrooms, host_response_rate,
host_acceptance_rate, years)))
print(Corr)
## accommodates bedrooms beds
## accommodates 1.00000000 0.39163873 0.776193337
## bedrooms 0.39163873 1.00000000 0.531683944
## beds 0.77619334 0.53168394 1.000000000
## minimum_nights -0.13467965 0.17777513 -0.027669107
## maximum_nights 0.10563102 0.04177133 0.103623324
## number_of_reviews 0.04056428 -0.06457375 0.011063160
## review_scores_rating 0.02338411 -0.03305670 -0.025240309
## review_scores_accuracy 0.04624737 -0.03039344 -0.016314203
## review_scores_cleanliness 0.03412964 -0.06843784 -0.038610731
## review_scores_checkin 0.07016071 0.03541220 0.032680840
## review_scores_communication 0.07481748 0.02091280 0.046527560
## review_scores_location 0.01134373 0.02137466 0.031129775
## review_scores_value 0.05199863 -0.06058039 0.004536392
## bathrooms 0.35909931 0.53828307 0.529161985
## host_response_rate 0.02777254 -0.02327931 -0.005885384
## host_acceptance_rate -0.05074998 -0.05396403 -0.118515085
## years 0.06097251 0.02816177 0.008185991
## minimum_nights maximum_nights number_of_reviews
## accommodates -0.134679647 0.105631016 0.04056428
## bedrooms 0.177775127 0.041771334 -0.06457375
## beds -0.027669107 0.103623324 0.01106316
## minimum_nights 1.000000000 0.103833378 -0.11052285
## maximum_nights 0.103833378 1.000000000 0.03490829
## number_of_reviews -0.110522847 0.034908288 1.00000000
## review_scores_rating -0.003509136 0.022874314 0.06583003
## review_scores_accuracy -0.001719427 0.030143104 0.10748644
## review_scores_cleanliness -0.106790913 0.010408507 0.10563770
## review_scores_checkin -0.019165431 -0.020121012 0.08515405
## review_scores_communication 0.016305678 0.071009596 0.07786747
## review_scores_location 0.055976142 0.004447184 0.07514351
## review_scores_value -0.058117135 0.061768776 0.10479946
## bathrooms 0.145787850 0.134600981 -0.04965522
## host_response_rate -0.040275899 0.043639074 0.09637063
## host_acceptance_rate -0.072882698 -0.066730768 0.11985212
## years 0.114052990 0.030606622 0.18649473
## review_scores_rating review_scores_accuracy
## accommodates 0.023384108 0.046247367
## bedrooms -0.033056704 -0.030393437
## beds -0.025240309 -0.016314203
## minimum_nights -0.003509136 -0.001719427
## maximum_nights 0.022874314 0.030143104
## number_of_reviews 0.065830033 0.107486439
## review_scores_rating 1.000000000 0.844042444
## review_scores_accuracy 0.844042444 1.000000000
## review_scores_cleanliness 0.761956529 0.752591986
## review_scores_checkin 0.626726336 0.624085514
## review_scores_communication 0.618736136 0.557996063
## review_scores_location 0.484105027 0.471564816
## review_scores_value 0.834415398 0.803564620
## bathrooms -0.080196253 -0.105078662
## host_response_rate 0.086576097 0.111045472
## host_acceptance_rate 0.077495718 0.116594755
## years 0.036039269 0.035634714
## review_scores_cleanliness review_scores_checkin
## accommodates 0.03412964 0.07016071
## bedrooms -0.06843784 0.03541220
## beds -0.03861073 0.03268084
## minimum_nights -0.10679091 -0.01916543
## maximum_nights 0.01040851 -0.02012101
## number_of_reviews 0.10563770 0.08515405
## review_scores_rating 0.76195653 0.62672634
## review_scores_accuracy 0.75259199 0.62408551
## review_scores_cleanliness 1.00000000 0.52823467
## review_scores_checkin 0.52823467 1.00000000
## review_scores_communication 0.40101096 0.65093460
## review_scores_location 0.35933422 0.45327165
## review_scores_value 0.69753638 0.58032383
## bathrooms -0.13364166 -0.08655867
## host_response_rate 0.05542183 0.09059675
## host_acceptance_rate 0.16030786 0.09636437
## years 0.04380719 0.05063207
## review_scores_communication review_scores_location
## accommodates 0.07481748 0.011343732
## bedrooms 0.02091280 0.021374658
## beds 0.04652756 0.031129775
## minimum_nights 0.01630568 0.055976142
## maximum_nights 0.07100960 0.004447184
## number_of_reviews 0.07786747 0.075143511
## review_scores_rating 0.61873614 0.484105027
## review_scores_accuracy 0.55799606 0.471564816
## review_scores_cleanliness 0.40101096 0.359334218
## review_scores_checkin 0.65093460 0.453271650
## review_scores_communication 1.00000000 0.422406115
## review_scores_location 0.42240611 1.000000000
## review_scores_value 0.52970001 0.514722162
## bathrooms -0.02987541 -0.034024904
## host_response_rate 0.04161985 0.048193586
## host_acceptance_rate -0.01617061 0.003187999
## years 0.10942055 0.052420460
## review_scores_value bathrooms host_response_rate
## accommodates 0.051998634 0.35909931 0.027772542
## bedrooms -0.060580389 0.53828307 -0.023279309
## beds 0.004536392 0.52916199 -0.005885384
## minimum_nights -0.058117135 0.14578785 -0.040275899
## maximum_nights 0.061768776 0.13460098 0.043639074
## number_of_reviews 0.104799458 -0.04965522 0.096370626
## review_scores_rating 0.834415398 -0.08019625 0.086576097
## review_scores_accuracy 0.803564620 -0.10507866 0.111045472
## review_scores_cleanliness 0.697536376 -0.13364166 0.055421835
## review_scores_checkin 0.580323826 -0.08655867 0.090596748
## review_scores_communication 0.529700012 -0.02987541 0.041619848
## review_scores_location 0.514722162 -0.03402490 0.048193586
## review_scores_value 1.000000000 -0.09202214 0.097950024
## bathrooms -0.092022136 1.00000000 -0.090084227
## host_response_rate 0.097950024 -0.09008423 1.000000000
## host_acceptance_rate 0.075671278 -0.14541633 0.439467106
## years -0.009963601 0.02339579 0.004264235
## host_acceptance_rate years
## accommodates -0.050749977 0.060972512
## bedrooms -0.053964029 0.028161773
## beds -0.118515085 0.008185991
## minimum_nights -0.072882698 0.114052990
## maximum_nights -0.066730768 0.030606622
## number_of_reviews 0.119852122 0.186494732
## review_scores_rating 0.077495718 0.036039269
## review_scores_accuracy 0.116594755 0.035634714
## review_scores_cleanliness 0.160307863 0.043807187
## review_scores_checkin 0.096364374 0.050632067
## review_scores_communication -0.016170611 0.109420548
## review_scores_location 0.003187999 0.052420460
## review_scores_value 0.075671278 -0.009963601
## bathrooms -0.145416330 0.023395789
## host_response_rate 0.439467106 0.004264235
## host_acceptance_rate 1.000000000 -0.009850326
## years -0.009850326 1.000000000
# Plotting the correlation
corrplot(Corr, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
# Removing the review_scores_rating and review_scores_value variables
MLR_fix <- subset(MLR_clean, select=c(-review_scores_rating, -review_scores_value))
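The screening for highly correlated pairs can also be done programmatically instead of by reading the printed matrix. The following is a minimal sketch, assuming the same 0.80 cutoff; the helper name high_cor_pairs is hypothetical, and the built-in mtcars data stands in for the numeric columns of MLR_clean.

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# meets or exceeds a cutoff. mtcars stands in for the Airbnb data.
high_cor_pairs <- function(df, cutoff = 0.80) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Keep only the upper triangle so each pair is reported once
  idx <- which(abs(cm) >= cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

pairs_found <- high_cor_pairs(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])
print(pairs_found)
```

Each row of the result names one pair, which makes it easier to decide which member of a pair to drop.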
Using the sample() function, 60% of the rows in the MLR_fix data frame were randomly assigned to train.df, and the remaining 40% were assigned to valid.df.
set.seed(62)
train.index <- sample(c(1:nrow(MLR_fix)), nrow(MLR_fix)*0.6)
train.df <- MLR_fix[train.index, ]
valid.df <- MLR_fix[-train.index, ]
MLR_all <- lm(price~ ., data=train.df)
summary(MLR_all)
##
## Call:
## lm(formula = price ~ ., data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08776 -0.25897 -0.02813 0.24625 3.08743
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.540e+00 5.946e-01 14.363 < 2e-16 ***
## room_typeHotel room 2.969e-01 3.442e-01 0.862 0.38918
## room_typePrivate room -5.178e-01 1.566e-01 -3.306 0.00106 **
## room_typeShared room -1.132e+00 2.502e-01 -4.526 8.68e-06 ***
## accommodates 1.340e-01 2.550e-02 5.256 2.80e-07 ***
## bedrooms -2.155e-02 2.917e-02 -0.739 0.46059
## beds -2.800e-02 3.012e-02 -0.930 0.35322
## bathrooms 2.223e-02 4.235e-02 0.525 0.60006
## shared_bathroomYes -1.484e-01 1.566e-01 -0.948 0.34395
## minimum_nights -4.218e-03 3.925e-03 -1.075 0.28344
## maximum_nights 9.415e-05 6.035e-05 1.560 0.11978
## number_of_reviews -3.743e-04 5.712e-04 -0.655 0.51277
## review_scores_accuracy 1.052e-01 1.021e-01 1.030 0.30366
## review_scores_cleanliness -2.355e-02 8.180e-02 -0.288 0.77365
## review_scores_checkin -2.507e-02 1.207e-01 -0.208 0.83564
## review_scores_communication 2.807e-02 1.029e-01 0.273 0.78518
## review_scores_location -5.063e-02 1.062e-01 -0.477 0.63386
## instant_bookableTRUE 5.675e-02 6.121e-02 0.927 0.35455
## Kitchen1 -7.343e-02 1.563e-01 -0.470 0.63889
## Wifi1 1.685e-01 1.066e-01 1.581 0.11485
## Air_conditioning1 -5.882e-02 6.108e-02 -0.963 0.33638
## Elevator1 -1.466e-01 6.251e-02 -2.346 0.01965 *
## Dishes_and_silverware1 1.229e-01 9.498e-02 1.294 0.19671
## Washer1 1.049e-01 7.284e-02 1.440 0.15099
## Body_soap1 -3.179e-02 6.229e-02 -0.510 0.61016
## Microwave1 3.363e-02 6.928e-02 0.485 0.62771
## Paid_parking_off_premises1 4.244e-02 6.298e-02 0.674 0.50097
## total_amenities NA NA NA NA
## short_term_availability1 1.260e-01 5.772e-02 2.183 0.02983 *
## long_term_availability1 9.220e-03 5.513e-02 0.167 0.86731
## years -7.816e-03 8.547e-03 -0.914 0.36119
## host_response_timewithin a day 1.443e-01 4.051e-01 0.356 0.72198
## host_response_timewithin a few hours 5.390e-01 4.565e-01 1.181 0.23865
## host_response_timewithin an hour 5.254e-01 4.642e-01 1.132 0.25861
## host_response_rate -2.829e-01 3.978e-01 -0.711 0.47748
## host_acceptance_rate -4.690e-01 1.729e-01 -2.713 0.00705 **
## host_is_superhostTRUE 8.149e-02 6.476e-02 1.258 0.20924
## host_identity_verifiedTRUE -1.882e-02 9.441e-02 -0.199 0.84210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4635 on 300 degrees of freedom
## Multiple R-squared: 0.5071, Adjusted R-squared: 0.4479
## F-statistic: 8.573 on 36 and 300 DF, p-value: < 2.2e-16
Performing stepwise regression.
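The call that creates MLR.step is not shown in this document. As a rough sketch of what a stepwise fit can look like, the example below uses step() from base R with backward elimination by AIC. The direction, the stand-in data (mtcars), and the object names full_model and step_model are assumptions made for illustration, not the settings used in this analysis.

```r
# Stand-in data: mtcars replaces train.df, and mpg replaces price.
full_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC; the direction is an assumption,
# since the source does not show which one was used.
step_model <- step(full_model, direction = "backward", trace = 0)

summary(step_model)
```

Setting trace = 0 suppresses the step-by-step log; leaving it at the default prints the AIC of every candidate model.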
Assess the accuracy of the model against both the training set and the validation set
# Accuracy against training dataset
pred_tm <- predict(MLR.step, train.df)
accuracy(pred_tm, train.df$price)
## ME RMSE MAE MPE MAPE
## Test set 9.85429e-15 0.4452499 0.3171445 -0.2505931 3.573375
# Accuracy against validation dataset
pred_vm <- predict(MLR.step, valid.df)
accuracy(pred_vm, valid.df$price)
## ME RMSE MAE MPE MAPE
## Test set -0.01310809 0.3997252 0.2954494 -0.3040757 3.337518
# RMSE gap between training and validation dataset
# (the RMSE values below are hard-coded and differ from the accuracy() output above)
RMSE_gap <- (0.5370704-0.3925947)/0.3925947
print(RMSE_gap)
## [1] 0.3680022
# MAE gap between training and validation dataset
# (the MAE values below are hard-coded and differ from the accuracy() output above)
MAE_gap <- (0.3370304-0.3060701)/0.3060701
print(MAE_gap)
## [1] 0.1011543
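The gaps can also be computed directly from the predictions, which avoids copying numbers by hand. This is a minimal sketch in base R. The helper names rmse, mae, and relative_gap are hypothetical, the gap is assumed to be measured relative to the training metric, and mtcars stands in for the Airbnb data.

```r
# Root mean squared error and mean absolute error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Relative gap of the validation metric with respect to the training metric
# (this orientation is an assumption)
relative_gap <- function(valid_metric, train_metric) {
  (valid_metric - train_metric) / train_metric
}

# Stand-in data and model (mtcars instead of the Airbnb listings)
set.seed(62)
train_rows <- sample(seq_len(nrow(mtcars)), size = floor(0.6 * nrow(mtcars)))
train <- mtcars[train_rows, ]
valid <- mtcars[-train_rows, ]
fit <- lm(mpg ~ wt + hp, data = train)

rmse_train <- rmse(train$mpg, predict(fit, train))
rmse_valid <- rmse(valid$mpg, predict(fit, valid))
mae_train  <- mae(train$mpg, predict(fit, train))
mae_valid  <- mae(valid$mpg, predict(fit, valid))

print(relative_gap(rmse_valid, rmse_train))
print(relative_gap(mae_valid, mae_train))
```

A gap close to zero suggests the model performs similarly on both sets, while a large positive gap points to overfitting.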
The KNN predictive model was constructed through the following steps to predict whether a given rental property in Monserrat has Kitchen amenities.
rental <- data_new[3, ] %>%
select(price, accommodates, bedrooms, beds, bathrooms, minimum_nights,
maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy,
review_scores_cleanliness, review_scores_checkin, review_scores_communication,
review_scores_location, review_scores_value, years,
host_response_rate, host_acceptance_rate)
knn_var <- data_new %>%
select(price, accommodates, bedrooms, beds, bathrooms, minimum_nights,
maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy,
review_scores_cleanliness, review_scores_checkin, review_scores_communication,
review_scores_location, review_scores_value, years,
host_response_rate, host_acceptance_rate, Kitchen, id)
# Setting seed for reproducibility
set.seed(250)
# Random sampling the dataset index without replacement with 60% for training set
train_index_knn <- sample(c(1:nrow(knn_var)), nrow(knn_var)*0.6)
# Partition the dataset into training and validation set based on the index sampling
train_df_knn <- knn_var[train_index_knn, ]
valid_df_knn <- knn_var[-train_index_knn, ]
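The predictors were normalized because they are measured on different scales, and KNN distances would otherwise be dominated by the variables with the largest ranges.
# Initializing the normalized training, validation, and complete data frames as copies of the originals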
train_norm_df_knn <- train_df_knn
valid_norm_df_knn <- valid_df_knn
knn_var_norm<- knn_var
# Using preProcess() from the caret package to center and scale the predictors with the training set's statistics
norm_values_knn <- preProcess(train_df_knn[,1:18], method=c("center", "scale"))
train_norm_df_knn[,1:18] <- predict(norm_values_knn, train_df_knn[,1:18])
valid_norm_df_knn[,1:18] <- predict(norm_values_knn, valid_df_knn[,1:18])
knn_var_norm[,1:18] <- predict(norm_values_knn, knn_var[,1:18])
# Normalizing rental dataframe
rental_norm <- predict(norm_values_knn, rental)
# Creating knn model to predict whether rental has kitchen amenities
rental_nn <- knn(train=train_norm_df_knn[,1:18], test=rental_norm, cl=train_norm_df_knn$Kitchen, k=7)
# Checking the summary of the knn model prediction, including the 7 nearest neighbors index and distance
attributes(rental_nn)
## $levels
## [1] "1"
##
## $class
## [1] "factor"
##
## $nn.index
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 174 18 113 270 104 32 4
##
## $nn.dist
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0 1.519194 1.690486 1.69414 1.707484 1.881108 1.909966
# Initialize a data frame with two columns: k, and accuracy
accuracy_df_knn <- data.frame(k=seq(1,14,1), accuracy=rep(0,14))
# Compute knn for different k on validation
for(i in 1:14){
knn.pred <- knn(train_norm_df_knn[,1:18], valid_norm_df_knn[,1:18],
cl = train_norm_df_knn$Kitchen, k=i)
accuracy_df_knn[i,2] <- confusionMatrix(knn.pred, valid_norm_df_knn$Kitchen)$overall[1] %>% round(3)
}
accuracy_df_knn
## k accuracy
## 1 1 0.951
## 2 2 0.938
## 3 3 0.960
## 4 4 0.965
## 5 5 0.951
## 6 6 0.951
## 7 7 0.947
## 8 8 0.947
## 9 9 0.947
## 10 10 0.947
## 11 11 0.947
## 12 12 0.947
## 13 13 0.947
## 14 14 0.947
The optimum value k=4 was chosen because it gave the highest accuracy (0.965) when the model was tested against the validation set.
# Creating knn model to predict whether rental has kitchen amenities
rental_nn <- knn(train=train_norm_df_knn[,1:18], test=rental_norm, cl=train_norm_df_knn$Kitchen, k=4)
# Checking the summary of the knn model prediction, including the 4 nearest neighbors index and distance
attributes(rental_nn)
## $levels
## [1] "1"
##
## $class
## [1] "factor"
##
## $nn.index
## [,1] [,2] [,3] [,4]
## [1,] 174 18 113 270
##
## $nn.dist
## [,1] [,2] [,3] [,4]
## [1,] 0 1.519194 1.690486 1.69414
data_new[3,24]
## # A tibble: 1 × 1
## Kitchen
## <fct>
## 1 1
Summary
To predict whether a rental property in the Monserrat neighborhood has kitchen amenities, a KNN predictive model was constructed through a series of steps. First, the third observation in the dataset was selected, and its amenities information was removed to create a test observation. Next, a numeric data frame was built for the KNN model using the following predictors: price, accommodates, bedrooms, beds, bathrooms, minimum_nights, maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, years, host_response_rate, and host_acceptance_rate. Only numeric predictors were chosen because KNN relies on distances between observations.
The dataset was then partitioned into training (60%) and validation (40%) sets, and the predictors were normalized to account for their different scales. A first KNN model was built with an arbitrary k value of 7.
To refine the model, k values from 1 to 14 were tested against the validation set, and k=4 was selected because it gave the highest accuracy (0.965). The final KNN model with k=4 predicted that the third observation has Kitchen amenities (class 1), which matches its actual value in data_new. Because the nearest neighbor lies at distance 0, the observation itself appears to be part of the training set, so this match is not an independent test of the model.
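The workflow summarized above (partition, normalize with the training statistics, then pick k by validation accuracy) can be sketched in base R without extra packages. This is an illustrative sketch only: iris stands in for the Airbnb data, Species stands in for Kitchen, and the helper knn_predict_one is hypothetical. Unlike FNN's knn(), it breaks voting ties by factor level order.

```r
# Stand-in data: iris replaces the Airbnb listings, Species replaces Kitchen.
set.seed(250)
train_rows <- sample(seq_len(nrow(iris)), size = floor(0.6 * nrow(iris)))
train <- iris[train_rows, ]
valid <- iris[-train_rows, ]

# Standardize with the training means and standard deviations only
mu    <- colMeans(train[, 1:4])
sigma <- apply(train[, 1:4], 2, sd)
train_x <- scale(train[, 1:4], center = mu, scale = sigma)
valid_x <- scale(valid[, 1:4], center = mu, scale = sigma)

# Hypothetical helper: classify one point by majority vote of its k nearest neighbors
knn_predict_one <- function(x, train_x, train_y, k) {
  dists <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  votes <- table(train_y[order(dists)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# Validation accuracy for k = 1..14
accuracy_by_k <- sapply(1:14, function(k) {
  pred <- apply(valid_x, 1, knn_predict_one,
                train_x = train_x, train_y = train$Species, k = k)
  mean(pred == as.character(valid$Species))
})

best_k <- which.max(accuracy_by_k)
print(data.frame(k = 1:14, accuracy = round(accuracy_by_k, 3)))
print(best_k)
```

When several k values share the highest accuracy, which.max() returns the smallest of them.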
The Naive Bayes model was built through the following steps:
# Importing data
merged2 <- data_new
# Create a vector of column names to keep
keep_vars <- c("property_type", "room_type", "accommodates", "bedrooms", "beds", "price", "minimum_nights", "maximum_nights", "number_of_reviews", "instant_bookable", "bathrooms", "shared_bathroom", "Kitchen", "Wifi", "Air_conditioning", "Elevator", "Dishes_and_silverware", "Washer", "Body_soap", "Microwave", "Paid_parking_off_premises", "host_response_rate", "host_identity_verified", "short_term_availability", "long_term_availability","review_scores_rating")
# Subset the merged2 dataframe to keep only the selected columns
merged2 <- subset(merged2, select = keep_vars)
# Binning 'accommodates'
quantiles <- quantile(merged2$accommodates, probs = c(0.5))
breaks <- c(0, quantiles, Inf)
labels <- c("Small", "Large")
merged2$accommodates <- cut(merged2$accommodates, breaks = breaks, labels = labels)
table(merged2$accommodates)
##
## Small Large
## 304 259
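# Binning 'bedrooms'
# Note: with breaks at 0, 1, 2, and Inf, the three bins hold 1, 2, and 3+ bedrooms, regardless of the label text
merged2$bedrooms <- cut(merged2$bedrooms, breaks = c(0, 1, 2, Inf), labels = c("1-2", "3-4", "5+"))
# Binning 'beds' (same breaks, so the bins hold 1, 2, and 3+ beds)
merged2$beds <- cut(merged2$beds, breaks = c(0, 1, 2, Inf), labels = c("1-2", "3-4", "5+"))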
# Binning 'bathrooms'
merged2$bathrooms <- cut(merged2$bathrooms, breaks = c(0, 1, 2, 3, Inf), labels = c("1", "2", "3", "4+"))
# Binning 'minimum_nights' into quartiles
# Jitter once so the binned values and the quantile breaks come from the same vector
min_nights_jittered <- merged2$minimum_nights + runif(nrow(merged2), -0.0001, 0.0001)
merged2$minimum_nights <- cut(min_nights_jittered,
breaks = quantile(min_nights_jittered, probs = seq(0, 1, 0.25)),
labels = c("1", "2", "3", "4+"), include.lowest = TRUE)
# Add a small amount of noise to 'maximum_nights' so the quantile breaks are unique
merged2$maximum_nights <- merged2$maximum_nights + runif(nrow(merged2), -0.0001, 0.0001)
# Binning 'maximum_nights' into quartiles; include.lowest keeps the minimum value in the first bin
merged2$maximum_nights <- cut(merged2$maximum_nights, breaks = quantile(merged2$maximum_nights, probs = seq(0, 1, 0.25)), labels = c("1-3", "4-7", "8-14", "15+"), include.lowest = TRUE)
# Binning 'number_of_reviews'
merged2$number_of_reviews <- cut(merged2$number_of_reviews, breaks = quantile(merged2$number_of_reviews, probs = seq(0, 1, 0.25)), labels = c("1-7", "8-23", "24-56", "57+"))
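# Binning 'host_response_rate' into quartiles
# Jitter once so the binned values and the quantile breaks come from the same vector
response_rate_jittered <- jitter(merged2$host_response_rate)
merged2$host_response_rate <- cut(response_rate_jittered,
breaks = quantile(response_rate_jittered,
probs = seq(0, 1, 0.25),
na.rm = TRUE),
labels = c("<75%", "75-94%", "95-99%", "100%"),
include.lowest = TRUE)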
# Binning 'review_scores_rating'
# Add jitter to the data
merged2$review_scores_rating<- jitter(merged2$review_scores_rating, amount = 0.001)
quantiles <- quantile(merged2$review_scores_rating, probs = seq(0, 1, 0.25), na.rm = TRUE)
if (length(unique(quantiles)) == length(quantiles)) {
# Bin the data
merged2$review_scores_rating <- cut(merged2$review_scores_rating,
breaks = quantiles,
labels = c("<80", "80-90", "90-95", "95+"),
include.lowest = TRUE)
} else {
cat("Quantiles are not unique. Please consider using different probabilities or jitter amount.")
}
# Binning 'Price'
# Calculate the quantiles for equal frequency binning
quantiles <- quantile(merged2$price, probs = seq(0, 1, length.out = 3 + 1), na.rm = TRUE, type = 5)
# Generate labels for the bins
bin_labels <- c("Low", "Medium", "High")
# Bin the data
merged2$price <- cut(merged2$price, breaks = quantiles, labels = bin_labels, include.lowest = TRUE)
# Select the categorical variables
variables <- c("property_type", "room_type", "accommodates", "bedrooms", "beds", "price", "minimum_nights", "maximum_nights", "number_of_reviews", "bathrooms", "shared_bathroom", "Kitchen", "Wifi", "Air_conditioning", "Elevator", "Dishes_and_silverware", "Washer", "Body_soap", "Microwave", "Paid_parking_off_premises", "host_response_rate", "host_identity_verified", "short_term_availability", "long_term_availability", "review_scores_rating")
# Reshape the dataset
merged2_long <- merged2 %>%
select(one_of(variables), instant_bookable) %>%
gather(key = "variable", value = "value", -instant_bookable)
## Warning: attributes are not identical across measure variables; they will be
## dropped
# Create the faceted barplot
p <- ggplot(merged2_long, aes(x = value, fill = instant_bookable)) +
geom_bar(position = "dodge") +
theme_minimal() +
facet_wrap(~variable, scales = "free_x", ncol = 5) +
xlab("Value") +
ylab("Count") +
scale_fill_discrete(name = "Instant Bookable")
print(p)
Based on the bar plots, the long_term_availability, short_term_availability, Air_conditioning, number_of_reviews, price, minimum_nights, and review_scores_rating variables are distributed similarly for instant-bookable and non-instant-bookable listings, so they are unlikely to add much predictive power to a Naive Bayes model. These variables are therefore removed.
# List of variables to remove
variables_to_remove <- c("long_term_availability", "short_term_availability", "Air_conditioning", "number_of_reviews", "price", "minimum_nights", "review_scores_rating")
# Remove the variables
merged2 <- merged2 %>%
select(-one_of(variables_to_remove))
# Set the seed for reproducibility
set.seed(42)
# Create a 60-40 split for training and testing sets
train_index <- createDataPartition(merged2$instant_bookable, p = 0.6, list = FALSE)
train_set <- merged2[train_index, ]
test_set <- merged2[-train_index, ]
# Build the Naive Bayes model using naiveBayes() function
nb_model <- naiveBayes(instant_bookable ~ ., data = train_set)
# Summary of the model
print(nb_model)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## FALSE TRUE
## 0.5739645 0.4260355
##
## Conditional probabilities:
## property_type
## Y Entire condo Entire home Entire loft Entire rental unit
## FALSE 0.139175258 0.005154639 0.041237113 0.551546392
## TRUE 0.111111111 0.006944444 0.048611111 0.423611111
## property_type
## Y Entire serviced apartment Entire vacation home
## FALSE 0.056701031 0.020618557
## TRUE 0.083333333 0.013888889
## property_type
## Y Private room in bed and breakfast Private room in casa particular
## FALSE 0.030927835 0.010309278
## TRUE 0.000000000 0.000000000
## property_type
## Y Private room in condo Private room in guesthouse Private room in home
## FALSE 0.015463918 0.000000000 0.020618557
## TRUE 0.055555556 0.000000000 0.000000000
## property_type
## Y Private room in hostel Private room in pension
## FALSE 0.000000000 0.005154639
## TRUE 0.006944444 0.000000000
## property_type
## Y Private room in rental unit Private room in serviced apartment
## FALSE 0.077319588 0.015463918
## TRUE 0.104166667 0.006944444
## property_type
## Y Room in hostel Room in hotel Shared room in condo Shared room in hostel
## FALSE 0.000000000 0.000000000 0.000000000 0.000000000
## TRUE 0.034722222 0.055555556 0.000000000 0.013888889
## property_type
## Y Shared room in hotel Shared room in rental unit
## FALSE 0.005154639 0.005154639
## TRUE 0.000000000 0.027777778
## property_type
## Y Shared room in serviced apartment
## FALSE 0.000000000
## TRUE 0.006944444
##
## room_type
## Y Entire home/apt Hotel room Private room Shared room
## FALSE 0.81443299 0.00000000 0.17525773 0.01030928
## TRUE 0.68750000 0.03472222 0.22916667 0.04861111
##
## accommodates
## Y Small Large
## FALSE 0.5463918 0.4536082
## TRUE 0.6041667 0.3958333
##
## bedrooms
## Y 1-2 3-4 5+
## FALSE 0.76804124 0.17010309 0.06185567
## TRUE 0.77083333 0.13888889 0.09027778
##
## beds
## Y 1-2 3-4 5+
## FALSE 0.5051546 0.2371134 0.2577320
## TRUE 0.5208333 0.2152778 0.2638889
##
## maximum_nights
## Y 1-3 4-7 8-14 15+
## FALSE 0.1917098 0.2746114 0.2383420 0.2953368
## TRUE 0.3055556 0.2569444 0.2708333 0.1666667
##
## bathrooms
## Y 1 2 3 4+
## FALSE 0.78350515 0.13402062 0.02061856 0.06185567
## TRUE 0.86111111 0.09027778 0.02777778 0.02083333
##
## shared_bathroom
## Y No Yes
## FALSE 0.8402062 0.1597938
## TRUE 0.7638889 0.2361111
##
## Kitchen
## Y 0 1
## FALSE 0.005154639 0.994845361
## TRUE 0.076388889 0.923611111
##
## Wifi
## Y 0 1
## FALSE 0.09793814 0.90206186
## TRUE 0.08333333 0.91666667
##
## Elevator
## Y 0 1
## FALSE 0.3762887 0.6237113
## TRUE 0.5277778 0.4722222
##
## Dishes_and_silverware
## Y 0 1
## FALSE 0.08762887 0.91237113
## TRUE 0.25000000 0.75000000
##
## Washer
## Y 0 1
## FALSE 0.7164948 0.2835052
## TRUE 0.7222222 0.2777778
##
## Body_soap
## Y 0 1
## FALSE 0.7010309 0.2989691
## TRUE 0.7638889 0.2361111
##
## Microwave
## Y 0 1
## FALSE 0.2938144 0.7061856
## TRUE 0.4375000 0.5625000
##
## Paid_parking_off_premises
## Y 0 1
## FALSE 0.6907216 0.3092784
## TRUE 0.6875000 0.3125000
##
## host_response_rate
## Y <75% 75-94% 95-99% 100%
## FALSE 0.1907216 0.2938144 0.3350515 0.1804124
## TRUE 0.3496503 0.2377622 0.2097902 0.2027972
##
## host_identity_verified
## Y FALSE TRUE
## FALSE 0.08247423 0.91752577
## TRUE 0.06944444 0.93055556
# Generate predictions for the test set
predictions <- predict(nb_model, test_set)
# Convert predictions and test_set$instant_bookable to factors
predictions_factor <- factor(predictions, levels = c("FALSE", "TRUE"))
test_set_factor <- factor(test_set$instant_bookable, levels = c("FALSE", "TRUE"))
# Create the confusion matrix
cm <- confusionMatrix(predictions_factor, test_set_factor)
# Print the confusion matrix
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 96 60
## TRUE 33 36
##
## Accuracy : 0.5867
## 95% CI : (0.5193, 0.6517)
## No Information Rate : 0.5733
## P-Value [Acc > NIR] : 0.369214
##
## Kappa : 0.1236
##
## Mcnemar's Test P-Value : 0.007016
##
## Sensitivity : 0.7442
## Specificity : 0.3750
## Pos Pred Value : 0.6154
## Neg Pred Value : 0.5217
## Prevalence : 0.5733
## Detection Rate : 0.4267
## Detection Prevalence : 0.6933
## Balanced Accuracy : 0.5596
##
## 'Positive' Class : FALSE
##
# Create a data frame for the fictional apartment
kalibataCity <- data.frame(
property_type = "Entire rental unit",
room_type = "Entire home/apt",
accommodates = "Small",
bedrooms = "1-2",
beds = "1-2",
maximum_nights = "4-7",
bathrooms = "1",
shared_bathroom = "No",
Kitchen = "1",
Wifi = "1",
Elevator = "0",
Dishes_and_silverware = "1",
Washer = "0",
Body_soap = "1",
Microwave = "1",
Paid_parking_off_premises = "1",
host_response_rate = "95-99%",
host_identity_verified = "TRUE"
)
# Make the prediction
prediction <- predict(nb_model, kalibataCity)
# Print the prediction result
print(prediction)
## [1] FALSE
## Levels: FALSE TRUE
Summary
To build the predictive model, the data was first preprocessed and cleaned: certain variables were converted to numeric, numeric variables were binned (mostly by equal frequency), and several variables were converted to factors so that they could serve as inputs to the Naive Bayes model. Index variables such as names were excluded because they carry no meaning for the model. For feature selection, we drew bar plots of all remaining variables, split by instant_bookable. When a variable's distribution looked similar for both classes, we considered it to have low predictive power and removed it from the model.
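Equal-frequency binning, as described above, places roughly the same number of observations in each bin by using quantiles as breakpoints. The following is a minimal sketch; the helper name equal_freq_bin and the example values are hypothetical and are not taken from the Airbnb data.

```r
# Hypothetical helper: split a numeric vector into n_bins groups of
# roughly equal size using quantile breakpoints.
equal_freq_bin <- function(x, n_bins, labels = NULL) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE)
  if (anyDuplicated(breaks) > 0) {
    stop("Quantile breakpoints are not unique; use fewer bins.")
  }
  # include.lowest = TRUE keeps the minimum value inside the first bin
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

example_values <- c(12, 45, 7, 30, 22, 18, 60, 5, 41, 27, 9, 33)
bins <- equal_freq_bin(example_values, 3, labels = c("Low", "Medium", "High"))
print(table(bins))
```

Without include.lowest = TRUE, cut() would return NA for the smallest value, because its intervals are open on the left by default.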
After feature selection, we partitioned the data into 60% for training and 40% for testing. The Naive Bayes model was trained on the training data and evaluated on the test data, where it achieved an accuracy of 0.5867. This is only slightly above the no-information rate of 0.5733 (p = 0.369), so the model performs little better than always predicting the majority class. In addition, we created a fictional apartment named “Kalibata City” to try the model in a practical scenario. This apartment had specific attributes such as property type, room type, accommodation size, number of bedrooms and beds, maximum nights, number of bathrooms, shared bathroom status, and various amenities. We passed these details to the trained Naive Bayes model to predict whether the apartment would be instant bookable (TRUE) or not (FALSE).
The model returned a prediction of “FALSE,” indicating that, based on the given features, this apartment is more likely not to be instant bookable.
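To illustrate how a Naive Bayes prediction like this one is formed, the sketch below multiplies the prior by a few conditional probabilities and normalizes the result. The numbers are copied from the printed model output for three of the apartment's attributes (room_type, Elevator, and Dishes_and_silverware). Because the real model uses all eighteen predictors, the posterior shown here is only a partial illustration and not the model's actual output.

```r
# A-priori probabilities copied from the model output
prior <- c("FALSE" = 0.5739645, "TRUE" = 0.4260355)

# Conditional probabilities for three of the apartment's attributes
# (only a subset of the predictors, chosen for illustration)
cond <- rbind(
  room_type_entire_home = c("FALSE" = 0.81443299, "TRUE" = 0.68750000),
  elevator_0            = c("FALSE" = 0.3762887,  "TRUE" = 0.5277778),
  dishes_1              = c("FALSE" = 0.91237113, "TRUE" = 0.75000000)
)

# Naive Bayes assumes the predictors are independent given the class,
# so the class score is the prior times the product of the conditionals
unnormalized <- prior * apply(cond, 2, prod)
posterior <- unnormalized / sum(unnormalized)

print(round(posterior, 4))
```

The class with the larger posterior is the predicted class.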
The classification tree predictive model was built through the following steps:
# Binning the rating into three equal-frequency groups
merged <- data_new %>%
mutate(rating_bin = ntile(review_scores_rating, 3))
merged$rating_bin <- factor(merged$rating_bin, labels = c("low","medium","high"))
table(merged$rating_bin)
##
## low medium high
## 188 188 187
# Remove id, name, latitude, longitude, and host_id because index-like variables are irrelevant; also drop review_scores_rating, which is replaced by rating_bin. Then prepare the other variables as tree model inputs
merged <- select(merged, -c(id, name, latitude, longitude, host_id,review_scores_rating))
merged$host_acceptance_rate[merged$host_acceptance_rate == "N/A"] <- 0
merged$host_acceptance_rate <- as.numeric(gsub("%", "", merged$host_acceptance_rate))
merged$host_response_rate[merged$host_response_rate == "N/A"] <- 0
merged$host_response_rate <- as.numeric(gsub("%", "", merged$host_response_rate))
# Binning property type because it contains many categories. It is grouped into Entire Home, Private Room, and Other
merged <- merged %>%
mutate(property_type_bin = case_when(
property_type %in% c("Entire home", "Entire apartment", "Entire condo", "Entire serviced apartment", "Entire villa", "Entire townhouse") ~ "Entire Home",
property_type %in% c("Private room in rental unit", "Private room in condo", "Private room in home", "Private room in serviced apartment", "Private room in villa", "Private room in townhouse") ~ "Private Room",
TRUE ~ "Other"
))
merged <- select(merged, -property_type)
# Remove the other review score columns because they are redundant with the review scores rating
merged <- subset(merged, select = -c(review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_location, review_scores_value,review_scores_communication))
# Split the data into training and testing sets
set.seed(123)
train_idx <- sample(nrow(merged), 0.6*nrow(merged))
train_data <- merged[train_idx, ]
test_data <- merged[-train_idx, ]
# Define the control parameters for tree building
ctrl <- rpart.control(minsplit = 20, xval = 10)
# Build the tree with cross-validation
tree_fit <- rpart(rating_bin ~ ., data = train_data, method = "class", control = ctrl)
printcp(tree_fit)
##
## Classification tree:
## rpart(formula = rating_bin ~ ., data = train_data, method = "class",
## control = ctrl)
##
## Variables actually used in tree construction:
## [1] bedrooms host_acceptance_rate
## [3] host_is_superhost Microwave
## [5] number_of_reviews Paid_parking_off_premises
## [7] price room_type
## [9] short_term_availability years
##
## Root node error: 223/337 = 0.66172
##
## n= 337
##
## CP nsplit rel error xerror xstd
## 1 0.295964 0 1.00000 1.10762 0.036421
## 2 0.071749 1 0.70404 0.71749 0.041108
## 3 0.017937 2 0.63229 0.68161 0.040963
## 4 0.015695 5 0.57848 0.72197 0.041120
## 5 0.013453 7 0.54709 0.72197 0.041120
## 6 0.011211 10 0.50673 0.73094 0.041139
## 7 0.010000 12 0.48430 0.73094 0.041139
# Determine the optimal CP value
optimal_cp <- tree_fit$cptable[which.min(tree_fit$cptable[,"xerror"]),"CP"]
optimal_cp
## [1] 0.01793722
# Prune the tree with the optimal CP value
pruned_tree_fit <- prune(tree_fit, cp = optimal_cp)
# Plot the pruned tree
rpart.plot(pruned_tree_fit, box.palette = "Greens")
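# Predict on test data and build confusion matrix
# Note: confusionMatrix() takes the predictions first and the reference second. The call below passes
# them in the opposite order, so the rows labelled "Prediction" in the output are the actual classes.
# Overall accuracy and kappa are not affected by the order.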
test_pred <- predict(pruned_tree_fit, test_data, type = "class")
confusionMatrix(test_data$rating_bin, test_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low medium high
## low 42 2 31
## medium 33 44 1
## high 12 8 53
##
## Overall Statistics
##
## Accuracy : 0.615
## 95% CI : (0.5482, 0.6788)
## No Information Rate : 0.385
## P-Value [Acc > NIR] : 2.319e-12
##
## Kappa : 0.424
##
## Mcnemar's Test P-Value : 5.656e-09
##
## Statistics by Class:
##
## Class: low Class: medium Class: high
## Sensitivity 0.4828 0.8148 0.6235
## Specificity 0.7626 0.8023 0.8582
## Pos Pred Value 0.5600 0.5641 0.7260
## Neg Pred Value 0.7020 0.9324 0.7908
## Prevalence 0.3850 0.2389 0.3761
## Detection Rate 0.1858 0.1947 0.2345
## Detection Prevalence 0.3319 0.3451 0.3230
## Balanced Accuracy 0.6227 0.8086 0.7408
table(test_data$rating_bin, test_pred)
## test_pred
## low medium high
## low 42 2 31
## medium 33 44 1
## high 12 8 53
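In table(test_data$rating_bin, test_pred), the rows are the actual classes and the columns are the predicted classes. The overall accuracy and the per-class recall can be recomputed by hand from these counts, as in the following sketch. The counts are copied from the table above; the object names tab, accuracy, and recall are chosen for illustration.

```r
# Counts copied from the table above: rows = actual class, columns = predicted class
tab <- matrix(c(42, 33, 12,
                 2, 44,  8,
                31,  1, 53),
              nrow = 3,
              dimnames = list(actual    = c("low", "medium", "high"),
                              predicted = c("low", "medium", "high")))

# Overall accuracy: share of observations on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)

# Recall per class: correctly predicted share of each actual class (row-wise)
recall <- diag(tab) / rowSums(tab)

print(round(accuracy, 4))
print(round(recall, 4))
```

Most of the errors come from actual low ratings predicted as high and actual medium ratings predicted as low.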
# Create confusion matrix
conf_mat <- confusionMatrix(test_data$rating_bin, test_pred)
Summary
In developing a classification tree model to predict Airbnb listing ratings, various features were evaluated for their potential influence on the ratings. The dataset contained attributes such as host acceptance rate, host response rate, and property types, among others. These features were considered relevant since they could impact guests’ experiences and subsequently affect their ratings. To facilitate model building, cleaning and preprocessing steps were carried out, including converting percentages to numeric values, removing index variables, and categorizing property types into broader groups.
During the exploration of different models, an interesting observation was the trade-off between the number of bins and model accuracy: increasing the number of bins tended to reduce accuracy, likely because of overfitting and class imbalance. To address this, the ratings were divided into three equal-frequency bins: low, medium, and high. Any remaining slight imbalance between the bins may still have affected the model’s ability to generalize to unseen data.
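As an illustration of the equal-frequency binning described above, here is a minimal sketch using tercile cut points. The `ratings` vector is synthetic and the variable names are assumptions for the example, not the ones used in the analysis.

```r
# Hypothetical ratings; the real analysis bins the listing review scores
set.seed(1)
ratings <- round(runif(300, min = 3, max = 5), 2)

# Tercile cut points at 0%, 33.3%, 66.7% and 100%
breaks <- quantile(ratings, probs = seq(0, 1, length.out = 4))

# Assign each rating to one of three ordered bins
rating_bin <- cut(ratings, breaks = breaks,
                  labels = c("low", "medium", "high"),
                  include.lowest = TRUE)

# Class counts should be close to equal; ties at a cut point can skew them
table(rating_bin)
```

With heavily tied data, quantile cut points can coincide, in which case `cut()` fails on non-unique breaks, so the bins may need to be checked by hand.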
The final model was determined through a systematic process involving data splitting, tree building with cross-validation, and pruning based on the optimal CP value. The optimal CP value was found to be 0.01793722, which guided the pruning process to achieve a balance between tree complexity and classification error. The model’s performance was evaluated using a confusion matrix, and the overall accuracy was found to be 0.615, above the reported no-information rate of 0.385, indicating reasonable performance for a classification problem with three categories.
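The overall accuracy can be checked by hand from the counts in the confusion matrix printed earlier. The sketch below rebuilds that table (rows are actual classes and columns are predicted classes, as in the `table(test_data$rating_bin, test_pred)` output) and recomputes accuracy and per-class recall; the object names are chosen for this example only.

```r
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class)
cm <- matrix(c(42,  2, 31,
               33, 44,  1,
               12,  8, 53),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual    = c("low", "medium", "high"),
                             predicted = c("low", "medium", "high")))

# Overall accuracy: correctly classified listings over all test listings
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)

# Per-class recall: share of each actual class that was predicted correctly
recall <- diag(cm) / rowSums(cm)
round(recall, 4)
```

Because the actual labels were passed as the first argument to `confusionMatrix()`, these recall values correspond to the row labelled "Pos Pred Value" in the printed statistics.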
First, k-means clustering is chosen over hierarchical clustering because of its computational efficiency on a dataset of 563 observations and 41 variables.
Second, because k-means clustering is chosen, only numeric variables are passed to the model; categorical fields such as name and host_response_time are dropped, as are latitude and longitude. Variables that could be converted to numeric values, such as host_acceptance_rate and host_response_rate, were converted after data manipulation.
Third, an elbow chart is created to see the general trend of the total within-cluster sum of squares against the number of clusters. Because there was no clear kink in the chart, the cluster centers for different values of k were inspected manually. According to this analysis, any k of 4 or above does not provide discernible information for interpretation. Hence, k=3 was chosen as the number of clusters.
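To make the elbow reasoning concrete, the following is a minimal sketch on synthetic data (the matrix `x` and the sizes `n` and `p` are assumptions for the example). It also shows a useful sanity check: on standardized data, the total within-cluster sum of squares for k=1 equals (n - 1) × p, which is consistent with the first WSS value reported further down (10116 = 562 × 18 for 563 listings and 18 variables).

```r
set.seed(699)

# Synthetic stand-in for the numeric listing features (hypothetical data)
n <- 200
p <- 5
x <- matrix(rnorm(n * p), nrow = n)
x_norm <- scale(x)

# Total within-cluster sum of squares for k = 1..8
wss_demo <- sapply(1:8, function(k) {
  kmeans(x_norm, centers = k, nstart = 20, iter.max = 30)$tot.withinss
})

# With a single cluster on standardized data, WSS equals (n - 1) * p
all.equal(wss_demo[1], (n - 1) * p)

# Relative drop in WSS from each k to the next; a sharp fall-off marks an elbow
round(-diff(wss_demo) / head(wss_demo, -1), 3)
```

When the relative drops decline smoothly, as they do for unstructured data like this, the chart has no clear kink and the choice of k has to rest on interpretability instead.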
cluster <- as.data.frame(data_new)
row.names(cluster) <- cluster[,1]
cluster <- cluster[,-1]
#Select numeric variables only
num_var <- cluster %>% select(price, accommodates, bedrooms, beds, bathrooms, minimum_nights, maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, years, host_response_rate, host_acceptance_rate)
#Change from string type to numeric type (strip any "%" sign first)
num_var <- num_var %>%
mutate(host_response_rate = as.numeric(gsub("%", "", host_response_rate)),
host_acceptance_rate = as.numeric(gsub("%", "", host_acceptance_rate)))
#Normalize the data
num_var.norm <- sapply(num_var, scale)
row.names(num_var.norm) <- row.names(num_var)
#Create an elbow chart
set.seed(699)
kmax <- 30
wss <- sapply(1:kmax,
function(k){kmeans(num_var.norm, k, nstart=50, iter.max = 30)$tot.withinss})
wss
## [1] 10116.000 8429.449 7616.664 7076.054 6628.485 6125.637 5818.640
## [8] 5372.863 5075.328 4832.822 4670.326 4472.784 4334.985 4206.636
## [15] 4079.128 3973.216 3835.034 3743.977 3649.898 3579.884 3472.252
## [22] 3375.344 3297.592 3244.817 3169.741 3123.603 3011.214 2981.595
## [29] 2919.982 2855.981
plot(1:kmax, wss, type = "b", pch = 20, frame = FALSE, xlab = "Number of Clusters K", ylab = "Total Within-Clusters Sum of Squares")
The elbow plot suggests iterating over k=3 to k=7. After trying these values in a separate script, we found that k=3 splits the data best and gives the most easily interpretable output. Therefore, we chose k=3 for this cluster model.
#Kmeans clustering with k=3
km3 <- kmeans(num_var.norm, 3, nstart=50)
km3$centers
## price accommodates bedrooms beds bathrooms minimum_nights
## 1 -0.05216086 -0.0579185 -0.07405650 -0.07866686 -0.1192584 -0.01767926
## 2 3.43436992 4.0524905 3.89702259 4.12784817 4.3102872 -0.01684691
## 3 -0.13807876 -0.1906681 -0.04307181 -0.04391231 0.2377260 0.13771294
## maximum_nights number_of_reviews review_scores_rating review_scores_accuracy
## 1 -0.005842408 0.05900992 0.2703180 0.2762133
## 2 0.780449770 -0.14880293 -0.2114711 -0.1003965
## 3 -0.077305625 -0.42762224 -2.0323565 -2.0947554
## review_scores_cleanliness review_scores_checkin review_scores_communication
## 1 0.2411751 0.21885475 0.2046853
## 2 -0.1389053 0.02615551 0.2839105
## 3 -1.8210242 -1.67627386 -1.6082843
## review_scores_location review_scores_value years host_response_rate
## 1 0.1691286 0.24683558 0.02136485 0.03503417
## 2 0.2449943 0.08942202 0.23236188 0.09035211
## 3 -1.3305285 -1.89995029 -0.19954732 -0.28180045
## host_acceptance_rate
## 1 0.06116284
## 2 -0.91962408
## 3 -0.32363103
cluster3 <- km3$cluster
# Checking the cluster distance
dist(km3$centers)
## 1 2
## 2 9.186936
## 3 5.443384 10.239104
# Binding the cluster label back to original data
num_var.norm <- cbind(num_var.norm, cluster3)
cluster <- cbind(cluster, cluster3)
head(cluster)
## name latitude longitude
## 16695 DUPLEX LOFT 2 - SAN TELMO -34.61439 -58.37611
## 148284 Sunny Terrace Apart. in Downtown BA -34.61331 -58.38491
## 23798 STUNNING-LIGHT-SPACIOUS LOFT STYLE APT- SAN TELMO -34.61266 -58.37479
## 31514 BEAUTY DUPLEX LOFT #4 SAN TELMO -34.61494 -58.37517
## 42450 French Classic in San Telmo: Balcony over Defensa! -34.61578 -58.37175
## 362556 Sunny Terrace Apart in Center BA -34.61554 -58.38486
## property_type room_type price accommodates bedrooms beds
## 16695 Entire loft Entire home/apt 10354 4 1 1
## 148284 Entire rental unit Entire home/apt 6833 4 1 1
## 23798 Entire condo Entire home/apt 14501 3 2 3
## 31514 Entire loft Entire home/apt 9347 5 1 4
## 42450 Entire condo Entire home/apt 16152 4 2 2
## 362556 Entire rental unit Entire home/apt 10354 4 1 2
## bathrooms shared_bathroom minimum_nights maximum_nights
## 16695 1 No 2 1125
## 148284 1 No 2 1125
## 23798 1 No 3 180
## 31514 1 No 2 365
## 42450 2 No 4 1125
## 362556 1 No 2 365
## number_of_reviews review_scores_rating review_scores_accuracy
## 16695 46 4.28 4.59
## 148284 273 4.72 4.68
## 23798 58 4.89 4.90
## 31514 35 4.26 4.44
## 42450 151 4.81 4.90
## 362556 145 4.79 4.79
## review_scores_cleanliness review_scores_checkin
## 16695 4.29 4.83
## 148284 4.59 4.86
## 23798 4.88 4.91
## 31514 4.03 4.76
## 42450 4.73 4.92
## 362556 4.76 4.96
## review_scores_communication review_scores_location review_scores_value
## 16695 4.80 4.39 4.41
## 148284 4.83 4.62 4.72
## 23798 5.00 4.88 4.86
## 31514 4.67 4.33 4.39
## 42450 4.96 4.94 4.83
## 362556 4.89 4.73 4.66
## instant_bookable Kitchen Wifi Air_conditioning Elevator
## 16695 TRUE 1 1 1 0
## 148284 TRUE 1 1 0 1
## 23798 FALSE 1 1 0 0
## 31514 TRUE 1 1 1 0
## 42450 FALSE 1 1 1 1
## 362556 TRUE 1 1 0 1
## Dishes_and_silverware Washer Body_soap Microwave
## 16695 1 0 0 1
## 148284 1 0 0 1
## 23798 1 0 1 1
## 31514 1 0 0 0
## 42450 1 1 0 1
## 362556 1 0 0 1
## Paid_parking_off_premises total_amenities short_term_availability
## 16695 1 6 0
## 148284 1 6 0
## 23798 1 6 0
## 31514 1 5 0
## 42450 1 8 0
## 362556 1 6 1
## long_term_availability years host_id host_response_time
## 16695 0 13.37808 64880 within an hour
## 148284 0 12.20000 407702 within a day
## 23798 0 12.20000 408551 within an hour
## 31514 1 13.37808 64880 within an hour
## 42450 0 12.77260 185437 within an hour
## 362556 1 12.20000 407702 within a day
## host_response_rate host_acceptance_rate host_is_superhost
## 16695 1.0 1.00 FALSE
## 148284 0.9 0.83 FALSE
## 23798 1.0 1.00 TRUE
## 31514 1.0 1.00 FALSE
## 42450 1.0 1.00 FALSE
## 362556 0.9 0.83 FALSE
## host_identity_verified cluster3
## 16695 TRUE 1
## 148284 TRUE 1
## 23798 TRUE 1
## 31514 TRUE 1
## 42450 TRUE 1
## 362556 TRUE 1
Cluster 1: The number of reviews and most of the review scores are the highest of the three clusters, suggesting that the large number of reviews backs up the quality of the listings as described.
Cluster 2: The price and the numbers of guests accommodated, bedrooms, beds, and bathrooms are the highest. This suggests that the listings in Cluster 2 may be entire houses suited to a group of friends or a family trip.
Cluster 3: The price is the lowest, and the number of reviews and the review scores across the board are the worst.
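The cluster centers above are in standardized units, so it can help to read the same profile in original units. Here is a minimal sketch of per-cluster means using `aggregate()`; the `demo` data frame and its values are hypothetical, standing in for the labelled listing table.

```r
# Hypothetical stand-in for the labelled listing table (values are made up)
demo <- data.frame(
  price                = c(9000, 11000, 60000, 75000, 7000, 8000),
  review_scores_rating = c(4.8, 4.9, 4.6, 4.7, 3.9, 4.0),
  cluster3             = c(1, 1, 2, 2, 3, 3)
)

# Mean of each variable within each cluster, in original (unscaled) units
profile <- aggregate(cbind(price, review_scores_rating) ~ cluster3,
                     data = demo, FUN = mean)
profile
```

The same call on the real table would give average prices and ratings per cluster that can be quoted directly alongside the standardized centers.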
dev.new(width = 12, height = 50)
# Plot the data with x-axis labels
plot(c(0), xaxt = 'n', ylab = "", type = "l", xlab = "", main = "Profile Plot of Centroids",
ylim = c(min(km3$centers), max(km3$centers)), xlim = c(0,18))
axis(1, at = c(1:18), labels = names(num_var), las = 2, cex.axis = 0.6)
lines(km3$centers[1,], lty = 1, lwd = 2, col = "red")
lines(km3$centers[2,], lty = 2, lwd = 2, col = "blue")
lines(km3$centers[3,], lty = 3, lwd = 2, col = "green")
clusters = c("Well-Reviewed & Steady", "Big Vacay", "Cheap & Shady")
text(x = rep(0.5, 3)+1.8, y = c(km3$centers[1,1]+0.5, km3$centers[2,1], km3$centers[3,1]-0.3),
labels = clusters)
mtext("Index", side = 1, line = 10, cex = 0.8)
Description:
The line plot above shows the cluster centroids across each variable. In line with the previous analysis, Cluster “Big Vacay” has distinctly higher price, accommodates, bedrooms, beds, and bathrooms; Cluster “Cheap & Shady” has the lowest review counts and scores across the board; and Cluster “Well-Reviewed & Steady” stays close to zero on every variable, showing its consistent, near-average profile.
dev.new(width = 15, height = 50)
cluster$cluster_label <- ifelse(cluster$cluster3 == 1, clusters[1],
ifelse(cluster$cluster3 == 2, clusters[2], clusters[3]))
cluster$cluster_label <- cluster$cluster_label %>% as.factor()
discretionary <- cluster %>% group_by(cluster_label) %>%
summarize(mean_price = mean(price),
mean_review_scores_rating = mean(review_scores_rating))
ggplot(data = discretionary, aes(x = mean_price, y = mean_review_scores_rating, color = factor(cluster_label))) +
geom_point(size = 4) +
scale_color_manual(values = c("purple", "orange", "green")) +
theme_classic() +
labs(x = "Average Price", y = "Average Review Scores Rating", color = "Clusters", title = "Comparison between Average Price and Average Review Scores Rating") +
geom_text(aes(label = cluster_label),
hjust = 0.1, vjust = 2, size = 3) +
scale_y_continuous(limits = c(0, max(discretionary$mean_review_scores_rating) + 1))
Description:
The scatter plot above shows average price against average review scores rating for each cluster. With only three cluster averages, it shows no clear positive or negative relationship: Cheap & Shady and Well-Reviewed & Steady have very similar average prices but contrasting average ratings, and although Big Vacay has a much higher average price, its average rating is not correspondingly higher.
ggplot(cluster, aes(x = room_type, fill = cluster_label)) + geom_bar(position = "dodge") +
labs(x = "Room Type", y = "Count", fill = "Cluster", title = "Countplot of Cluster Per Room Type") +theme(plot.title = element_text(hjust = 0.5))
Description:
The count plot above shows the number of listings in each cluster per room type. Cluster “Well-Reviewed & Steady” listings are predominantly entire homes/apartments, and Cluster “Big Vacay” has no hotel rooms or shared rooms.
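The counts behind a plot like this can also be read as a cross-tabulation. The sketch below uses a small hypothetical data frame (`demo`, with made-up rows) to show the counts and the within-room-type shares.

```r
# Hypothetical listings with a room type and a cluster label (made-up rows)
demo <- data.frame(
  room_type = c("Entire home/apt", "Entire home/apt", "Private room",
                "Shared room", "Entire home/apt", "Private room"),
  cluster_label = c("Well-Reviewed & Steady", "Well-Reviewed & Steady",
                    "Cheap & Shady", "Cheap & Shady",
                    "Big Vacay", "Well-Reviewed & Steady")
)

# Number of listings per room type (rows) and cluster (columns)
counts <- table(demo$room_type, demo$cluster_label)
counts

# Share of each cluster within a room type (each row sums to 1)
round(prop.table(counts, margin = 1), 2)
```

Row-wise shares make it easier to compare room types with very different listing counts than raw bar heights do.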
The data mining analysis output is a valuable asset for both property owners and prospective tenants in Monserrat. It provides both groups with data-driven insights to make informed decisions about renting and owning property.
For property owners, the data mining analysis output can help improve the service they offer by identifying the features that prospective tenants value the most. By analyzing historical data, the analysis can identify the most sought-after features in a rental property such as location, amenities, and condition. Property owners can use this information to improve their rental offerings and attract more tenants. Additionally, the analysis can provide insights into rental prices and help property owners set prices that match the market and prospective tenants’ expectations.
For prospective tenants, the data mining analysis output can help them easily choose rental properties that match their needs. The clustering model can help tenants identify properties that meet their specific requirements based on location, size, amenities, and other factors. This can save tenants time and effort by narrowing down the available options and selecting only the most suitable properties. Additionally, the analysis can help tenants negotiate better prices by providing insights into the market value of specific rental properties. Finally, by analyzing the features of rental properties, prospective tenants can predict the level of service they can expect from their landlords and make informed decisions about which properties to rent.
In conclusion, the data mining analysis output is an invaluable asset for both property owners and prospective tenants in Monserrat. By providing insights into rental prices, rental features, and service levels, the analysis can help both parties achieve their goals and make data-driven decisions. Ultimately, this can lead to a more efficient and effective rental market that benefits everyone involved.