Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

My data were collected from a data file called Hotel Reservations. I was surprised at how much data were collected on hotel bookings, including the number of adults, children, meal plan, cancellations, etc. This summed up to 19 separate variables based on hotel reservations total. I was curious to see if there was a difference in the number of cancellations based on how early a reservation was made.

Given your question, what is your expectation about the data?

My assumption is the greater ahead of time a person reserves a room, the greater the chance they will cancel. I am assuming this occurs because there are greater costs to canceling a reservation the closer to the date of the booking ( think cancellation fee, less time to find another hotel, etc). The earlier a reservation is made, the less chance of additional costs if the reservation is canceled. This reasoning can’t be shown in the data given, but is my best guess for why this might occur.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

reserve <-read.csv("Data/Hotel_Reservations.csv")

glimpse(reserve)
## Rows: 36,275
## Columns: 19
## $ Booking_ID                           <chr> "INN00001", "INN00002", "INN00003…
## $ no_of_adults                         <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
## $ no_of_children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_weekend_nights                 <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, …
## $ no_of_week_nights                    <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, …
## $ type_of_meal_plan                    <chr> "Meal Plan 1", "Not Selected", "M…
## $ required_car_parking_space           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ room_type_reserved                   <chr> "Room_Type 1", "Room_Type 1", "Ro…
## $ lead_time                            <int> 224, 5, 1, 211, 48, 346, 34, 83, …
## $ arrival_year                         <int> 2017, 2018, 2018, 2018, 2018, 201…
## $ arrival_month                        <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10…
## $ arrival_date                         <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, …
## $ market_segment_type                  <chr> "Offline", "Online", "Online", "O…
## $ repeated_guest                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ avg_price_per_room                   <dbl> 65.00, 106.68, 60.00, 100.00, 94.…
## $ no_of_special_requests               <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, …
## $ booking_status                       <chr> "Not_Canceled", "Not_Canceled", "…

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Luckily there is no missing data, so no additional adjustments needed to be made.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Based on my question of interest, I will only be focusing on the variables

lead_time, numeric: How far ahead the room was reserved in days. I will transform this into a categorical variable as well.

booking_status, categorical: whether the reservation was cancelled or not

market_segment_type, categorical: How/ Why the reservation was booked. This includes aviation, complementary, corporate, offline and online. I will use this variable further towards the end, and will only focus on the online/offline portion.

The first transformation I will make is mutating the lead_time variable into a categorical variable. The range of the values increase after lead_time is greater than 100 as the number of observations decreases significantly. It is also important to note that the way each category was separated was rather arbitrary. I also included the alphabet in front of each category name because I could not get them in the appropriate order otherwise.

reserve <- reserve %>% mutate( lead_cat = case_when( lead_time >= 0 &
                                            lead_time < 20 ~ "a[0,20)",
                                          lead_time >= 20 &
                                            lead_time < 40 ~ "b[20,40)",
                                          lead_time >= 40 &
                                            lead_time < 60 ~ "c[40,60)",
                                          lead_time >= 60 &
                                            lead_time < 80 ~ "d[60,80)",
                                          lead_time >= 80 &
                                            lead_time < 100 ~ "e[80,100)",
                                          lead_time >= 100 &
                                            lead_time < 200 ~ "f[100,200)",
                                          lead_time >= 200 &
                                            lead_time < 300 ~ "g[200,300)",
                                          lead_time >= 300 ~ "h[300, inf)"
                                
                                          )) 

reserve <- reserve %>% filter( market_segment_type %in% c("Offline", "Online")) #This will be used to see if reserving offline or online makes any difference.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

As seen below, lead_cat is a catagorical variable of lead_time, with the categories
a[0,20), b[20,40), c[40,60), d[60,80), e[80,100), f[100,200), g[200,300), h[300, inf)

glimpse(reserve)
## Rows: 33,742
## Columns: 20
## $ Booking_ID                           <chr> "INN00001", "INN00002", "INN00003…
## $ no_of_adults                         <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
## $ no_of_children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_weekend_nights                 <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, …
## $ no_of_week_nights                    <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, …
## $ type_of_meal_plan                    <chr> "Meal Plan 1", "Not Selected", "M…
## $ required_car_parking_space           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ room_type_reserved                   <chr> "Room_Type 1", "Room_Type 1", "Ro…
## $ lead_time                            <int> 224, 5, 1, 211, 48, 346, 34, 83, …
## $ arrival_year                         <int> 2017, 2018, 2018, 2018, 2018, 201…
## $ arrival_month                        <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10…
## $ arrival_date                         <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, …
## $ market_segment_type                  <chr> "Offline", "Online", "Online", "O…
## $ repeated_guest                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ avg_price_per_room                   <dbl> 65.00, 106.68, 60.00, 100.00, 94.…
## $ no_of_special_requests               <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, …
## $ booking_status                       <chr> "Not_Canceled", "Not_Canceled", "…
## $ lead_cat                             <chr> "g[200,300)", "a[0,20)", "a[0,20)…

Are the values what you expected for the variables? Why or Why not?

It looks like separating lead_time into categorical variables was a success, as the numbers in lead_time fit appropriately within the catagories of lead_cat.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

Although summarize helped, I used tabyl with the help of the internet. (https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html) to help produce a table that shows both the counts and percentages of lead_cat by booking_status. You can also see that in order to summarize to work, I had to turn booking_status into a binary variable, with 1 - “Not Canceled”. However I did not need to do this using the tabyl function.

reserve <- reserve %>% mutate(booking1 = 1*(booking_status == "Not_Canceled"))

reserve %>% group_by(lead_cat) %>% summarize(mean = mean(booking1))
## # A tibble: 8 × 2
##   lead_cat     mean
##   <chr>       <dbl>
## 1 a[0,20)     0.863
## 2 b[20,40)    0.773
## 3 c[40,60)    0.766
## 4 d[60,80)    0.710
## 5 e[80,100)   0.693
## 6 f[100,200)  0.502
## 7 g[200,300)  0.278
## 8 h[300, inf) 0.185
reserve %>%  tabyl(booking_status, lead_cat)%>% 
adorn_totals("col") %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns %>% 
  gt()
booking_status a[0,20) b[20,40) c[40,60) d[60,80) e[80,100) f[100,200) g[200,300) h[300, inf) Total
Canceled 13.7% (1089) 22.7% (1058) 23.4% (859) 29.0% (914) 30.7% (734) 49.8% (3886) 72.2% (2174) 81.5% (914) 34.5% (11628)
Not_Canceled 86.3% (6852) 77.3% (3593) 76.6% (2811) 71.0% (2241) 69.3% (1656) 50.2% (3917) 27.8% (836) 18.5% (208) 65.5% (22114)

What are your findings about the summary? Are they what you expected?

Based on the percentages, It appears that the ealier the reservation, the greater the chance that the reservation will be canceled. This is what I expected overall, although it is difficult just by the summary alone how significant it is or how it looks visually.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

I am going to use multiple plots to show the data from different perspectives. The first is a bar graph that shows the difference in observations between early and late reservations. You can see the closer to the reservation time the more observations there, showing that most people reserve a room around 20 or less days before the day they reserved for. This graph does not do a great job in showing the distribution in the number that were canceled vs not canceled.

ggplot(reserve, aes( x = lead_time, fill = booking_status)) +
  geom_bar() +
  theme_minimal()

>This graph gives a better view of the change in distribution of the booking_status based on lead_cat.

ggplot(reserve, aes( x = lead_cat, fill = booking_status)) +
  geom_bar() +
  theme_minimal() 

The final graph shows the distribution looks at the distribution of booking_status by lead_time. This density plot best shows the distribution of booking_status, but doesn’t take into account the observations. For instance, there are very few observations beyond 400 days relative to the amount of observations viewed in between 0 and 100 days.

ggplot(reserve, aes(x = lead_time, fill= booking_status)) +
  geom_density(position = "fill", alpha = .75)

One other observation I wanted to make is if separating the data further into reserved online or offline had any difference in booking status. I used the following graphs to get an idea.

plot1 <- ggplot(reserve, aes( x = lead_cat, fill = booking_status)) +
  geom_bar() +
  theme_minimal() +
  
  facet_wrap(vars(market_segment_type))

plot2 <-ggplot(reserve, aes(x = lead_time, fill= booking_status)) +
  geom_density(position = "fill", alpha = .75) +
  theme_minimal()+
  facet_wrap(vars(market_segment_type))

yonline <- reserve %>% filter( market_segment_type == "Online") %>%   tabyl(booking_status, lead_cat)%>% 
adorn_totals("col") %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns %>% 
  gt()

nonline <- reserve %>% filter( market_segment_type == "Offline") %>%   tabyl(booking_status, lead_cat)%>% 
adorn_totals("col") %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns %>% 
  gt()

grid.arrange(plot1, plot2)

yonline
booking_status a[0,20) b[20,40) c[40,60) d[60,80) e[80,100) f[100,200) g[200,300) h[300, inf) Total
Canceled 15.6% (1005) 29.5% (1011) 31.5% (840) 35.1% (755) 36.5% (633) 55.6% (2788) 79.1% (1185) 93.5% (258) 36.5% (8475)
Not_Canceled 84.4% (5441) 70.5% (2417) 68.5% (1828) 64.9% (1393) 63.5% (1099) 44.4% (2229) 20.9% (314) 6.5% (18) 63.5% (14739)
nonline
booking_status a[0,20) b[20,40) c[40,60) d[60,80) e[80,100) f[100,200) g[200,300) h[300, inf) Total
Canceled 5.6% (84) 3.8% (47) 1.9% (19) 15.8% (159) 15.3% (101) 39.4% (1098) 65.5% (989) 77.5% (656) 29.9% (3153)
Not_Canceled 94.4% (1411) 96.2% (1176) 98.1% (983) 84.2% (848) 84.7% (557) 60.6% (1688) 34.5% (522) 22.5% (190) 70.1% (7375)

Although there is a slight difference between online and offline, I can’t argue a difference based on visuals alone.

Final Summary (10 points)

Summarize your research question and findings below.

Based on my original question if the change in reservation time increases the probability of cancelling, it appears that the odds of observing a cancellation increases the earlier the reservation was made. This is true for whether it was booked online or offline as well.

Are your findings what you expected? Why or Why not?

The finding are what I expected, although I did not do any analysis on the data itself besides observation. I am assuming you would do a regression analysis on probability? A risk analysis of something occurring over time?

I also could also only show that there appears to be a positive trend in cancellations over time, but can’t explain why this occurs.