Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

My research question is: How much does the addition of a bathroom cause the rent price to increase on an apartment in San Francisco? I think real estate pricing is interesting and I run an Airbnb, so I like to know what features increase the value of apartments.

I plan to examine this by plotting out # of bathrooms on the x axis and rent price on the y axis. I also ran a linear regression model with number of bathrooms as the predictor variable and rent price as the outcome variable. I created two models, one which did not control for the square footage of the apartment and one which did, in order to see how the number of baths affects rent price when comparing apartments of the same size.

Given your question, what is your expectation about the data?

I expect that more bathrooms will be associated with higher rent prices even after accounting for differences in the overall size of the apartments.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# Read in with tidytuesdayR package 
#install.packages("tidytuesdayR") I commented this line out after it was installed.
tuesdata <- tidytuesdayR::tt_load('2022-07-05')
## --- Compiling #TidyTuesday Information for 2022-07-05 ----
## --- There are 3 files available ---
## --- Starting Download ---
## 
##  Downloading file 1 of 3: `rent.csv`
##  Downloading file 2 of 3: `sf_permits.csv`
##  Downloading file 3 of 3: `new_construction.csv`
## --- Download complete ---
rent <- tuesdata$rent

glimpse(rent)
## Rows: 200,796
## Columns: 17
## $ post_id     <chr> "pre2013_134138", "pre2013_135669", "pre2013_127127", "pre…
## $ date        <dbl> 20050111, 20050126, 20041017, 20120601, 20041021, 20060411…
## $ year        <dbl> 2005, 2005, 2004, 2012, 2004, 2006, 2007, 2017, 2009, 2006…
## $ nhood       <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ city        <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ county      <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ price       <dbl> 1250, 1295, 1100, 1425, 890, 825, 1500, 2925, 450, 1395, 1…
## $ beds        <dbl> 2, 2, 2, 1, 1, 1, 1, 3, NA, 2, 2, 5, 4, 0, 4, 1, 3, 3, 1, …
## $ baths       <dbl> 2, NA, NA, NA, NA, NA, 1, NA, 1, NA, NA, NA, 3, NA, NA, NA…
## $ sqft        <dbl> NA, NA, NA, 735, NA, NA, NA, NA, NA, NA, NA, 2581, 1756, N…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 37.53494, …
## $ lon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ title       <chr> "$1250 / 2br - 2BR/2BA   1145 ALAMEDA DE LAS PULGAS", "$12…
## $ descr       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ details     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "<p class=…
colnames(rent)
##  [1] "post_id"     "date"        "year"        "nhood"       "city"       
##  [6] "county"      "price"       "beds"        "baths"       "sqft"       
## [11] "room_in_apt" "address"     "lat"         "lon"         "title"      
## [16] "descr"       "details"
rent %>% tabyl(baths) #158121 NAs for baths -> need to remove these
##  baths      n      percent valid_percent
##    1.0  17644 8.787028e-02  4.134505e-01
##    1.5   2710 1.349628e-02  6.350322e-02
##    2.0  16172 8.053945e-02  3.789572e-01
##    2.5   3309 1.647941e-02  7.753954e-02
##    3.0   2016 1.004004e-02  4.724077e-02
##    3.5    386 1.922349e-03  9.045108e-03
##    4.0    258 1.284886e-03  6.045694e-03
##    4.5     73 3.635531e-04  1.710603e-03
##    5.0     71 3.535927e-04  1.663738e-03
##    5.5     17 8.466304e-05  3.983597e-04
##    6.0     11 5.478197e-05  2.577622e-04
##    6.5      4 1.992072e-05  9.373169e-05
##    7.0      2 9.960358e-06  4.686585e-05
##    7.5      1 4.980179e-06  2.343292e-05
##    8.0      1 4.980179e-06  2.343292e-05
##     NA 158121 7.874709e-01            NA
sum(is.na(rent$sqft)) #136117 NAs for sqft
## [1] 136117
sum(is.na(rent$price)) #No NAs for price
## [1] 0

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

I need to remove all the observations that have missing values for either baths or sqft.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

rent2 <- rent
#Remove obs with NAs for baths or sqft and save as rent2
rent2 <- rent %>% drop_na(baths)%>% drop_na(sqft)

#Baths
rent2 %>% tabyl(baths) #currently listed as a continuous variable, but should be categorical with values from 1.0 to 8.0 by 0.5 increments
##  baths    n      percent
##    1.0 5496 3.749233e-01
##    1.5  778 5.307320e-02
##    2.0 5654 3.857016e-01
##    2.5 1398 9.536803e-02
##    3.0  914 6.235077e-02
##    3.5  219 1.493963e-02
##    4.0  108 7.367488e-03
##    4.5   36 2.455829e-03
##    5.0   39 2.660482e-03
##    5.5    7 4.775223e-04
##    6.0    4 2.728699e-04
##    6.5    3 2.046524e-04
##    7.0    2 1.364350e-04
##    8.0    1 6.821748e-05
rent2 <- rent2 %>% mutate(baths=factor(baths))

typeof(rent2$baths) #was double -> now an integer
## [1] "integer"
levels(rent2$baths) #levels are correct
##  [1] "1"   "1.5" "2"   "2.5" "3"   "3.5" "4"   "4.5" "5"   "5.5" "6"   "6.5"
## [13] "7"   "8"
#Neighborhood, city, and county should capitalized, but they're not, so I'll use the str_to_title function to capitalize the data and across() to select these 3 columns.
rent2 <- rent2 %>%
    mutate(
      across(.cols = c("nhood", "city", "county"),
             .fns = str_to_title)
      ) %>%
  glimpse()
## Rows: 14,659
## Columns: 17
## $ post_id     <chr> "4168358289", "pre2013_59350", "pre2013_72024", "pre2013_6…
## $ date        <dbl> 20131103, 20120318, 20120729, 20120402, 20161009, 20120314…
## $ year        <dbl> 2013, 2012, 2012, 2012, 2016, 2012, 2012, 2013, 2016, 2014…
## $ nhood       <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ city        <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ county      <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ price       <dbl> 2595, 1375, 1950, 1640, 3100, 1000, 1555, 2150, 2100, 4200…
## $ beds        <dbl> 4, 2, 3, 2, 2, 2, 3, 3, 1, 3, 3, 2, 2, 2, 2, 2, 4, 2, 2, 1…
## $ baths       <fct> 3, 1, 2, 1.5, 1, 1, 2.5, 3, 1, 2.5, 2.5, 2.5, 1, 1, 1, 1, …
## $ sqft        <dbl> 1756, 700, 1400, 895, 1200, 1185, 1626, 1014, 752, 2205, 1…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat         <dbl> 37.53494, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -122.2…
## $ title       <chr> "Nov  2 Newly remodeled 4br/3ba (2 mastersuites) in quiet …
## $ descr       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Summe…
## $ details     <chr> "<p class=\"row\" data-latitude=\"37.534936\" data-longitu…

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(rent2) #baths is fct now in rent2 whereas it was dbl in rent
## Rows: 14,659
## Columns: 17
## $ post_id     <chr> "4168358289", "pre2013_59350", "pre2013_72024", "pre2013_6…
## $ date        <dbl> 20131103, 20120318, 20120729, 20120402, 20161009, 20120314…
## $ year        <dbl> 2013, 2012, 2012, 2012, 2016, 2012, 2012, 2013, 2016, 2014…
## $ nhood       <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ city        <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ county      <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ price       <dbl> 2595, 1375, 1950, 1640, 3100, 1000, 1555, 2150, 2100, 4200…
## $ beds        <dbl> 4, 2, 3, 2, 2, 2, 3, 3, 1, 3, 3, 2, 2, 2, 2, 2, 4, 2, 2, 1…
## $ baths       <fct> 3, 1, 2, 1.5, 1, 1, 2.5, 3, 1, 2.5, 2.5, 2.5, 1, 1, 1, 1, …
## $ sqft        <dbl> 1756, 700, 1400, 895, 1200, 1185, 1626, 1014, 752, 2205, 1…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat         <dbl> 37.53494, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -122.2…
## $ title       <chr> "Nov  2 Newly remodeled 4br/3ba (2 mastersuites) in quiet …
## $ descr       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Summe…
## $ details     <chr> "<p class=\"row\" data-latitude=\"37.534936\" data-longitu…
rent2 %>% skim() #confirms that all the NAs in baths and sqft have been dropped
Data summary
Name Piped data
Number of rows 14659
Number of columns 17
_______________________
Column type frequency:
character 8
factor 1
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
post_id 0 1.00 10 14 0 14659 0
nhood 0 1.00 4 38 0 146 0
city 0 1.00 5 14 0 88 0
county 103 0.99 4 13 0 10 0
address 12872 0.12 1 38 0 1389 0
title 134 0.99 6 298 0 14203 0
descr 12778 0.13 44 8880 0 1739 0
details 12064 0.18 11 593 0 2436 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
baths 0 1 FALSE 14 2: 5654, 1: 5496, 2.5: 1398, 3: 914

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
date 0 1.00 20136779.10 25138.89 20010625.00 20120216.00 20140824.00 20160126.00 20180717.00 ▁▁▂▇▇
year 0 1.00 2013.62 2.52 2001.00 2012.00 2014.00 2016.00 2018.00 ▁▁▂▇▇
price 0 1.00 2784.55 1583.52 350.00 1850.00 2450.00 3250.00 24000.00 ▇▁▁▁▁
beds 30 1.00 2.32 1.03 0.00 2.00 2.00 3.00 8.00 ▂▇▁▁▁
sqft 0 1.00 1272.56 697.82 110.00 885.50 1100.00 1500.00 22000.00 ▇▁▁▁▁
room_in_apt 0 1.00 0.00 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
lat 12143 0.17 37.63 0.34 33.79 37.36 37.65 37.79 39.07 ▁▁▁▇▁
lon 12661 0.14 -122.16 0.26 -123.20 -122.40 -122.10 -121.95 -121.21 ▁▅▇▇▁
head(rent2) #Shows that the values are what I expected based on possible values for each of the variables
## # A tibble: 6 × 17
##   post_id          date  year nhood city  county price  beds baths  sqft room_…¹
##   <chr>           <dbl> <dbl> <chr> <chr> <chr>  <dbl> <dbl> <fct> <dbl>   <dbl>
## 1 4168358289     2.01e7  2013 Alam… Alam… Alame…  2595     4 3      1756       0
## 2 pre2013_59350  2.01e7  2012 Alam… Alam… Alame…  1375     2 1       700       0
## 3 pre2013_72024  2.01e7  2012 Alam… Alam… Alame…  1950     3 2      1400       0
## 4 pre2013_64956  2.01e7  2012 Alam… Alam… Alame…  1640     2 1.5     895       0
## 5 5819631756     2.02e7  2016 Alam… Alam… Alame…  3100     2 1      1200       0
## 6 pre2013_24201  2.01e7  2012 Alam… Alam… Alame…  1000     2 1      1185       0
## # … with 6 more variables: address <chr>, lat <dbl>, lon <dbl>, title <chr>,
## #   descr <chr>, details <chr>, and abbreviated variable name ¹​room_in_apt

Are the values what you expected for the variables? Why or Why not?

Yes, because they’re all plausible values for the the kinds of variables I have.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#Price
#Check out how the mean price is different based on # of baths
rent2 %>% group_by(baths) %>% summarise(n=n(), mean_price=mean(price))
## # A tibble: 14 × 3
##    baths     n mean_price
##    <fct> <int>      <dbl>
##  1 1      5496      2259.
##  2 1.5     778      2480.
##  3 2      5654      2861.
##  4 2.5    1398      3168.
##  5 3       914      3764.
##  6 3.5     219      5277.
##  7 4       108      5464.
##  8 4.5      36      6828.
##  9 5        39      5338.
## 10 5.5       7     13557.
## 11 6         4      9188.
## 12 6.5       3     12500 
## 13 7         2     20000 
## 14 8         1     20950
#This shows that mean price goes up with number of baths

#Sq Ft
summary(rent2$sqft)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   110.0   885.5  1100.0  1272.6  1500.0 22000.0
rent2 %>% group_by(baths) %>% summarise(n=n(), mean_sqft=mean(sqft))
## # A tibble: 14 × 3
##    baths     n mean_sqft
##    <fct> <int>     <dbl>
##  1 1      5496      882.
##  2 1.5     778     1120.
##  3 2      5654     1289.
##  4 2.5    1398     1739.
##  5 3       914     2179.
##  6 3.5     219     2554.
##  7 4       108     3028.
##  8 4.5      36     3473.
##  9 5        39     3370.
## 10 5.5       7     4906.
## 11 6         4     4082.
## 12 6.5       3     5130 
## 13 7         2     6600 
## 14 8         1     3400
#This shows that mean sq footage goes up with number of baths for the most part, though the apt with 8 bathrooms does not follow this trend bc it's an odd outlier. I mean, who would put 8 bathrooms in a 3400 sq ft apt??

What are your findings about the summary? Are they what you expected?

Yes, they are what I expected because as number of baths goes up, price goes up.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ggplot(rent2, aes(baths, price))+
  geom_boxplot()+
  labs(x="Number of Bathrooms", y="Rental Price per Month ($)", title = "Rental Price vs Number of Bathrooms in San Francisco Apartments")+ theme_bw()

ggplot(rent2, aes(sqft, price))+
  geom_point()+
  labs(x="Size (Sq ft)", y="Rental Price per Month ($)", title = "Rental Price vs Size of San Francisco Apartments")+ theme_dark()

library(gtsummary)
model1 <- lm(price~baths, data = rent2)
model2 <- lm(price~baths + sqft, data = rent2)
tbl_regression(model1)
Characteristic Beta 95% CI1 p-value
baths
    1
    1.5 221 115, 326 <0.001
    2 601 549, 653 <0.001
    2.5 909 826, 991 <0.001
    3 1,504 1,406, 1,603 <0.001
    3.5 3,018 2,828, 3,207 <0.001
    4 3,205 2,937, 3,472 <0.001
    4.5 4,569 4,108, 5,029 <0.001
    5 3,079 2,637, 3,522 <0.001
    5.5 11,298 10,257, 12,339 <0.001
    6 6,928 5,551, 8,305 <0.001
    6.5 10,241 8,651, 11,831 <0.001
    7 17,741 15,794, 19,688 <0.001
    8 18,691 15,937, 21,444 <0.001
1 CI = Confidence Interval
tbl_regression(model2)
Characteristic Beta 95% CI1 p-value
baths
    1
    1.5 25 -76, 126 0.6
    2 266 214, 319 <0.001
    2.5 205 119, 291 <0.001
    3 439 331, 548 <0.001
    3.5 1,645 1,451, 1,839 <0.001
    4 1,443 1,173, 1,713 <0.001
    4.5 2,442 1,990, 2,893 <0.001
    5 1,036 602, 1,470 <0.001
    5.5 7,994 6,988, 9,000 <0.001
    6 4,301 2,982, 5,619 <0.001
    6.5 6,753 5,228, 8,278 <0.001
    7 13,046 11,176, 14,917 <0.001
    8 16,623 13,998, 19,249 <0.001
sqft 0.82 0.78, 0.86 <0.001
1 CI = Confidence Interval

Final Summary (10 points)

Summarize your research question and findings below. Research question: How does the number of bathrooms affect rent price in San Francisco?

Findings: Rent price increases as the number of bathrooms increases. Rent price also increases as the size of the apt increases.

Based on my regression models, even when apt size is controlled for, the number of bathrooms is still significantly associated with higher rent prices. When sq footage is controlled for, the association between number of baths and rent price is slightly weaker, but number of baths has a much stronger association with rent price than does sq footage.

Are your findings what you expected? Why or Why not? Yes - bathrooms are expensive to construct and highly coveted, so it makes sense they would increase the rental value.