Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
My research question is: How much does the addition of a bathroom cause the rent price to increase on an apartment in San Francisco? I think real estate pricing is interesting and I run an Airbnb, so I like to know what features increase the value of apartments.
I plan to examine this by plotting out # of bathrooms on the x axis and rent price on the y axis. I also ran a linear regression model with number of bathrooms as the predictor variable and rent price as the outcome variable. I created two models, one which did not control for the square footage of the apartment and one which did, in order to see how the number of baths affects rent price when comparing apartments of the same size.
Given your question, what is your expectation about the data?
I expect that more bathrooms will be associated with higher rent prices even after accounting for differences in the overall size of the apartments.
Load the data below and use
dplyr::glimpse()orskimr::skim()on the data. You should upload the data file into thedatadirectory.
# Read in with tidytuesdayR package
#install.packages("tidytuesdayR") I commented this line out after it was installed.
tuesdata <- tidytuesdayR::tt_load('2022-07-05')
## --- Compiling #TidyTuesday Information for 2022-07-05 ----
## --- There are 3 files available ---
## --- Starting Download ---
##
## Downloading file 1 of 3: `rent.csv`
## Downloading file 2 of 3: `sf_permits.csv`
## Downloading file 3 of 3: `new_construction.csv`
## --- Download complete ---
rent <- tuesdata$rent
glimpse(rent)
## Rows: 200,796
## Columns: 17
## $ post_id <chr> "pre2013_134138", "pre2013_135669", "pre2013_127127", "pre…
## $ date <dbl> 20050111, 20050126, 20041017, 20120601, 20041021, 20060411…
## $ year <dbl> 2005, 2005, 2004, 2012, 2004, 2006, 2007, 2017, 2009, 2006…
## $ nhood <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ city <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ county <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ price <dbl> 1250, 1295, 1100, 1425, 890, 825, 1500, 2925, 450, 1395, 1…
## $ beds <dbl> 2, 2, 2, 1, 1, 1, 1, 3, NA, 2, 2, 5, 4, 0, 4, 1, 3, 3, 1, …
## $ baths <dbl> 2, NA, NA, NA, NA, NA, 1, NA, 1, NA, NA, NA, 3, NA, NA, NA…
## $ sqft <dbl> NA, NA, NA, 735, NA, NA, NA, NA, NA, NA, NA, 2581, 1756, N…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 37.53494, …
## $ lon <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ title <chr> "$1250 / 2br - 2BR/2BA 1145 ALAMEDA DE LAS PULGAS", "$12…
## $ descr <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ details <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "<p class=…
colnames(rent)
## [1] "post_id" "date" "year" "nhood" "city"
## [6] "county" "price" "beds" "baths" "sqft"
## [11] "room_in_apt" "address" "lat" "lon" "title"
## [16] "descr" "details"
rent %>% tabyl(baths) #158121 NAs for baths -> need to remove these
## baths n percent valid_percent
## 1.0 17644 8.787028e-02 4.134505e-01
## 1.5 2710 1.349628e-02 6.350322e-02
## 2.0 16172 8.053945e-02 3.789572e-01
## 2.5 3309 1.647941e-02 7.753954e-02
## 3.0 2016 1.004004e-02 4.724077e-02
## 3.5 386 1.922349e-03 9.045108e-03
## 4.0 258 1.284886e-03 6.045694e-03
## 4.5 73 3.635531e-04 1.710603e-03
## 5.0 71 3.535927e-04 1.663738e-03
## 5.5 17 8.466304e-05 3.983597e-04
## 6.0 11 5.478197e-05 2.577622e-04
## 6.5 4 1.992072e-05 9.373169e-05
## 7.0 2 9.960358e-06 4.686585e-05
## 7.5 1 4.980179e-06 2.343292e-05
## 8.0 1 4.980179e-06 2.343292e-05
## NA 158121 7.874709e-01 NA
sum(is.na(rent$sqft)) #136117 NAs for sqft
## [1] 136117
sum(is.na(rent$price)) #No NAs for price
## [1] 0
If there are any quirks that you have to deal with
NAcoded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
I need to remove all the observations that have missing values for either baths or sqft.
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when(), etc.
rent2 <- rent
#Remove obs with NAs for baths or sqft and save as rent2
rent2 <- rent %>% drop_na(baths)%>% drop_na(sqft)
#Baths
rent2 %>% tabyl(baths) #currently listed as a continuous variable, but should be categorical with values from 1.0 to 8.0 by 0.5 increments
## baths n percent
## 1.0 5496 3.749233e-01
## 1.5 778 5.307320e-02
## 2.0 5654 3.857016e-01
## 2.5 1398 9.536803e-02
## 3.0 914 6.235077e-02
## 3.5 219 1.493963e-02
## 4.0 108 7.367488e-03
## 4.5 36 2.455829e-03
## 5.0 39 2.660482e-03
## 5.5 7 4.775223e-04
## 6.0 4 2.728699e-04
## 6.5 3 2.046524e-04
## 7.0 2 1.364350e-04
## 8.0 1 6.821748e-05
rent2 <- rent2 %>% mutate(baths=factor(baths))
typeof(rent2$baths) #was double -> now an integer
## [1] "integer"
levels(rent2$baths) #levels are correct
## [1] "1" "1.5" "2" "2.5" "3" "3.5" "4" "4.5" "5" "5.5" "6" "6.5"
## [13] "7" "8"
#Neighborhood, city, and county should capitalized, but they're not, so I'll use the str_to_title function to capitalize the data and across() to select these 3 columns.
rent2 <- rent2 %>%
mutate(
across(.cols = c("nhood", "city", "county"),
.fns = str_to_title)
) %>%
glimpse()
## Rows: 14,659
## Columns: 17
## $ post_id <chr> "4168358289", "pre2013_59350", "pre2013_72024", "pre2013_6…
## $ date <dbl> 20131103, 20120318, 20120729, 20120402, 20161009, 20120314…
## $ year <dbl> 2013, 2012, 2012, 2012, 2016, 2012, 2012, 2013, 2016, 2014…
## $ nhood <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ city <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ county <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ price <dbl> 2595, 1375, 1950, 1640, 3100, 1000, 1555, 2150, 2100, 4200…
## $ beds <dbl> 4, 2, 3, 2, 2, 2, 3, 3, 1, 3, 3, 2, 2, 2, 2, 2, 4, 2, 2, 1…
## $ baths <fct> 3, 1, 2, 1.5, 1, 1, 2.5, 3, 1, 2.5, 2.5, 2.5, 1, 1, 1, 1, …
## $ sqft <dbl> 1756, 700, 1400, 895, 1200, 1185, 1626, 1014, 752, 2205, 1…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat <dbl> 37.53494, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lon <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -122.2…
## $ title <chr> "Nov 2 Newly remodeled 4br/3ba (2 mastersuites) in quiet …
## $ descr <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Summe…
## $ details <chr> "<p class=\"row\" data-latitude=\"37.534936\" data-longitu…
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join,inner_join, orright_joinon these tables. No credit will be provided if you don’t.
Show your transformed table here. Use tools such as
glimpse(),skim()orhead()to illustrate your point.
glimpse(rent2) #baths is fct now in rent2 whereas it was dbl in rent
## Rows: 14,659
## Columns: 17
## $ post_id <chr> "4168358289", "pre2013_59350", "pre2013_72024", "pre2013_6…
## $ date <dbl> 20131103, 20120318, 20120729, 20120402, 20161009, 20120314…
## $ year <dbl> 2013, 2012, 2012, 2012, 2016, 2012, 2012, 2013, 2016, 2014…
## $ nhood <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ city <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ county <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ price <dbl> 2595, 1375, 1950, 1640, 3100, 1000, 1555, 2150, 2100, 4200…
## $ beds <dbl> 4, 2, 3, 2, 2, 2, 3, 3, 1, 3, 3, 2, 2, 2, 2, 2, 4, 2, 2, 1…
## $ baths <fct> 3, 1, 2, 1.5, 1, 1, 2.5, 3, 1, 2.5, 2.5, 2.5, 1, 1, 1, 1, …
## $ sqft <dbl> 1756, 700, 1400, 895, 1200, 1185, 1626, 1014, 752, 2205, 1…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat <dbl> 37.53494, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lon <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -122.2…
## $ title <chr> "Nov 2 Newly remodeled 4br/3ba (2 mastersuites) in quiet …
## $ descr <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Summe…
## $ details <chr> "<p class=\"row\" data-latitude=\"37.534936\" data-longitu…
rent2 %>% skim() #confirms that all the NAs in baths and sqft have been dropped
| Name | Piped data |
| Number of rows | 14659 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| factor | 1 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| post_id | 0 | 1.00 | 10 | 14 | 0 | 14659 | 0 |
| nhood | 0 | 1.00 | 4 | 38 | 0 | 146 | 0 |
| city | 0 | 1.00 | 5 | 14 | 0 | 88 | 0 |
| county | 103 | 0.99 | 4 | 13 | 0 | 10 | 0 |
| address | 12872 | 0.12 | 1 | 38 | 0 | 1389 | 0 |
| title | 134 | 0.99 | 6 | 298 | 0 | 14203 | 0 |
| descr | 12778 | 0.13 | 44 | 8880 | 0 | 1739 | 0 |
| details | 12064 | 0.18 | 11 | 593 | 0 | 2436 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| baths | 0 | 1 | FALSE | 14 | 2: 5654, 1: 5496, 2.5: 1398, 3: 914 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| date | 0 | 1.00 | 20136779.10 | 25138.89 | 20010625.00 | 20120216.00 | 20140824.00 | 20160126.00 | 20180717.00 | ▁▁▂▇▇ |
| year | 0 | 1.00 | 2013.62 | 2.52 | 2001.00 | 2012.00 | 2014.00 | 2016.00 | 2018.00 | ▁▁▂▇▇ |
| price | 0 | 1.00 | 2784.55 | 1583.52 | 350.00 | 1850.00 | 2450.00 | 3250.00 | 24000.00 | ▇▁▁▁▁ |
| beds | 30 | 1.00 | 2.32 | 1.03 | 0.00 | 2.00 | 2.00 | 3.00 | 8.00 | ▂▇▁▁▁ |
| sqft | 0 | 1.00 | 1272.56 | 697.82 | 110.00 | 885.50 | 1100.00 | 1500.00 | 22000.00 | ▇▁▁▁▁ |
| room_in_apt | 0 | 1.00 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| lat | 12143 | 0.17 | 37.63 | 0.34 | 33.79 | 37.36 | 37.65 | 37.79 | 39.07 | ▁▁▁▇▁ |
| lon | 12661 | 0.14 | -122.16 | 0.26 | -123.20 | -122.40 | -122.10 | -121.95 | -121.21 | ▁▅▇▇▁ |
head(rent2) #Shows that the values are what I expected based on possible values for each of the variables
## # A tibble: 6 × 17
## post_id date year nhood city county price beds baths sqft room_…¹
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 4168358289 2.01e7 2013 Alam… Alam… Alame… 2595 4 3 1756 0
## 2 pre2013_59350 2.01e7 2012 Alam… Alam… Alame… 1375 2 1 700 0
## 3 pre2013_72024 2.01e7 2012 Alam… Alam… Alame… 1950 3 2 1400 0
## 4 pre2013_64956 2.01e7 2012 Alam… Alam… Alame… 1640 2 1.5 895 0
## 5 5819631756 2.02e7 2016 Alam… Alam… Alame… 3100 2 1 1200 0
## 6 pre2013_24201 2.01e7 2012 Alam… Alam… Alame… 1000 2 1 1185 0
## # … with 6 more variables: address <chr>, lat <dbl>, lon <dbl>, title <chr>,
## # descr <chr>, details <chr>, and abbreviated variable name ¹room_in_apt
Are the values what you expected for the variables? Why or Why not?
Yes, because they’re all plausible values for the the kinds of variables I have.
Use
group_by()andsummarize()to make a summary of the data here. The summary should be relevant to your research question
#Price
#Check out how the mean price is different based on # of baths
rent2 %>% group_by(baths) %>% summarise(n=n(), mean_price=mean(price))
## # A tibble: 14 × 3
## baths n mean_price
## <fct> <int> <dbl>
## 1 1 5496 2259.
## 2 1.5 778 2480.
## 3 2 5654 2861.
## 4 2.5 1398 3168.
## 5 3 914 3764.
## 6 3.5 219 5277.
## 7 4 108 5464.
## 8 4.5 36 6828.
## 9 5 39 5338.
## 10 5.5 7 13557.
## 11 6 4 9188.
## 12 6.5 3 12500
## 13 7 2 20000
## 14 8 1 20950
#This shows that mean price goes up with number of baths
#Sq Ft
summary(rent2$sqft)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 110.0 885.5 1100.0 1272.6 1500.0 22000.0
rent2 %>% group_by(baths) %>% summarise(n=n(), mean_sqft=mean(sqft))
## # A tibble: 14 × 3
## baths n mean_sqft
## <fct> <int> <dbl>
## 1 1 5496 882.
## 2 1.5 778 1120.
## 3 2 5654 1289.
## 4 2.5 1398 1739.
## 5 3 914 2179.
## 6 3.5 219 2554.
## 7 4 108 3028.
## 8 4.5 36 3473.
## 9 5 39 3370.
## 10 5.5 7 4906.
## 11 6 4 4082.
## 12 6.5 3 5130
## 13 7 2 6600
## 14 8 1 3400
#This shows that mean sq footage goes up with number of baths for the most part, though the apt with 8 bathrooms does not follow this trend bc it's an odd outlier. I mean, who would put 8 bathrooms in a 3400 sq ft apt??
What are your findings about the summary? Are they what you expected?
Yes, they are what I expected because as number of baths goes up, price goes up.
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
ggplot(rent2, aes(baths, price))+
geom_boxplot()+
labs(x="Number of Bathrooms", y="Rental Price per Month ($)", title = "Rental Price vs Number of Bathrooms in San Francisco Apartments")+ theme_bw()
ggplot(rent2, aes(sqft, price))+
geom_point()+
labs(x="Size (Sq ft)", y="Rental Price per Month ($)", title = "Rental Price vs Size of San Francisco Apartments")+ theme_dark()
library(gtsummary)
model1 <- lm(price~baths, data = rent2)
model2 <- lm(price~baths + sqft, data = rent2)
tbl_regression(model1)
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| baths | |||
| 1 | — | — | |
| 1.5 | 221 | 115, 326 | <0.001 |
| 2 | 601 | 549, 653 | <0.001 |
| 2.5 | 909 | 826, 991 | <0.001 |
| 3 | 1,504 | 1,406, 1,603 | <0.001 |
| 3.5 | 3,018 | 2,828, 3,207 | <0.001 |
| 4 | 3,205 | 2,937, 3,472 | <0.001 |
| 4.5 | 4,569 | 4,108, 5,029 | <0.001 |
| 5 | 3,079 | 2,637, 3,522 | <0.001 |
| 5.5 | 11,298 | 10,257, 12,339 | <0.001 |
| 6 | 6,928 | 5,551, 8,305 | <0.001 |
| 6.5 | 10,241 | 8,651, 11,831 | <0.001 |
| 7 | 17,741 | 15,794, 19,688 | <0.001 |
| 8 | 18,691 | 15,937, 21,444 | <0.001 |
| 1 CI = Confidence Interval | |||
tbl_regression(model2)
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| baths | |||
| 1 | — | — | |
| 1.5 | 25 | -76, 126 | 0.6 |
| 2 | 266 | 214, 319 | <0.001 |
| 2.5 | 205 | 119, 291 | <0.001 |
| 3 | 439 | 331, 548 | <0.001 |
| 3.5 | 1,645 | 1,451, 1,839 | <0.001 |
| 4 | 1,443 | 1,173, 1,713 | <0.001 |
| 4.5 | 2,442 | 1,990, 2,893 | <0.001 |
| 5 | 1,036 | 602, 1,470 | <0.001 |
| 5.5 | 7,994 | 6,988, 9,000 | <0.001 |
| 6 | 4,301 | 2,982, 5,619 | <0.001 |
| 6.5 | 6,753 | 5,228, 8,278 | <0.001 |
| 7 | 13,046 | 11,176, 14,917 | <0.001 |
| 8 | 16,623 | 13,998, 19,249 | <0.001 |
| sqft | 0.82 | 0.78, 0.86 | <0.001 |
| 1 CI = Confidence Interval | |||
Summarize your research question and findings below. Research question: How does the number of bathrooms affect rent price in San Francisco?
Findings: Rent price increases as the number of bathrooms increases. Rent price also increases as the size of the apt increases.
Based on my regression models, even when apt size is controlled for, the number of bathrooms is still significantly associated with higher rent prices. When sq footage is controlled for, the association between number of baths and rent price is slightly weaker, but number of baths has a much stronger association with rent price than does sq footage.
Are your findings what you expected? Why or Why not? Yes - bathrooms are expensive to construct and highly coveted, so it makes sense they would increase the rental value.