Midterm

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

My research question is: How much does the addition of a bathroom cause the rent price to increase on an apartment in San Francisco? I think real estate pricing is interesting and I run an Airbnb, so I like to know what features increase the value of apartments.

I plan to examine this by plotting out # of bathrooms on the x axis and rent price on the y axis. I also ran a linear regression model with number of bathrooms as the predictor variable and rent price as the outcome variable. I created two models, one which did not control for the square footage of the apartment and one which did, in order to see how the number of baths affects rent price when comparing apartments of the same size.

Given your question, what is your expectation about the data?

I expect that more bathrooms will be associated with higher rent prices even after accounting for differences in the overall size of the apartments.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# Read in with tidytuesdayR package 
#install.packages("tidytuesdayR") I commented this line out after it was installed.
tuesdata <- tidytuesdayR::tt_load('2022-07-05')

## --- Compiling #TidyTuesday Information for 2022-07-05 ----

## --- There are 3 files available ---

## --- Starting Download ---

## 
##  Downloading file 1 of 3: `rent.csv`
##  Downloading file 2 of 3: `sf_permits.csv`
##  Downloading file 3 of 3: `new_construction.csv`

## --- Download complete ---

rent <- tuesdata$rent

glimpse(rent)

## Rows: 200,796
## Columns: 17
## $ post_id     <chr> "pre2013_134138", "pre2013_135669", "pre2013_127127", "pre…
## $ date        <dbl> 20050111, 20050126, 20041017, 20120601, 20041021, 20060411…
## $ year        <dbl> 2005, 2005, 2004, 2012, 2004, 2006, 2007, 2017, 2009, 2006…
## $ nhood       <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ city        <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ county      <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ price       <dbl> 1250, 1295, 1100, 1425, 890, 825, 1500, 2925, 450, 1395, 1…
## $ beds        <dbl> 2, 2, 2, 1, 1, 1, 1, 3, NA, 2, 2, 5, 4, 0, 4, 1, 3, 3, 1, …
## $ baths       <dbl> 2, NA, NA, NA, NA, NA, 1, NA, 1, NA, NA, NA, 3, NA, NA, NA…
## $ sqft        <dbl> NA, NA, NA, 735, NA, NA, NA, NA, NA, NA, NA, 2581, 1756, N…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 37.53494, …
## $ lon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ title       <chr> "$1250 / 2br - 2BR/2BA   1145 ALAMEDA DE LAS PULGAS", "$12…
## $ descr       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ details     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "<p class=…

colnames(rent)

##  [1] "post_id"     "date"        "year"        "nhood"       "city"       
##  [6] "county"      "price"       "beds"        "baths"       "sqft"       
## [11] "room_in_apt" "address"     "lat"         "lon"         "title"      
## [16] "descr"       "details"

rent %>% tabyl(baths) #158121 NAs for baths -> need to remove these

##  baths      n      percent valid_percent
##    1.0  17644 8.787028e-02  4.134505e-01
##    1.5   2710 1.349628e-02  6.350322e-02
##    2.0  16172 8.053945e-02  3.789572e-01
##    2.5   3309 1.647941e-02  7.753954e-02
##    3.0   2016 1.004004e-02  4.724077e-02
##    3.5    386 1.922349e-03  9.045108e-03
##    4.0    258 1.284886e-03  6.045694e-03
##    4.5     73 3.635531e-04  1.710603e-03
##    5.0     71 3.535927e-04  1.663738e-03
##    5.5     17 8.466304e-05  3.983597e-04
##    6.0     11 5.478197e-05  2.577622e-04
##    6.5      4 1.992072e-05  9.373169e-05
##    7.0      2 9.960358e-06  4.686585e-05
##    7.5      1 4.980179e-06  2.343292e-05
##    8.0      1 4.980179e-06  2.343292e-05
##     NA 158121 7.874709e-01            NA

sum(is.na(rent$sqft)) #136117 NAs for sqft

## [1] 136117

sum(is.na(rent$price)) #No NAs for price

## [1] 0

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

I need to remove all the observations that have missing values for either baths or sqft.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

rent2 <- rent
#Remove obs with NAs for baths or sqft and save as rent2
rent2 <- rent %>% drop_na(baths)%>% drop_na(sqft)

#Baths
rent2 %>% tabyl(baths) #currently listed as a continuous variable, but should be categorical with values from 1.0 to 8.0 by 0.5 increments

##  baths    n      percent
##    1.0 5496 3.749233e-01
##    1.5  778 5.307320e-02
##    2.0 5654 3.857016e-01
##    2.5 1398 9.536803e-02
##    3.0  914 6.235077e-02
##    3.5  219 1.493963e-02
##    4.0  108 7.367488e-03
##    4.5   36 2.455829e-03
##    5.0   39 2.660482e-03
##    5.5    7 4.775223e-04
##    6.0    4 2.728699e-04
##    6.5    3 2.046524e-04
##    7.0    2 1.364350e-04
##    8.0    1 6.821748e-05

rent2 <- rent2 %>% mutate(baths=factor(baths))

typeof(rent2$baths) #was double -> now an integer

## [1] "integer"

levels(rent2$baths) #levels are correct

##  [1] "1"   "1.5" "2"   "2.5" "3"   "3.5" "4"   "4.5" "5"   "5.5" "6"   "6.5"
## [13] "7"   "8"

#Neighborhood, city, and county should capitalized, but they're not, so I'll use the str_to_title function to capitalize the data and across() to select these 3 columns.
rent2 <- rent2 %>%
    mutate(
      across(.cols = c("nhood", "city", "county"),
             .fns = str_to_title)
      ) %>%
  glimpse()

## Rows: 14,659
## Columns: 17
## $ post_id     <chr> "4168358289", "pre2013_59350", "pre2013_72024", "pre2013_6…
## $ date        <dbl> 20131103, 20120318, 20120729, 20120402, 20161009, 20120314…
## $ year        <dbl> 2013, 2012, 2012, 2012, 2016, 2012, 2012, 2013, 2016, 2014…
## $ nhood       <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ city        <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ county      <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ price       <dbl> 2595, 1375, 1950, 1640, 3100, 1000, 1555, 2150, 2100, 4200…
## $ beds        <dbl> 4, 2, 3, 2, 2, 2, 3, 3, 1, 3, 3, 2, 2, 2, 2, 2, 4, 2, 2, 1…
## $ baths       <fct> 3, 1, 2, 1.5, 1, 1, 2.5, 3, 1, 2.5, 2.5, 2.5, 1, 1, 1, 1, …
## $ sqft        <dbl> 1756, 700, 1400, 895, 1200, 1185, 1626, 1014, 752, 2205, 1…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat         <dbl> 37.53494, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -122.2…
## $ title       <chr> "Nov  2 Newly remodeled 4br/3ba (2 mastersuites) in quiet …
## $ descr       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Summe…
## $ details     <chr> "<p class=\"row\" data-latitude=\"37.534936\" data-longitu…

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(rent2) #baths is fct now in rent2 whereas it was dbl in rent

## Rows: 14,659
## Columns: 17
## $ post_id     <chr> "4168358289", "pre2013_59350", "pre2013_72024", "pre2013_6…
## $ date        <dbl> 20131103, 20120318, 20120729, 20120402, 20161009, 20120314…
## $ year        <dbl> 2013, 2012, 2012, 2012, 2016, 2012, 2012, 2013, 2016, 2014…
## $ nhood       <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ city        <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ county      <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Al…
## $ price       <dbl> 2595, 1375, 1950, 1640, 3100, 1000, 1555, 2150, 2100, 4200…
## $ beds        <dbl> 4, 2, 3, 2, 2, 2, 3, 3, 1, 3, 3, 2, 2, 2, 2, 2, 4, 2, 2, 1…
## $ baths       <fct> 3, 1, 2, 1.5, 1, 1, 2.5, 3, 1, 2.5, 2.5, 2.5, 1, 1, 1, 1, …
## $ sqft        <dbl> 1756, 700, 1400, 895, 1200, 1185, 1626, 1014, 752, 2205, 1…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat         <dbl> 37.53494, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -122.2…
## $ title       <chr> "Nov  2 Newly remodeled 4br/3ba (2 mastersuites) in quiet …
## $ descr       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Summe…
## $ details     <chr> "<p class=\"row\" data-latitude=\"37.534936\" data-longitu…

rent2 %>% skim() #confirms that all the NAs in baths and sqft have been dropped

Data summary
Name	Piped data
Number of rows	14659
Number of columns	17
_______________________
Column type frequency:
character	8
factor	1
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
post_id	0	1.00	10	14	14659
nhood	0	1.00	4	38	146
city	0	1.00	5	14	88
county	103	0.99	4	13	10
address	12872	0.12	1	38	1389
title	134	0.99	6	298	14203
descr	12778	0.13	44	8880	1739
details	12064	0.18	11	593	2436

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
baths	0	1	FALSE	14	2: 5654, 1: 5496, 2.5: 1398, 3: 914

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
date	0	1.00	20136779.10	25138.89	20010625.00	20120216.00	20140824.00	20160126.00	20180717.00	▁▁▂▇▇
year	0	1.00	2013.62	2.52	2001.00	2012.00	2014.00	2016.00	2018.00	▁▁▂▇▇
price	0	1.00	2784.55	1583.52	350.00	1850.00	2450.00	3250.00	24000.00	▇▁▁▁▁
beds	30	1.00	2.32	1.03	0.00	2.00	2.00	3.00	8.00	▂▇▁▁▁
sqft	0	1.00	1272.56	697.82	110.00	885.50	1100.00	1500.00	22000.00	▇▁▁▁▁
room_in_apt	0	1.00	0.00	0.05	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
lat	12143	0.17	37.63	0.34	33.79	37.36	37.65	37.79	39.07	▁▁▁▇▁
lon	12661	0.14	-122.16	0.26	-123.20	-122.40	-122.10	-121.95	-121.21	▁▅▇▇▁

head(rent2) #Shows that the values are what I expected based on possible values for each of the variables

## # A tibble: 6 × 17
##   post_id          date  year nhood city  county price  beds baths  sqft room_…¹
##   <chr>           <dbl> <dbl> <chr> <chr> <chr>  <dbl> <dbl> <fct> <dbl>   <dbl>
## 1 4168358289     2.01e7  2013 Alam… Alam… Alame…  2595     4 3      1756       0
## 2 pre2013_59350  2.01e7  2012 Alam… Alam… Alame…  1375     2 1       700       0
## 3 pre2013_72024  2.01e7  2012 Alam… Alam… Alame…  1950     3 2      1400       0
## 4 pre2013_64956  2.01e7  2012 Alam… Alam… Alame…  1640     2 1.5     895       0
## 5 5819631756     2.02e7  2016 Alam… Alam… Alame…  3100     2 1      1200       0
## 6 pre2013_24201  2.01e7  2012 Alam… Alam… Alame…  1000     2 1      1185       0
## # … with 6 more variables: address <chr>, lat <dbl>, lon <dbl>, title <chr>,
## #   descr <chr>, details <chr>, and abbreviated variable name ¹room_in_apt

Are the values what you expected for the variables? Why or Why not?

Yes, because they’re all plausible values for the the kinds of variables I have.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#Price
#Check out how the mean price is different based on # of baths
rent2 %>% group_by(baths) %>% summarise(n=n(), mean_price=mean(price))

## # A tibble: 14 × 3
##    baths     n mean_price
##    <fct> <int>      <dbl>
##  1 1      5496      2259.
##  2 1.5     778      2480.
##  3 2      5654      2861.
##  4 2.5    1398      3168.
##  5 3       914      3764.
##  6 3.5     219      5277.
##  7 4       108      5464.
##  8 4.5      36      6828.
##  9 5        39      5338.
## 10 5.5       7     13557.
## 11 6         4      9188.
## 12 6.5       3     12500 
## 13 7         2     20000 
## 14 8         1     20950

#This shows that mean price goes up with number of baths

#Sq Ft
summary(rent2$sqft)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   110.0   885.5  1100.0  1272.6  1500.0 22000.0

rent2 %>% group_by(baths) %>% summarise(n=n(), mean_sqft=mean(sqft))

## # A tibble: 14 × 3
##    baths     n mean_sqft
##    <fct> <int>     <dbl>
##  1 1      5496      882.
##  2 1.5     778     1120.
##  3 2      5654     1289.
##  4 2.5    1398     1739.
##  5 3       914     2179.
##  6 3.5     219     2554.
##  7 4       108     3028.
##  8 4.5      36     3473.
##  9 5        39     3370.
## 10 5.5       7     4906.
## 11 6         4     4082.
## 12 6.5       3     5130 
## 13 7         2     6600 
## 14 8         1     3400

#This shows that mean sq footage goes up with number of baths for the most part, though the apt with 8 bathrooms does not follow this trend bc it's an odd outlier. I mean, who would put 8 bathrooms in a 3400 sq ft apt??

What are your findings about the summary? Are they what you expected?

Yes, they are what I expected because as number of baths goes up, price goes up.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ggplot(rent2, aes(baths, price))+
  geom_boxplot()+
  labs(x="Number of Bathrooms", y="Rental Price per Month ($)", title = "Rental Price vs Number of Bathrooms in San Francisco Apartments")+ theme_bw()

ggplot(rent2, aes(sqft, price))+
  geom_point()+
  labs(x="Size (Sq ft)", y="Rental Price per Month ($)", title = "Rental Price vs Size of San Francisco Apartments")+ theme_dark()

library(gtsummary)
model1 <- lm(price~baths, data = rent2)
model2 <- lm(price~baths + sqft, data = rent2)
tbl_regression(model1)

Characteristic	Beta	95% CI¹	p-value
baths
1	—	—
1.5	221	115, 326	<0.001
2	601	549, 653	<0.001
2.5	909	826, 991	<0.001
3	1,504	1,406, 1,603	<0.001
3.5	3,018	2,828, 3,207	<0.001
4	3,205	2,937, 3,472	<0.001
4.5	4,569	4,108, 5,029	<0.001
5	3,079	2,637, 3,522	<0.001
5.5	11,298	10,257, 12,339	<0.001
6	6,928	5,551, 8,305	<0.001
6.5	10,241	8,651, 11,831	<0.001
7	17,741	15,794, 19,688	<0.001
8	18,691	15,937, 21,444	<0.001
¹ CI = Confidence Interval

tbl_regression(model2)

Characteristic	Beta	95% CI¹	p-value
baths
1	—	—
1.5	25	-76, 126	0.6
2	266	214, 319	<0.001
2.5	205	119, 291	<0.001
3	439	331, 548	<0.001
3.5	1,645	1,451, 1,839	<0.001
4	1,443	1,173, 1,713	<0.001
4.5	2,442	1,990, 2,893	<0.001
5	1,036	602, 1,470	<0.001
5.5	7,994	6,988, 9,000	<0.001
6	4,301	2,982, 5,619	<0.001
6.5	6,753	5,228, 8,278	<0.001
7	13,046	11,176, 14,917	<0.001
8	16,623	13,998, 19,249	<0.001
sqft	0.82	0.78, 0.86	<0.001
¹ CI = Confidence Interval

Final Summary (10 points)

Summarize your research question and findings below. Research question: How does the number of bathrooms affect rent price in San Francisco?

Findings: Rent price increases as the number of bathrooms increases. Rent price also increases as the size of the apt increases.

Based on my regression models, even when apt size is controlled for, the number of bathrooms is still significantly associated with higher rent prices. When sq footage is controlled for, the association between number of baths and rent price is slightly weaker, but number of baths has a much stronger association with rent price than does sq footage.

Are your findings what you expected? Why or Why not? Yes - bathrooms are expensive to construct and highly coveted, so it makes sense they would increase the rental value.