Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before You Get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data:
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deal with.
  • You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.
  2. Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your dataset and research question, or you may message it to one of us on Slack or by email. Please do one of the two fairly early on. We just want to look at the data and make sure that it is appropriate for your question.

  3. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()

and at least one of the following:

  • case_when()
  • across()
  • *_join() (e.g. left_join())
  • pivot_*() (e.g. pivot_longer())
  • function()
  4. The code chunks below are guides; please add more code chunks to do what you need.

  5. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I think it will be interesting to view the different data visualizations and see how the data changes based on my manipulations. This also appears to be car data from India, so it will be interesting to compare it to my knowledge of American car data and to see the value of cars in rupees. I am not going to convert the prices from rupees to dollars because the exchange rate is constantly changing.

Research Question: Is there an association between selling price and km driven, by owner number and year?

Given your question, what is your expectation about the data? My hypothesis is that the price of the car will decrease with greater km driven. I also suspect that the price will trend lower if the car has had more than one owner, and that older cars will have more km driven and a lower selling price.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

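# Importing packages (tidyverse provides read_csv(), the dplyr verbs, and ggplot2)
library(tidyverse)

# Loading the CSV file into R; data obtained from kaggle.com
# "missing", "Missing", and "NA" are all read in as NA
car_details <- read_csv("~/desktop/car_details.csv",
                        na = c("missing", "Missing", "NA"))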
## Rows: 4340 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, fuel, seller_type, transmission, owner
## dbl (3): year, selling_price, km_driven
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

I addressed this when the data was imported: most of the missing information was coded as “NA”, and the na argument to read_csv() treats “missing”, “Missing”, and “NA” as missing values.
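The effect of the na argument can be seen on a small example. This is a minimal sketch on made-up CSV text (the names and values are hypothetical, not from car_details), assuming readr 2.0 or later so that literal data can be passed with I():

```r
library(readr)

# Toy CSV text (hypothetical values, not the car_details data).
toy_csv <- I("name,km_driven\ncar_a,1000\ncar_b,missing\ncar_c,NA\ncar_d,Missing\n")

# Treat all three spellings as missing, as in the import call above.
toy <- read_csv(
  toy_csv,
  na = c("missing", "Missing", "NA"),
  show_col_types = FALSE
)

# Count the missing values per column.
colSums(is.na(toy))
```

Because every non-numeric entry in km_driven is read as NA, the column is still parsed as numeric rather than character.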

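Make sure your data types are correct! The data types are correct: year, selling_price, and km_driven are numeric (dbl), and name, fuel, seller_type, transmission, and owner are categorical (chr).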

Transforming the Data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc.), do it here. Examples include transforming a continuous variable into a categorical one using case_when(), etc. I converted the distance driven from km to miles so that it is easier to conceptualize.

# Convert km_driven to miles (1 mile = 1.609344 km), then rename the column
car_details_miles <- car_details %>%
  mutate(
    across(.cols = c(km_driven),
           .fns = ~ .x / 1.609344)) %>%
  rename(miles_driven = km_driven)
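The prompt above also mentions turning a continuous variable into a categorical one with case_when(). A minimal sketch of that idea on toy data follows; the column mileage_group and the 30,000 and 60,000 mile cutoffs are hypothetical choices for illustration and are not part of this analysis:

```r
library(dplyr)

# Toy data (hypothetical mileages, not taken from car_details).
toy_cars <- data.frame(miles_driven = c(5000, 42000, 61000, 120000))

# Bin the continuous mileage into three categories.
# Conditions are checked in order; TRUE catches everything left over.
toy_cars <- toy_cars %>%
  mutate(
    mileage_group = case_when(
      miles_driven < 30000 ~ "Low",
      miles_driven < 60000 ~ "Medium",
      TRUE ~ "High"
    )
  )

toy_cars
```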

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

car_details_miles %>%
  glimpse()
## Rows: 4,340
## Columns: 8
## $ name          <chr> "Maruti 800 AC", "Maruti Wagon R LXI Minor", "Hyundai Ve…
## $ year          <dbl> 2007, 2007, 2012, 2017, 2014, 2007, 2016, 2014, 2015, 20…
## $ selling_price <dbl> 60000, 135000, 600000, 250000, 450000, 140000, 550000, 2…
## $ miles_driven  <dbl> 43495.98, 31068.56, 62137.12, 28583.07, 87613.34, 77671.…
## $ fuel          <chr> "Petrol", "Petrol", "Diesel", "Petrol", "Diesel", "Petro…
## $ seller_type   <chr> "Individual", "Individual", "Individual", "Individual", …
## $ transmission  <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manua…
## $ owner         <chr> "First Owner", "First Owner", "First Owner", "First Owne…

Observation: the miles driven values are smaller than the original km values because of the conversion.

Are the values what you expected for the variables? Why or Why not? Yes, these values are what I expected. I expected that the values for miles driven would be less than those for km driven, because the conversion divides each value by 1.609344.
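As a quick arithmetic check of the conversion, the first three miles_driven values shown by glimpse() can be reproduced by hand. The km values below are an assumption: they are inferred by back-converting the printed miles, since the original km_driven column is not displayed in this report.

```r
km_per_mile <- 1.609344

# Kilometre values inferred from the first three miles_driven entries
# in the glimpse() output (assumed, not shown in the original table).
km <- c(70000, 50000, 100000)

# Same conversion as in the transformation step, rounded to two decimals.
miles <- round(km / km_per_mile, 2)
miles
# 43495.98 31068.56 62137.12
```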

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

# Mean selling price and miles driven by car year

car_details_miles %>%
  group_by(year) %>%
  summarise(across(c(selling_price, miles_driven), list(mean = mean)))
## # A tibble: 27 × 3
##     year selling_price_mean miles_driven_mean
##    <dbl>              <dbl>             <dbl>
##  1  1992             50000             62137.
##  2  1995             95000             62137.
##  3  1996            225000             29515.
##  4  1997             93000             55923.
##  5  1998            214000             40130.
##  6  1999             73500             42068.
##  7  2000             81500             44078.
##  8  2001            117650.            52017.
##  9  2002             90714.            52846.
## 10  2003             86565.            50748.
## # … with 17 more rows

Observation: for newer model years, the selling price generally goes up and the number of miles driven goes down.

# Mean selling price and miles driven by number of owners

car_details_miles %>%
  group_by(owner) %>%
  summarise(across(c(selling_price, miles_driven), list(mean = mean)))
## # A tibble: 5 × 3
##   owner                selling_price_mean miles_driven_mean
##   <chr>                             <dbl>             <dbl>
## 1 First Owner                     598637.            34806.
## 2 Fourth & Above Owner            173901.            61602.
## 3 Second Owner                    343891.            50818.
## 4 Test Drive Car                  954294.             2582.
## 5 Third Owner                     269474.            61705.

What are your findings about the summary? Are they what you expected? Looking first at the data by year, older cars generally appear to have more miles on them, though there are some exceptions to that trend. I didn’t expect the exceptions among the 1990s model years for miles driven. They could be due to sample size: when a year has only a few cars, a single car can have a large effect on the mean.
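One way to check the sample-size explanation would be to add the group size to the summary with n(). This is a minimal sketch on toy data; the values are hypothetical and not taken from car_details:

```r
library(dplyr)

# Toy data (hypothetical): one car from 1992, several from later years.
toy <- data.frame(
  year          = c(1992, 2001, 2001, 2001, 2015, 2015),
  selling_price = c(50000, 100000, 120000, 130000, 500000, 600000),
  miles_driven  = c(62000, 50000, 55000, 51000, 20000, 30000)
)

# Report the number of cars alongside the means, so that means
# based on very few cars are easy to spot.
year_summary <- toy %>%
  group_by(year) %>%
  summarise(
    n = n(),
    across(c(selling_price, miles_driven), list(mean = mean))
  )

year_summary
```

A year with n equal to 1 or 2 has a mean that simply reflects those individual cars, so it should be read with caution.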

Looking at the data stratified by number of owners, I expected cars with more previous owners to have higher average miles driven and a lower average price. Cars with fewer owners do have a higher average selling price, and average miles driven rises with the number of owners, although third-owner and fourth-and-above-owner cars have nearly the same average mileage. The summarized results by number of owners matched what I hypothesized for the data.
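The owner summary is sorted alphabetically because owner is a character column, which puts “Fourth & Above Owner” before “Second Owner”. A possible way to show the groups in ownership order is to convert owner to a factor with explicit levels before summarizing. This is a sketch on toy data with hypothetical prices; only the level names come from the summary table, and placing “Test Drive Car” first is an assumption:

```r
library(dplyr)

# Toy data (hypothetical prices), using the owner labels from the summary table.
toy <- data.frame(
  owner = c("Second Owner", "First Owner", "Third Owner",
            "First Owner", "Fourth & Above Owner", "Test Drive Car"),
  selling_price = c(300000, 600000, 250000, 550000, 150000, 900000)
)

# Desired display order (assumed: test-drive cars first, then by owner count).
owner_levels <- c("Test Drive Car", "First Owner", "Second Owner",
                  "Third Owner", "Fourth & Above Owner")

# Grouping by a factor returns the summary rows in level order.
owner_summary <- toy %>%
  mutate(owner = factor(owner, levels = owner_levels)) %>%
  group_by(owner) %>%
  summarise(selling_price_mean = mean(selling_price))

owner_summary
```

The same factor would also order the legend in a plot colored by owner.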

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

#Looking at Selling Price vs Miles Driven

ggplot(car_details_miles) +
  aes(x = selling_price, 
      y = miles_driven) +
  geom_point() +
  labs(title = "Selling Price versus Miles Driven",
       x = "Selling Price",
       y = "Miles Driven")

#Chart by year

ggplot(car_details_miles) +
  aes(x = selling_price, 
      y = miles_driven, 
      color = year) +
  geom_point() +
  labs(title = "Selling Price versus Miles Driven by Year",
       x = "Selling Price",
       y = "Miles Driven")+
  scale_color_gradient(low = "black", high="pink")

#Chart by Owner

ggplot(car_details_miles) +
  aes(x = selling_price, 
      y = miles_driven, 
      color = owner) +
  geom_point() +
  labs(title = "Selling Price versus Miles Driven by Number of Owners",
       x = "Selling Price",
       y = "Miles Driven")
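The prompt asks for scales and/or labels that make each plot informative. One option, sketched here under assumptions, is to print full numbers with thousands separators on both axes and to state the units in the axis labels. The toy data are the first five rows from the glimpse() output with miles rounded; the “(rupees)” unit is taken from the research question section, and scales::label_comma() assumes the scales package (version 1.1.0 or later) is installed:

```r
library(ggplot2)

# First five rows from the glimpse() output, miles rounded to whole numbers.
toy <- data.frame(
  selling_price = c(60000, 135000, 600000, 250000, 450000),
  miles_driven  = c(43496, 31069, 62137, 28583, 87613)
)

# Same plot structure as above, with comma-formatted axes and units in labels.
p <- ggplot(toy) +
  aes(x = selling_price,
      y = miles_driven) +
  geom_point() +
  scale_x_continuous(labels = scales::label_comma()) +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(title = "Selling Price versus Miles Driven",
       x = "Selling Price (rupees)",
       y = "Miles Driven")

p
```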

Final Summary (10 points)

Summarize your research question and findings below.

My research question examined the association between the selling price of a car and miles driven. I also looked at this association stratified by number of owners and year of car. For the association between selling price and miles driven, I found that as the miles driven increased, the selling price decreased, and that the high-priced cars all had low miles. When looking at this data stratified by year, the trend shows that newer cars have lower miles and generally higher selling prices, and conversely older cars have lower selling prices and higher miles driven. Examining the data stratified by number of owners, there is not as clear a pattern in the scatter plot. However, in the table of averages for selling price and miles driven, the price decreases and the miles driven increase as the number of owners goes up.

Are your findings what you expected? Why or Why not? These findings were generally what I hypothesized. I expected that the selling price of the car would go down as the number of miles driven increased, and this matched the findings for the overall averages. I did think there would be a clearer pattern in the scatter plot. I also expected that older cars would have higher mileage and a lower selling price, which was shown in the data. I also thought that as the number of owners increased, the miles driven would increase and the price would decrease, which was shown in the data.