Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
.csv file into your data folder. This resource is probably the easiest to deal with.
Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in Slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())
function()
The code chunks below are guides; please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like.
If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
I think it will be interesting to view the different data visualizations and see how the data changes based on my manipulations. This also appears to be car data from India, so it will be interesting to compare it with what I know about American car data and to see car values in rupees. I am not going to convert rupees to dollars because the exchange rate is constantly changing.
Research Question: Is there an association between selling price and km driven, by owner number and year?
Given your question, what is your expectation about the data?
My hypothesis is that the price of a car will decrease with greater km driven. I also suspect that the price will trend lower if the car has had more than one owner, and that older cars will have more km driven and a lower selling price.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
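# Importing packages (assumed: tidyverse, which provides read_csv(), the dplyr verbs, and ggplot())
library(tidyverse)
# Loading the csv file into R; data obtained from kaggle.com
car_details <- read_csv("~/desktop/car_details.csv",
                        na = c("missing", "Missing", "NA"))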
## Rows: 4340 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, fuel, seller_type, transmission, owner
## dbl (3): year, selling_price, km_driven
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If there are any quirks that you have to deal with, such as NA coded as something else or data spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
This was addressed when the data was imported: the na argument to read_csv() treats "missing", "Missing", and "NA" as missing values.
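One way to confirm that the import handled missing values is to count the NAs in each column afterwards. The sketch below uses a small made-up table (toy_cars and its values are hypothetical stand-ins, not the real car_details data) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for car_details; the values are invented for illustration
toy_cars <- tibble(
  name          = c("Car A", "Car B", NA),
  selling_price = c(60000, NA, 450000),
  km_driven     = c(70000, 50000, 141000)
)

# Count the missing values in every column
na_counts <- toy_cars %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Running the same summarise() call on car_details would show whether any column still contains missing values after import.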
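Make sure your data types are correct!
The data types are numeric (year, selling_price, km_driven) and categorical (name, fuel, seller_type, transmission, owner, all stored as character).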
If the data needs to be transformed in any way (values recoded, pivoted, etc.), do it here. Examples include transforming a continuous variable into a categorical one using case_when(), etc.
I am looking at distance in miles rather than km so it is easier to conceptualize.
# Convert km to miles (1 mile = 1.609344 km), then rename the column to match
car_details_miles <- car_details %>%
  mutate(
    across(.cols = c(km_driven),
           .fns = ~ .x / 1.609344)) %>%
  rename(miles_driven = km_driven)
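As a quick arithmetic check of the conversion, the sketch below applies the same division to three round distances. The table name toy_km and the km_check column are hypothetical; the three km values were chosen because, converted, they match the first three miles_driven values that glimpse() prints for the transformed table.

```r
library(dplyr)

km_per_mile <- 1.609344  # one mile is exactly 1.609344 km

# Hypothetical three-row table of distances in km
toy_km <- tibble(km_driven = c(70000, 50000, 100000))

toy_miles <- toy_km %>%
  mutate(
    miles_driven = km_driven / km_per_mile,   # same conversion as above
    km_check     = miles_driven * km_per_mile # convert back as a sanity check
  )

toy_miles
```

The miles_driven values come out to roughly 43495.98, 31068.56, and 62137.12, and km_check returns the original distances.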
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
car_details_miles %>%
glimpse()
## Rows: 4,340
## Columns: 8
## $ name <chr> "Maruti 800 AC", "Maruti Wagon R LXI Minor", "Hyundai Ve…
## $ year <dbl> 2007, 2007, 2012, 2017, 2014, 2007, 2016, 2014, 2015, 20…
## $ selling_price <dbl> 60000, 135000, 600000, 250000, 450000, 140000, 550000, 2…
## $ miles_driven <dbl> 43495.98, 31068.56, 62137.12, 28583.07, 87613.34, 77671.…
## $ fuel <chr> "Petrol", "Petrol", "Diesel", "Petrol", "Diesel", "Petro…
## $ seller_type <chr> "Individual", "Individual", "Individual", "Individual", …
## $ transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manua…
## $ owner <chr> "First Owner", "First Owner", "First Owner", "First Owne…
Observation: the miles driven values are smaller than the original km values because of the conversion.
Are the values what you expected for the variables? Why or Why not?
Yes, these variables are what I expected. I expected the values for miles driven to be smaller than those for km driven because the conversion divides by 1.609344.
Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question.
#Looking at mean data by car year
car_details_miles %>%
group_by(year) %>%
summarise(across(c(selling_price, miles_driven), list(mean=mean)))
## # A tibble: 27 × 3
## year selling_price_mean miles_driven_mean
## <dbl> <dbl> <dbl>
## 1 1992 50000 62137.
## 2 1995 95000 62137.
## 3 1996 225000 29515.
## 4 1997 93000 55923.
## 5 1998 214000 40130.
## 6 1999 73500 42068.
## 7 2000 81500 44078.
## 8 2001 117650. 52017.
## 9 2002 90714. 52846.
## 10 2003 86565. 50748.
## # … with 17 more rows
Observation: as the model year increases, the mean selling price generally goes up and the mean number of miles driven goes down.
#Looking at data by number of owners
car_details_miles %>%
group_by(owner) %>%
summarise(across(c(selling_price, miles_driven), list(mean=mean)))
## # A tibble: 5 × 3
## owner selling_price_mean miles_driven_mean
## <chr> <dbl> <dbl>
## 1 First Owner 598637. 34806.
## 2 Fourth & Above Owner 173901. 61602.
## 3 Second Owner 343891. 50818.
## 4 Test Drive Car 954294. 2582.
## 5 Third Owner 269474. 61705.
What are your findings about the summary? Are they what you expected?
Looking first at the data by year, the older cars generally appear to have more miles on them, though there are some outliers in that trend. I didn’t expect there to be outliers in the nineties for miles driven. This could be a sample-size effect: if only a few cars were sold from those years, a single car has a large influence on the mean.
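The sample-size explanation can be checked by adding a count to the grouped summary. This is a minimal sketch on made-up rows (toy_cars and its values are hypothetical), assuming dplyr is installed; the only change from the summary code above is the extra n = n() column.

```r
library(dplyr)

# Hypothetical rows: one car from 1992, several from recent years
toy_cars <- tibble(
  year          = c(1992, 2015, 2015, 2015, 2016, 2016),
  selling_price = c(50000, 400000, 450000, 500000, 600000, 650000),
  miles_driven  = c(62000, 30000, 35000, 40000, 20000, 25000)
)

year_summary <- toy_cars %>%
  group_by(year) %>%
  summarise(
    n = n(),  # number of cars behind each mean
    across(c(selling_price, miles_driven), list(mean = mean))
  )

year_summary
```

A year with n = 1 reports that single car's values as its "mean", which is one way an unusual car can look like an outlier year.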
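Looking at the data stratified by number of owners, I expected that cars with more owners would have a higher average number of miles driven and a lower average price. Cars with fewer owners do have a higher average selling price: about 598,637 for first-owner cars versus 173,901 for cars with four or more owners. Average miles driven also rises with the number of owners, from about 34,806 for first-owner cars to about 61,700 for third-owner cars, where it levels off (61,602 for fourth and above). Test drive cars have the highest average price and the fewest miles. The summarized results by number of owners matched what I hypothesized for the data.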
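Because owner is stored as text, the summary table sorts it alphabetically, which puts "Fourth & Above Owner" before "Second Owner". One possible fix, sketched below, is to convert owner to a factor with an explicit level order. The owner_levels ordering (test drive cars first) is my own assumption; the means are copied, rounded as printed, from the summary table above.

```r
library(dplyr)

# Assumed ordering from fewest to most prior owners
owner_levels <- c("Test Drive Car", "First Owner", "Second Owner",
                  "Third Owner", "Fourth & Above Owner")

# Means copied from the owner summary table (rounded as printed)
owner_means <- tibble(
  owner = c("First Owner", "Fourth & Above Owner", "Second Owner",
            "Test Drive Car", "Third Owner"),
  selling_price_mean = c(598637, 173901, 343891, 954294, 269474),
  miles_driven_mean  = c(34806, 61602, 50818, 2582, 61705)
)

owner_ordered <- owner_means %>%
  mutate(owner = factor(owner, levels = owner_levels)) %>%
  arrange(owner)  # arrange() sorts a factor by its level order

owner_ordered
```

In this order the mean selling price falls at every step, while mean miles driven rises until third owner and is nearly flat after that.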
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
#Looking at Selling Price vs Miles Driven
ggplot(car_details_miles) +
aes(x = selling_price,
y = miles_driven) +
geom_point() +
labs(title = "Selling Price versus Miles Driven",
x = "Selling Price",
y = "Miles Driven")
#Chart by year
ggplot(car_details_miles) +
aes(x = selling_price,
y = miles_driven,
color = year) +
geom_point() +
labs(title = "Selling Price versus Miles Driven by Year",
x = "Selling Price",
y = "Miles Driven")+
scale_color_gradient(low = "black", high="pink")
#Chart by Owner
ggplot(car_details_miles) +
aes(x = selling_price,
y = miles_driven,
color = owner) +
geom_point() +
labs(title = "Selling Price versus Miles Driven by Number of Owners",
x = "Selling Price",
y = "Miles Driven")
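Since the prompt asks for scales as well as labels, here is one possible variation, offered as a sketch rather than a replacement for the plots above. It puts miles driven on the x axis and selling price on a log-scaled y axis, so that a few expensive cars do not compress the rest of the points. The five rows are the first five shown in the glimpse() output; the choice of a log scale, the transparency, and the "rupees" unit label are my assumptions. It assumes ggplot2 and its dependency scales are installed.

```r
library(ggplot2)

# First five rows from the glimpse() output, miles rounded
toy_cars <- data.frame(
  selling_price = c(60000, 135000, 600000, 250000, 450000),
  miles_driven  = c(43496, 31069, 62137, 28583, 87613)
)

price_plot <- ggplot(toy_cars) +
  aes(x = miles_driven, y = selling_price) +
  geom_point(alpha = 0.4) +                         # transparency helps with overplotting
  scale_x_continuous(labels = scales::label_comma()) +
  scale_y_log10(labels = scales::label_comma()) +   # log scale spreads out skewed prices
  labs(title = "Selling Price versus Miles Driven",
       x = "Miles Driven",
       y = "Selling Price (rupees, log scale)")

price_plot
```

Swapping in car_details_miles for toy_cars would apply the same scales to the full data.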
Summarize your research question and findings below.
My research question examined the association between the selling price of a car and miles driven. I also looked at this association stratified by number of owners and by year of car. For the association between selling price and miles driven, I found that as miles driven increased the selling price decreased, and that the highest-priced cars all had low mileage. When looking at the data stratified by year, the trend shows that newer cars have lower miles and generally higher selling prices, and conversely older cars have lower selling prices and higher miles driven. Examining the data stratified by number of owners, there is not as clear a pattern in the scatter plot. In the table of averages, however, the selling price decreases and the miles driven increase as the number of owners goes up.
Are your findings what you expected? Why or Why not?
These findings were generally what I hypothesized. I expected that the selling price of a car would go down as the number of miles driven increased, and this matched the overall averages. I did think there would be a clearer pattern in the scatter plot. I also expected that older cars would have higher mileage and a lower selling price, which was shown in the data. Finally, I thought that as the number of owners increased the miles driven would increase and the price would decrease, which was also shown in the data.
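The association described above could also be summarized with a single number. The sketch below computes a Spearman rank correlation with base R's cor(); the choice of Spearman (which does not assume a straight-line relationship) is mine, and the five rows in toy_cars are invented for illustration, not taken from the data.

```r
# Hypothetical rows where price mostly falls as miles rise
toy_cars <- data.frame(
  selling_price = c(900000, 600000, 450000, 250000, 100000),
  miles_driven  = c(5000, 30000, 60000, 45000, 80000)
)

# Spearman correlation: -1 is perfectly decreasing, +1 perfectly increasing
rho <- cor(toy_cars$selling_price, toy_cars$miles_driven, method = "spearman")

rho
```

On these toy rows rho is -0.9. Running the same call on car_details_miles, overall or within each owner group, would show how strong the negative association actually is.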