Please submit your .Rmd
and .html
files in
Sakai. If you are working together, both people should submit the
files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
.csv
file into your data
folder. This resource is probably the
easiest to deatl with.Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join()
(i.e. left_join()
)pivot_*()
(i.e. pivot_longer()
)function()
The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Given your question, what is your expectation about the data?
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
# Get the Data
# Read in with tidytuesdayR package
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest
# Either ISO-8601 date or year/week works!
billboard <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv')
## Rows: 327895 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): url, week_id, song, performer, song_id
## dbl (5): week_position, instance, previous_week_position, peak_position, wee...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
audio <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv')
## Rows: 29503 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): song_id, performer, song, spotify_genre, spotify_track_id, spotify...
## dbl (14): spotify_track_duration_ms, danceability, energy, key, loudness, mo...
## lgl (1): spotify_track_explicit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(billboard)
## Rows: 327,895
## Columns: 10
## $ url <chr> "http://www.billboard.com/charts/hot-100/1965-0…
## $ week_id <chr> "7/17/1965", "7/24/1965", "7/31/1965", "8/7/196…
## $ week_position <dbl> 34, 22, 14, 10, 8, 8, 14, 36, 97, 90, 97, 97, 9…
## $ song <chr> "Don't Just Stand There", "Don't Just Stand The…
## $ performer <chr> "Patty Duke", "Patty Duke", "Patty Duke", "Patt…
## $ song_id <chr> "Don't Just Stand TherePatty Duke", "Don't Just…
## $ instance <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ previous_week_position <dbl> 45, 34, 22, 14, 10, 8, 8, 14, NA, 97, 90, 97, 9…
## $ peak_position <dbl> 34, 22, 14, 10, 8, 8, 8, 8, 97, 90, 90, 90, 90,…
## $ weeks_on_chart <dbl> 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 1, …
skim(billboard)
Name | billboard |
Number of rows | 327895 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
url | 0 | 1 | 50 | 51 | 0 | 3279 | 0 |
week_id | 0 | 1 | 8 | 10 | 0 | 3279 | 0 |
song | 0 | 1 | 1 | 75 | 0 | 24360 | 0 |
performer | 0 | 1 | 1 | 113 | 0 | 10061 | 0 |
song_id | 0 | 1 | 5 | 123 | 0 | 29389 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
week_position | 0 | 1.0 | 50.50 | 28.87 | 1 | 25.5 | 50 | 75 | 100 | ▇▇▇▇▇ |
instance | 0 | 1.0 | 1.07 | 0.33 | 1 | 1.0 | 1 | 1 | 10 | ▇▁▁▁▁ |
previous_week_position | 31954 | 0.9 | 47.60 | 28.06 | 1 | 23.0 | 47 | 72 | 100 | ▇▇▇▇▆ |
peak_position | 0 | 1.0 | 41.36 | 29.54 | 1 | 14.0 | 39 | 66 | 100 | ▇▅▅▅▃ |
weeks_on_chart | 0 | 1.0 | 9.15 | 7.59 | 1 | 4.0 | 7 | 13 | 87 | ▇▁▁▁▁ |
glimpse(audio)
## Rows: 29,503
## Columns: 22
## $ song_id <chr> "-twistin'-White Silver SandsBill Black's Co…
## $ performer <chr> "Bill Black's Combo", "Augie Rios", "Andy Wi…
## $ song <chr> "-twistin'-White Silver Sands", "¿Dònde Està…
## $ spotify_genre <chr> "[]", "['novelty']", "['adult standards', 'b…
## $ spotify_track_id <chr> NA, NA, "3tvqPPpXyIgKrm4PR9HCf0", "1fHHq3qHU…
## $ spotify_track_preview_url <chr> NA, NA, "https://p.scdn.co/mp3-preview/cef48…
## $ spotify_track_duration_ms <dbl> NA, NA, 166106, 172066, 211066, 208186, 2055…
## $ spotify_track_explicit <lgl> NA, NA, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
## $ spotify_track_album <chr> NA, NA, "The Essential Andy Williams", "Comp…
## $ danceability <dbl> NA, NA, 0.154, 0.588, 0.759, 0.613, NA, 0.64…
## $ energy <dbl> NA, NA, 0.185, 0.672, 0.699, 0.764, NA, 0.68…
## $ key <dbl> NA, NA, 5, 11, 0, 2, NA, 2, NA, NA, 7, NA, 1…
## $ loudness <dbl> NA, NA, -14.063, -17.278, -5.745, -6.509, NA…
## $ mode <dbl> NA, NA, 1, 0, 0, 1, NA, 0, NA, NA, 1, NA, 0,…
## $ speechiness <dbl> NA, NA, 0.0315, 0.0361, 0.0307, 0.1360, NA, …
## $ acousticness <dbl> NA, NA, 0.91100, 0.00256, 0.20200, 0.05270, …
## $ instrumentalness <dbl> NA, NA, 2.67e-04, 7.45e-01, 1.31e-04, 0.00e+…
## $ liveness <dbl> NA, NA, 0.1120, 0.1450, 0.4430, 0.1970, NA, …
## $ valence <dbl> NA, NA, 0.150, 0.801, 0.907, 0.417, NA, 0.95…
## $ tempo <dbl> NA, NA, 83.969, 121.962, 92.960, 160.015, NA…
## $ time_signature <dbl> NA, NA, 4, 4, 4, 4, NA, 4, NA, NA, 4, NA, 4,…
## $ spotify_track_popularity <dbl> NA, NA, 38, 11, 77, 73, 61, 40, NA, NA, 31, …
skim(audio)
Name | audio |
Number of rows | 29503 |
Number of columns | 22 |
_______________________ | |
Column type frequency: | |
character | 7 |
logical | 1 |
numeric | 14 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
song_id | 0 | 1.00 | 5 | 123 | 0 | 29386 | 0 |
performer | 0 | 1.00 | 1 | 113 | 0 | 10059 | 0 |
song | 0 | 1.00 | 1 | 75 | 0 | 24359 | 0 |
spotify_genre | 1600 | 0.95 | 2 | 319 | 0 | 3375 | 0 |
spotify_track_id | 5106 | 0.83 | 22 | 22 | 0 | 24177 | 0 |
spotify_track_preview_url | 15012 | 0.49 | 107 | 107 | 0 | 14371 | 0 |
spotify_track_album | 5112 | 0.83 | 1 | 2688 | 0 | 13125 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
spotify_track_explicit | 5106 | 0.83 | 0.12 | FAL: 21449, TRU: 2948 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
spotify_track_duration_ms | 5106 | 0.83 | 220684.26 | 67746.71 | 29688.00 | 175053.00 | 214850.00 | 253253.00 | 3079157.00 | ▇▁▁▁▁ |
danceability | 5169 | 0.82 | 0.60 | 0.15 | 0.00 | 0.50 | 0.61 | 0.71 | 0.99 | ▁▂▆▇▂ |
energy | 5169 | 0.82 | 0.62 | 0.20 | 0.00 | 0.48 | 0.63 | 0.78 | 1.00 | ▁▃▆▇▅ |
key | 5169 | 0.82 | 5.23 | 3.56 | 0.00 | 2.00 | 5.00 | 8.00 | 11.00 | ▇▃▃▅▆ |
loudness | 5169 | 0.82 | -8.66 | 3.60 | -28.03 | -11.03 | -8.20 | -5.86 | 2.29 | ▁▁▅▇▁ |
mode | 5169 | 0.82 | 0.73 | 0.45 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
speechiness | 5169 | 0.82 | 0.07 | 0.08 | 0.00 | 0.03 | 0.04 | 0.07 | 0.95 | ▇▁▁▁▁ |
acousticness | 5169 | 0.82 | 0.29 | 0.28 | 0.00 | 0.05 | 0.20 | 0.51 | 0.99 | ▇▃▂▂▁ |
instrumentalness | 5169 | 0.82 | 0.03 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.98 | ▇▁▁▁▁ |
liveness | 5169 | 0.82 | 0.19 | 0.16 | 0.01 | 0.09 | 0.13 | 0.25 | 1.00 | ▇▂▁▁▁ |
valence | 5169 | 0.82 | 0.60 | 0.24 | 0.00 | 0.42 | 0.62 | 0.80 | 0.99 | ▂▅▇▇▇ |
tempo | 5169 | 0.82 | 120.28 | 28.05 | 0.00 | 99.06 | 118.91 | 136.48 | 241.01 | ▁▃▇▂▁ |
time_signature | 5169 | 0.82 | 3.93 | 0.32 | 0.00 | 4.00 | 4.00 | 4.00 | 5.00 | ▁▁▁▇▁ |
spotify_track_popularity | 5106 | 0.83 | 41.22 | 22.48 | 0.00 | 23.00 | 43.00 | 59.00 | 100.00 | ▆▆▇▆▁ |
audio_songs <- audio %>% select(song, performer, spotify_genre) #created a new dataset and selecting only the ones that I need for my research question
billboard_list <- billboard %>% select(week_id, song, performer) #created a new dataset and selecting only the ones that I need for my research question
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
# this keeps the data frame on the left and matches record on the right data frame. I've chose to left_join with billboard list first as I wanted to keep my dates, which was the most important for my research question.
song_list <-
billboard_list %>%
left_join(audio_songs)
## Joining, by = c("song", "performer")
song_list <-na.omit(song_list[song_list$spotify_genre != "[]",]) #remove songs that doesn't contain any genre
song_list <- cSplit(song_list, 'spotify_genre', ',', type.convert = FALSE) #split multiple listed genres into separate ones, this is done by dividing anything that has a "," between. Since each genre has a comma between them, this will allow me to separate multiple genres apart.
song_list <- song_list %>%
mutate(dates = as.Date(week_id, format = "%m/%d/%Y")) #I wanted to convert the week_id into dates so eventually I can group my song genres by their corresponding dates. This allows me to create a column of dates that will actually be in "dates" rather than character.
song_list_years <- song_list %>% # now that I have dates, I mutated another new column where I can separate each song genre by their decade, going from the 50's to the present.
mutate(
"song year" = case_when(
dates < "1960-01-01" ~ "50's",
dates >= "1960-01-01" & dates < "1970-01-01" ~ "60's",
dates >= "1970-01-01" & dates < "1980-01-01" ~ "70's",
dates >= "1980-01-01" & dates < "1990-01-01" ~ "80's",
dates >= "1990-01-01" & dates < "2000-01-01" ~ "90's",
dates >= "2000-01-01" & dates < "2010-01-01" ~ "00's",
dates >= "2010-01-01" & dates < "2020-01-01" ~ "2010's",
dates >= "2020-01-01" ~ "2020's"
)
)
song_list_years$`song year` <- factor(song_list_years$`song year`, levels = c("50's", "60's", "70's", "80's", "90's", "00's", "2010's", "2020's")) # After creating my decades, I've decided to factor each decade so I'm able to organize my data easier.
song_list_years <- song_list_years[!duplicated(song_list_years$song),] #I wanted to remove any duplicate song because my goal is to just see which individual songs were played mainly in each decade. I didn't want to have duplicate because some songs were listed multiple times which I felt wouldn't have been good indicators if the same song were always played.
song_list_years <- song_list_years %>% #since there is an abundance of genres, I've converted some of these genres into bigger names for simplicity
mutate(
"genre" = case_when(
grepl("blues", spotify_genre_01) ~ "Blues",
grepl("jazz|bebop|post-bop|boogie-woogie|bossa nova", spotify_genre_01) ~ "Jazz",
grepl("electro|edm|house|techno|freestyle|disco|dub|big room", spotify_genre_01) ~ "Electronic",
grepl("rock|metal|grunge|invasion|punk|mellow gold|alt z|emo|permanent wave", spotify_genre_01) ~ "Rock",
grepl("country|honky|cowboy western|bluegrass|red dirt|nashville", spotify_genre_01) ~ "Country",
grepl("rhythm and blues|rb|doowop|funk|soul|gospel|r&b|quiet storm|motown", spotify_genre_01) ~ "R&B",
grepl("hip hop|rap|trap|bounce|crunk|New jack swing|miami bass", spotify_genre_01) ~ "Hip Hop",
grepl("pop|standard", spotify_genre_01) ~ "Pop",
grepl("classical|orchestra|instrument", spotify_genre_01) ~ "Classical",
TRUE ~ "Others"
)
)
Genre_Year <- song_list_years %>% select(`song year`,genre) #after tweaking my dataset, I've decided to keep the the main things I needed to proceed with my analysis and these are the ones I mainly needed.
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
glimpse(Genre_Year)
## Rows: 21,214
## Columns: 2
## $ `song year` <fct> 60's, 90's, 70's, 70's, 80's, 80's, 00's, 90's, 70's, 80's…
## $ genre <chr> "Pop", "R&B", "Pop", "Pop", "Rock", "Rock", "Pop", "Countr…
head(Genre_Year)
## song year genre
## 1: 60's Pop
## 2: 90's R&B
## 3: 70's Pop
## 4: 70's Pop
## 5: 80's Rock
## 6: 80's Rock
Are the values what you expected for the variables? Why or Why not?
Use
group_by()
andsummarize()
to make a summary of the data here. The summary should be relevant to your research question
Genre_Year %>% group_by(`song year`, genre) %>% summarise(n=n())
## `summarise()` has grouped output by 'song year'. You can override using the
## `.groups` argument.
## # A tibble: 72 × 3
## # Groups: song year [8]
## `song year` genre n
## <fct> <chr> <int>
## 1 50's Blues 34
## 2 50's Classical 6
## 3 50's Country 49
## 4 50's Hip Hop 7
## 5 50's Jazz 4
## 6 50's Others 75
## 7 50's Pop 348
## 8 50's R&B 7
## 9 50's Rock 26
## 10 60's Blues 179
## # … with 62 more rows
Genre_Year %>% tabyl(genre, `song year`) %>% adorn_totals() %>% gt()
genre | 50's | 60's | 70's | 80's | 90's | 00's | 2010's | 2020's |
---|---|---|---|---|---|---|---|---|
Blues | 34 | 179 | 109 | 27 | 4 | 6 | 2 | 0 |
Classical | 6 | 42 | 17 | 11 | 2 | 0 | 0 | 0 |
Country | 49 | 178 | 165 | 75 | 191 | 579 | 604 | 85 |
Electronic | 0 | 32 | 236 | 269 | 172 | 9 | 72 | 5 |
Hip Hop | 7 | 38 | 16 | 33 | 392 | 493 | 1166 | 222 |
Jazz | 4 | 71 | 26 | 27 | 10 | 0 | 0 | 0 |
Others | 75 | 469 | 272 | 197 | 407 | 269 | 358 | 40 |
Pop | 348 | 2610 | 1300 | 757 | 540 | 873 | 1068 | 115 |
R&B | 7 | 570 | 676 | 163 | 175 | 109 | 154 | 17 |
Rock | 26 | 360 | 987 | 1472 | 572 | 377 | 172 | 14 |
Total | 556 | 4549 | 3804 | 3031 | 2465 | 2715 | 3596 | 498 |
Genre_Year_Count <- Genre_Year %>% group_by(`song year`, genre) %>% tally()
What are your findings about the summary? Are they what you expected?
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
ggplot(Genre_Year_Count, aes(x = `song year`,
y = n,
fill = genre)) +
geom_bar(stat = "identity") +
labs(title = "Popular genre of music in years",
x = "Years (in decades)",
y = "Number of songs",
fill = "Music Genre")
ggplot(Genre_Year, aes(x = `song year`)) +
geom_bar(position = "dodge", fill = "Blue") +
facet_wrap(~genre, scales = "free", ncol = 5) +
scale_y_continuous(limit=c(0,3000)) +
theme(axis.text.x=element_text(angle = -90, hjust = 0)) +
labs(title = "Individual genre popularity by years",
x = "Years (in decades)",
y = "Number of songs")
ggplot(Genre_Year_Count, aes(x = `song year`,
y = n,
group = 1
)) +
geom_line(color = "red") +
facet_wrap(~genre, scales = "free", ncol = 5) +
scale_y_continuous(limit=c(0,2800)) +
theme(axis.text.x=element_text(angle = -90, hjust = 0)) +
labs(title = "Individual genre popularity by years",
x = "Years (in decades)",
y = "Number of songs")
ggplot(Genre_Year_Count) +
aes(x=`song year`, y=genre, fill=(n)) +
geom_tile() +
scale_fill_gradient(low = "purple", high = "red") +
labs(title = "Individual genre popularity by years",
x = "Years (in decades)",
y = "Popularity of genre",
fill = "Song count")
Summarize your research question and findings below.
Are your findings what you expected? Why or Why not?