Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.

Potential Sources for data:
Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deatl with.
You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.

Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:

mutate()
group_by()
summarize()
ggplot()

and at least one of the following:

case_when()
across()
*_join() (i.e. left_join())
pivot_*() (i.e. pivot_longer())
function()

The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

For my research, I’ve decided to use billboard’s top 100 songs from the tidytuesdayr package. I’m hoping to see which genre of music was most popular in each decade and potentially see the rise and fall of pop music. I’m drawn into this data set because music has always been a big part of my life and I enjoy listening to a variety of music. I’m also interested in seeing how music has change or grown, so I think it’ll be fun to see how music taste has changed over the years.

Given your question, what is your expectation about the data?

I’m expecting to see a mixture of genres be popular in the 50’s & 60’s, then a rise in disco music in the 70’s and 80’s. Finally R&B will become popular in the 90’s while rap being popular in the 2000’s. We will then finally see a rise in pop music in 2010 to present. I also believe we will see a gradual rise in pop music throughout the years.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!


billboard <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv')

## Rows: 327895 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): url, week_id, song, performer, song_id
## dbl (5): week_position, instance, previous_week_position, peak_position, wee...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

audio <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv')

## Rows: 29503 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): song_id, performer, song, spotify_genre, spotify_track_id, spotify...
## dbl (14): spotify_track_duration_ms, danceability, energy, key, loudness, mo...
## lgl  (1): spotify_track_explicit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(billboard)

## Rows: 327,895
## Columns: 10
## $ url                    <chr> "http://www.billboard.com/charts/hot-100/1965-0…
## $ week_id                <chr> "7/17/1965", "7/24/1965", "7/31/1965", "8/7/196…
## $ week_position          <dbl> 34, 22, 14, 10, 8, 8, 14, 36, 97, 90, 97, 97, 9…
## $ song                   <chr> "Don't Just Stand There", "Don't Just Stand The…
## $ performer              <chr> "Patty Duke", "Patty Duke", "Patty Duke", "Patt…
## $ song_id                <chr> "Don't Just Stand TherePatty Duke", "Don't Just…
## $ instance               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ previous_week_position <dbl> 45, 34, 22, 14, 10, 8, 8, 14, NA, 97, 90, 97, 9…
## $ peak_position          <dbl> 34, 22, 14, 10, 8, 8, 8, 8, 97, 90, 90, 90, 90,…
## $ weeks_on_chart         <dbl> 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 1, …

skim(billboard)

Data summary
Name	billboard
Number of rows	327895
Number of columns	10
_______________________
Column type frequency:
character	5
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
url	1	50	51	3279
week_id	1	8	10	3279
song	1	1	75	24360
performer	1	1	113	10061
song_id	1	5	123	29389

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
week_position	0	1.0	50.50	28.87	1	25.5	50	75	100	▇▇▇▇▇
instance	0	1.0	1.07	0.33	1	1.0	1	1	10	▇▁▁▁▁
previous_week_position	31954	0.9	47.60	28.06	1	23.0	47	72	100	▇▇▇▇▆
peak_position	0	1.0	41.36	29.54	1	14.0	39	66	100	▇▅▅▅▃
weeks_on_chart	0	1.0	9.15	7.59	1	4.0	7	13	87	▇▁▁▁▁

glimpse(audio)

## Rows: 29,503
## Columns: 22
## $ song_id                   <chr> "-twistin'-White Silver SandsBill Black's Co…
## $ performer                 <chr> "Bill Black's Combo", "Augie Rios", "Andy Wi…
## $ song                      <chr> "-twistin'-White Silver Sands", "¿Dònde Està…
## $ spotify_genre             <chr> "[]", "['novelty']", "['adult standards', 'b…
## $ spotify_track_id          <chr> NA, NA, "3tvqPPpXyIgKrm4PR9HCf0", "1fHHq3qHU…
## $ spotify_track_preview_url <chr> NA, NA, "https://p.scdn.co/mp3-preview/cef48…
## $ spotify_track_duration_ms <dbl> NA, NA, 166106, 172066, 211066, 208186, 2055…
## $ spotify_track_explicit    <lgl> NA, NA, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
## $ spotify_track_album       <chr> NA, NA, "The Essential Andy Williams", "Comp…
## $ danceability              <dbl> NA, NA, 0.154, 0.588, 0.759, 0.613, NA, 0.64…
## $ energy                    <dbl> NA, NA, 0.185, 0.672, 0.699, 0.764, NA, 0.68…
## $ key                       <dbl> NA, NA, 5, 11, 0, 2, NA, 2, NA, NA, 7, NA, 1…
## $ loudness                  <dbl> NA, NA, -14.063, -17.278, -5.745, -6.509, NA…
## $ mode                      <dbl> NA, NA, 1, 0, 0, 1, NA, 0, NA, NA, 1, NA, 0,…
## $ speechiness               <dbl> NA, NA, 0.0315, 0.0361, 0.0307, 0.1360, NA, …
## $ acousticness              <dbl> NA, NA, 0.91100, 0.00256, 0.20200, 0.05270, …
## $ instrumentalness          <dbl> NA, NA, 2.67e-04, 7.45e-01, 1.31e-04, 0.00e+…
## $ liveness                  <dbl> NA, NA, 0.1120, 0.1450, 0.4430, 0.1970, NA, …
## $ valence                   <dbl> NA, NA, 0.150, 0.801, 0.907, 0.417, NA, 0.95…
## $ tempo                     <dbl> NA, NA, 83.969, 121.962, 92.960, 160.015, NA…
## $ time_signature            <dbl> NA, NA, 4, 4, 4, 4, NA, 4, NA, NA, 4, NA, 4,…
## $ spotify_track_popularity  <dbl> NA, NA, 38, 11, 77, 73, 61, 40, NA, NA, 31, …

skim(audio)

Data summary
Name	audio
Number of rows	29503
Number of columns	22
_______________________
Column type frequency:
character	7
logical	1
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
song_id	0	1.00	5	123	29386
performer	0	1.00	1	113	10059
song	0	1.00	1	75	24359
spotify_genre	1600	0.95	2	319	3375
spotify_track_id	5106	0.83	22	22	24177
spotify_track_preview_url	15012	0.49	107	107	14371
spotify_track_album	5112	0.83	1	2688	13125

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
spotify_track_explicit	5106	0.83	0.12	FAL: 21449, TRU: 2948

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
spotify_track_duration_ms	5106	0.83	220684.26	67746.71	29688.00	175053.00	214850.00	253253.00	3079157.00	▇▁▁▁▁
danceability	5169	0.82	0.60	0.15	0.00	0.50	0.61	0.71	0.99	▁▂▆▇▂
energy	5169	0.82	0.62	0.20	0.00	0.48	0.63	0.78	1.00	▁▃▆▇▅
key	5169	0.82	5.23	3.56	0.00	2.00	5.00	8.00	11.00	▇▃▃▅▆
loudness	5169	0.82	-8.66	3.60	-28.03	-11.03	-8.20	-5.86	2.29	▁▁▅▇▁
mode	5169	0.82	0.73	0.45	0.00	0.00	1.00	1.00	1.00	▃▁▁▁▇
speechiness	5169	0.82	0.07	0.08	0.00	0.03	0.04	0.07	0.95	▇▁▁▁▁
acousticness	5169	0.82	0.29	0.28	0.00	0.05	0.20	0.51	0.99	▇▃▂▂▁
instrumentalness	5169	0.82	0.03	0.14	0.00	0.00	0.00	0.00	0.98	▇▁▁▁▁
liveness	5169	0.82	0.19	0.16	0.01	0.09	0.13	0.25	1.00	▇▂▁▁▁
valence	5169	0.82	0.60	0.24	0.00	0.42	0.62	0.80	0.99	▂▅▇▇▇
tempo	5169	0.82	120.28	28.05	0.00	99.06	118.91	136.48	241.01	▁▃▇▂▁
time_signature	5169	0.82	3.93	0.32	0.00	4.00	4.00	4.00	5.00	▁▁▁▇▁
spotify_track_popularity	5106	0.83	41.22	22.48	0.00	23.00	43.00	59.00	100.00	▆▆▇▆▁

audio_songs <- audio %>% select(song, performer, spotify_genre) #created a new dataset and selecting only the ones that I need for my research question
billboard_list <- billboard %>% select(week_id, song, performer) #created a new dataset and selecting only the ones that I need for my research question

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

There was a lot of data that I didn’t really need to approach my analysis. I’ve selected only ones that were important to my research question to help make my data more cleaner.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# this keeps the data frame on the left and matches record on the right data frame. I've chose to left_join with billboard list first as I wanted to keep my dates, which was the most important for my research question. 
song_list <- 
  billboard_list %>%
  left_join(audio_songs)

## Joining, by = c("song", "performer")

song_list <-na.omit(song_list[song_list$spotify_genre != "[]",]) #remove songs that doesn't contain any genre 
song_list <- cSplit(song_list, 'spotify_genre', ',', type.convert = FALSE) #split multiple listed genres into separate ones, this is done by dividing anything that has a "," between. Since each genre has a comma between them, this will allow me to separate multiple genres apart.

song_list <- song_list %>%
  mutate(dates = as.Date(week_id, format = "%m/%d/%Y")) #I wanted to convert the week_id into dates so eventually I can group my song genres by their corresponding dates. This allows me to create a column of dates that will actually be in "dates" rather than character.

song_list_years <- song_list %>%  # now that I have dates, I mutated another new column where I can separate each song genre by their decade, going from the 50's to the present. 
  mutate(
    "song year" = case_when(
    dates < "1960-01-01" ~ "50's",
    dates >= "1960-01-01" & dates < "1970-01-01" ~ "60's",
    dates >= "1970-01-01" & dates < "1980-01-01" ~ "70's",
    dates >= "1980-01-01" & dates < "1990-01-01" ~ "80's",
    dates >= "1990-01-01" & dates < "2000-01-01" ~ "90's",
    dates >= "2000-01-01" & dates < "2010-01-01" ~ "00's",
    dates >= "2010-01-01" & dates < "2020-01-01" ~ "2010's",
    dates >= "2020-01-01" ~ "2020's"
    )
  ) 

song_list_years$`song year` <- factor(song_list_years$`song year`, levels = c("50's", "60's", "70's", "80's", "90's", "00's", "2010's", "2020's")) # After creating my decades, I've decided to factor each decade so I'm able to organize my data easier.

song_list_years <- song_list_years[!duplicated(song_list_years$song),] #I wanted to remove any duplicate song because my goal is to just see which individual songs were played mainly in each decade. I didn't want to have duplicate because some songs were listed multiple times which I felt wouldn't have been good indicators if the same song were always played.

song_list_years <- song_list_years %>%  #since there is an abundance of genres, I've converted some of these genres into bigger names for simplicity
  mutate(
    "genre" = case_when(
      grepl("blues", spotify_genre_01) ~ "Blues",
      grepl("jazz|bebop|post-bop|boogie-woogie|bossa nova", spotify_genre_01) ~ "Jazz",
      grepl("electro|edm|house|techno|freestyle|disco|dub|big room", spotify_genre_01) ~ "Electronic",
      grepl("rock|metal|grunge|invasion|punk|mellow gold|alt z|emo|permanent wave", spotify_genre_01) ~ "Rock",
      grepl("country|honky|cowboy western|bluegrass|red dirt|nashville", spotify_genre_01) ~ "Country",
      grepl("rhythm and blues|rb|doowop|funk|soul|gospel|r&b|quiet storm|motown", spotify_genre_01) ~ "R&B",
      grepl("hip hop|rap|trap|bounce|crunk|New jack swing|miami bass", spotify_genre_01) ~ "Hip Hop",
      grepl("pop|standard", spotify_genre_01) ~ "Pop",
      grepl("classical|orchestra|instrument", spotify_genre_01) ~ "Classical",
      TRUE ~ "Others"
    )
)

Genre_Year <- song_list_years %>% select(`song year`,genre) #after tweaking my dataset, I've decided to keep the the main things I needed to proceed with my analysis and these are the ones I mainly needed.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Within my data set, I did end up using a left_join. I could’ve used right_join as well since I’ve narrowed my data set small enough and it would’ve done the same thing, i just had to make sure my data frame’s were in opposite position. Left_joins just retains everything from the left data frame and then only match records from the other. I didn’t want to use inner_join as this only spits out matching records, so I would’ve lost my week_id and genre columns. I’ve ended up choosing to left join with my billboard_list data frame as I wanted to ensure I kept my dates.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(Genre_Year)

## Rows: 21,214
## Columns: 2
## $ `song year` <fct> 60's, 90's, 70's, 70's, 80's, 80's, 00's, 90's, 70's, 80's…
## $ genre       <chr> "Pop", "R&B", "Pop", "Pop", "Rock", "Rock", "Pop", "Countr…

head(Genre_Year)

##    song year genre
## 1:      60's   Pop
## 2:      90's   R&B
## 3:      70's   Pop
## 4:      70's   Pop
## 5:      80's  Rock
## 6:      80's  Rock

Are the values what you expected for the variables? Why or Why not?

After transforming the data, this wasn’t something I was expecting. I managed to obtain most of the information I needed but was still hoping to grab a count of all genres played for each decade. I removed most of the information in our data as I felt they were irrelevant to our research question. I’ve also had to adjust the genre as well, since there were a variety of them, I’ve only selected the first one listed as they’ll be considered the main genre for that song. I’ve also changed most of the genres into the bigger categories in music to simplify things since music genre is always evolving, an example would be categorizing rap with hip hop.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

Genre_Year %>% group_by(`song year`, genre) %>% summarise(n=n())

## `summarise()` has grouped output by 'song year'. You can override using the
## `.groups` argument.

## # A tibble: 72 × 3
## # Groups:   song year [8]
##    `song year` genre         n
##    <fct>       <chr>     <int>
##  1 50's        Blues        34
##  2 50's        Classical     6
##  3 50's        Country      49
##  4 50's        Hip Hop       7
##  5 50's        Jazz          4
##  6 50's        Others       75
##  7 50's        Pop         348
##  8 50's        R&B           7
##  9 50's        Rock         26
## 10 60's        Blues       179
## # … with 62 more rows

Genre_Year %>% tabyl(genre, `song year`) %>% adorn_totals() %>% gt()

genre	50's	60's	70's	80's	90's	00's	2010's	2020's
Blues	34	179	109	27	4	6	2	0
Classical	6	42	17	11	2	0	0	0
Country	49	178	165	75	191	579	604	85
Electronic	0	32	236	269	172	9	72	5
Hip Hop	7	38	16	33	392	493	1166	222
Jazz	4	71	26	27	10	0	0	0
Others	75	469	272	197	407	269	358	40
Pop	348	2610	1300	757	540	873	1068	115
R&B	7	570	676	163	175	109	154	17
Rock	26	360	987	1472	572	377	172	14
Total	556	4549	3804	3031	2465	2715	3596	498

Genre_Year_Count <- Genre_Year %>% group_by(`song year`, genre) %>% tally()

What are your findings about the summary? Are they what you expected?

After grouping our data, I was able to do a count to help me distinguish how many genres were played during each decade. After doing this, this is what I’ve expected for my data, however, the values weren’t what I expected though. i was surprised to see the popularity in rock music. I was also shocked to see that R&B and hip hop were lower than I’ve expected in the 90’s and 2000’s but shocked to see hip hop have a significant increase in 2010. I knew pop music was big in our country but didn’t expect to drop in the early 2000’s as well. However, we do see that pop music has a slow upcoming growth as we continue through the years but it did take a significant dip after the 60’s.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ggplot(Genre_Year_Count, aes(x = `song year`,
                             y = n,
                             fill = genre)) +
geom_bar(stat = "identity") +

  labs(title = "Popular genre of music in years",
       x = "Years (in decades)",
       y = "Number of songs",
       fill = "Music Genre")

ggplot(Genre_Year, aes(x = `song year`)) +
geom_bar(position = "dodge", fill = "Blue") +
  facet_wrap(~genre, scales = "free", ncol = 5) +
  scale_y_continuous(limit=c(0,3000)) + 
  theme(axis.text.x=element_text(angle = -90, hjust = 0)) + 
  
      labs(title = "Individual genre popularity by years",
       x = "Years (in decades)",
       y = "Number of songs")

ggplot(Genre_Year_Count, aes(x = `song year`,
                       y = n,
                       group = 1
                       )) +
  geom_line(color = "red") +
  facet_wrap(~genre, scales = "free", ncol = 5) +
  scale_y_continuous(limit=c(0,2800)) + 
  theme(axis.text.x=element_text(angle = -90, hjust = 0)) +
  
  labs(title = "Individual genre popularity by years",
       x = "Years (in decades)",
       y = "Number of songs")

ggplot(Genre_Year_Count) +
  aes(x=`song year`, y=genre, fill=(n)) +
  geom_tile() +
  scale_fill_gradient(low = "purple", high = "red") + 
  
    labs(title = "Individual genre popularity by years",
       x = "Years (in decades)",
       y = "Popularity of genre",
       fill = "Song count")

Final Summary (10 points)

Summarize your research question and findings below.

My research question was to find out what was the top genre of each decade according to Billboard’s Top 100. After my analysis, we are able to visualize and see which genre was most popular in each decade. With our table and graphs, we are also able to see the rise and fall of pop music. We see that rock music was becoming popular in the 70’s and reaching it’s peak in the 80’s. This would make sense as those years had really big rock groups, such as Aerosmith, Queens, and Metallica to name a few. We can also see pop music being really popular in the 60’s and start to significantly drop afterwards due to the popularity of rock music. Pop music eventually climbs back up but unfortunately, it seems like hip hop music is dominating our more recent years.

Are your findings what you expected? Why or Why not?

I’m quite surprised with my findings and it’s not what i was expecting. I was mainly expecting pop music to remain consistent and gradually grow over time but it turns out it was hip hop that actually had that trend. I was quite surprised that pop music dipped very low and didn’t expect rock music to be superior in the 80’s, I would’ve thought disco, which would’ve been categorized as electronic, to be popular during that time. Another interesting thing was seeing pop music wasn’t the popular genre for our recent years and it was actually hip hop. Pop did come quite close but seems like hip hop was been dominating the more recent years. Pop music seems to be the most popular genre in our timeline as they have extraordinary numbers. Overall, this data was taken from Billboards Top 100 which mean there could be some bias. This is because billboard ranking is determined by sales, radio play and streaming which can be heavily influenced by many factors.

Midterm

Brendon Tran

2023-02-19