Define Your Research Question (10 points)
I have always loved art and history. Visiting the Louvre in Paris was a highlight of my life. This inspired me to work on the “Art History data” from Tidytuesday (https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-17/readme.md). The data was collected to assess the demographic representation of artists through editions of Janson’s History of Art and Gardner’s Art Through the Ages, two of the most popular art history textbooks used in the American education system.
After exploring the data, I became interested in the correlation between the space artists occupy in the textbooks and the number of times they have been exhibited at the Museum of Modern Art (MoMA). I am curious to see whether occupying more space in the two main textbooks corresponds to having more exhibitions at MoMA. I am also interested in whether this correlation differs across time periods. My expectation is that there is a correlation: I expect famous artists both to be featured heavily in the books and to have more exhibitions at MoMA.
Loading the Data (10 points)
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
#Data imported from tidytuesday
#artists <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience
#/tidytuesday/master/data/2023/2023-01-17/artists.csv')
#save data to permanent location in data subfolder of project folder
#write.csv(artists, file = here("data/artists.csv"))
#Uploading from the data folder
artists <- read_csv("data/artists.csv")
## New names:
## Rows: 3162 Columns: 15
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (8): artist_name, artist_nationality, artist_nationality_other, artist_g... dbl
## (7): ...1, edition_number, year, space_ratio_per_page_total, artist_uniq...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Let's use glimpse()
glimpse(artists)
## Rows: 3,162
## Columns: 15
## $ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ artist_name <chr> "Aaron Douglas", "Aaron Douglas", "Aaron Do…
## $ edition_number <dbl> 9, 10, 11, 12, 13, 14, 15, 16, 14, 15, 16, …
## $ year <dbl> 1991, 1996, 2001, 2005, 2009, 2013, 2016, 2…
## $ artist_nationality <chr> "American", "American", "American", "Americ…
## $ artist_nationality_other <chr> "American", "American", "American", "Americ…
## $ artist_gender <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ artist_race <chr> "Black or African American", "Black or Afri…
## $ artist_ethnicity <chr> "Not Hispanic or Latino origin", "Not Hispa…
## $ book <chr> "Gardner", "Gardner", "Gardner", "Gardner",…
## $ space_ratio_per_page_total <dbl> 0.3533658, 0.3739470, 0.3032593, 0.3770489,…
## $ artist_unique_id <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 6, 6, 6…
## $ moma_count_to_year <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ whitney_count_to_year <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ artist_race_nwi <chr> "Non-White", "Non-White", "Non-White", "Non…
Let's skim the data
skim(artists)
Name | artists |
Number of rows | 3162 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
artist_name | 0 | 1.00 | 4 | 99 | 0 | 413 | 0 |
artist_nationality | 0 | 1.00 | 3 | 18 | 0 | 52 | 0 |
artist_nationality_other | 0 | 1.00 | 5 | 8 | 0 | 6 | 0 |
artist_gender | 0 | 1.00 | 3 | 6 | 0 | 3 | 0 |
artist_race | 0 | 1.00 | 3 | 41 | 0 | 6 | 0 |
artist_ethnicity | 58 | 0.98 | 25 | 29 | 0 | 2 | 0 |
book | 0 | 1.00 | 6 | 7 | 0 | 2 | 0 |
artist_race_nwi | 0 | 1.00 | 5 | 9 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
...1 | 0 | 1 | 1581.50 | 912.94 | 1.00 | 791.25 | 1581.50 | 2371.75 | 3162.0 | ▇▇▇▇▇ |
edition_number | 0 | 1 | 8.22 | 4.40 | 1.00 | 5.00 | 8.00 | 12.00 | 16.0 | ▇▇▆▅▆ |
year | 0 | 1 | 1994.24 | 19.20 | 1926.00 | 1986.00 | 1996.00 | 2009.00 | 2020.0 | ▁▁▃▇▇ |
space_ratio_per_page_total | 0 | 1 | 0.53 | 0.39 | 0.09 | 0.31 | 0.41 | 0.59 | 3.8 | ▇▁▁▁▁ |
artist_unique_id | 0 | 1 | 201.76 | 114.18 | 1.00 | 108.00 | 189.00 | 305.75 | 413.0 | ▆▇▇▆▆ |
moma_count_to_year | 0 | 1 | 4.31 | 7.79 | 0.00 | 0.00 | 1.00 | 5.00 | 64.0 | ▇▁▁▁▁ |
whitney_count_to_year | 0 | 1 | 1.96 | 5.19 | 0.00 | 0.00 | 0.00 | 0.00 | 40.0 | ▇▁▁▁▁ |
This step shows that the data was loaded correctly. There are eight character variables (artist_name, artist_nationality, artist_nationality_other, artist_gender, artist_race, artist_ethnicity, book and artist_race_nwi) and seven numeric variables (the row-index column ...1, edition_number, year, space_ratio_per_page_total, artist_unique_id, moma_count_to_year and whitney_count_to_year). The “artist_ethnicity” variable has 58 missing values, which are already coded as NA in the original dataset. Some nationalities are coded as hyphenated dual nationalities (e.g., German-American); we will split these into two columns. The variables artist_gender, artist_race and artist_nationality have some values coded as the string “N/A”; we will change these to NA. We might not use these variables, but it is good practice. We will also categorize the variable “year”. All data types are already coded correctly. We will convert character vectors to factors as we code.
I just want to mention that all of this code is my own code that I modified from the class materials.
Transforming the data (15 points)
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
Converting N/A to NA
artists <- artists %>% mutate(
  artist_gender = na_if(artist_gender, "N/A"))
artists <- artists %>% mutate(
  artist_race = na_if(artist_race, "N/A"))
artists <- artists %>% mutate(
  artist_nationality = na_if(artist_nationality, "N/A"))
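The three conversions above could also be written as a single step. The sketch below is one way to do it with across(); it runs on a small made-up table (`toy` is a hypothetical stand-in for `artists`, and its values are invented for illustration).

```r
library(dplyr)

# Toy data standing in for the real `artists` table (values are made up).
toy <- tibble(
  artist_gender      = c("Male", "N/A", "Female"),
  artist_race        = c("White", "N/A", "N/A"),
  artist_nationality = c("French", "American", "N/A")
)

# Apply na_if() to all three columns in a single mutate().
toy_clean <- toy %>%
  mutate(across(
    c(artist_gender, artist_race, artist_nationality),
    ~ na_if(.x, "N/A")
  ))

# Count the missing values per column after the conversion.
colSums(is.na(toy_clean))
```

Listing the columns explicitly keeps the conversion limited to the variables known to contain the "N/A" string.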
Making two new variables called nationality1 and nationality2 by splitting the column “artist_nationality”.
artists1 <- artists %>%
  separate(col = artist_nationality, into = c("nationality1", "nationality2"), sep = "-")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2983 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
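The warning above is expected, because most artists have a single nationality and so there is no second piece to fill in. If the warning is unwanted, the `fill` argument of tidyr's separate() can state that behaviour explicitly. A minimal sketch on made-up data (`toy` is hypothetical):

```r
library(dplyr)
library(tidyr)

# Toy data: one single nationality, one hyphenated pair, one missing value.
toy <- tibble(artist_nationality = c("American", "German-American", NA))

toy_split <- toy %>%
  separate(
    col  = artist_nationality,
    into = c("nationality1", "nationality2"),
    sep  = "-",
    fill = "right"  # pad missing pieces on the right with NA, without a warning
  )

toy_split
```

The resulting columns are the same as with the default setting; only the warning is suppressed.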
Making a new variable called “year_cat” by categorizing the variable “year” into 20-year intervals.
artists1 <- artists1 %>%
  mutate(
    # Lower bounds match the labels, so the intervals do not overlap
    year_cat = case_when(
      (year >= 1921) & (year <= 1940) ~ "1921-1940",
      (year >= 1941) & (year <= 1960) ~ "1941-1960",
      (year >= 1961) & (year <= 1980) ~ "1961-1980",
      (year >= 1981) & (year <= 2000) ~ "1981-2000",
      (year >= 2001) ~ "2001-2020"
    )
  ) %>%
  mutate(year_cat = factor(year_cat))
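As an alternative to case_when(), base R's cut() can bin the years and return a factor in one call. This is only a sketch of the same 20-year binning on a few hand-picked years; the `years` vector is made up to exercise the interval boundaries.

```r
# Hand-picked years that sit on or next to the interval boundaries.
years <- c(1926, 1940, 1941, 1960, 1961, 2000, 2001, 2020)

# With the default right = TRUE, each interval is (lower, upper],
# so 1940 falls in "1921-1940" and 1941 falls in "1941-1960".
year_cat <- cut(
  years,
  breaks = c(1920, 1940, 1960, 1980, 2000, 2020),
  labels = c("1921-1940", "1941-1960", "1961-1980", "1981-2000", "2001-2020")
)

data.frame(years, year_cat)
```

Because cut() returns a factor with the levels in break order, no separate factor() step is needed.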
Make sure the data types are coded correctly!
The transformed table should include the new categorized “year_cat” variable as a factor. The nationality variable should be split into two columns. All “N/A” values should be converted to NA. We can use glimpse() and skim() to make sure everything is accurate.
glimpse(artists1)
## Rows: 3,162
## Columns: 17
## $ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ artist_name <chr> "Aaron Douglas", "Aaron Douglas", "Aaron Do…
## $ edition_number <dbl> 9, 10, 11, 12, 13, 14, 15, 16, 14, 15, 16, …
## $ year <dbl> 1991, 1996, 2001, 2005, 2009, 2013, 2016, 2…
## $ nationality1 <chr> "American", "American", "American", "Americ…
## $ nationality2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ artist_nationality_other <chr> "American", "American", "American", "Americ…
## $ artist_gender <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ artist_race <chr> "Black or African American", "Black or Afri…
## $ artist_ethnicity <chr> "Not Hispanic or Latino origin", "Not Hispa…
## $ book <chr> "Gardner", "Gardner", "Gardner", "Gardner",…
## $ space_ratio_per_page_total <dbl> 0.3533658, 0.3739470, 0.3032593, 0.3770489,…
## $ artist_unique_id <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 6, 6, 6…
## $ moma_count_to_year <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ whitney_count_to_year <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ artist_race_nwi <chr> "Non-White", "Non-White", "Non-White", "Non…
## $ year_cat <fct> 1981-2000, 1981-2000, 2001-2020, 2001-2020,…
skim(artists1)
Name | artists1 |
Number of rows | 3162 |
Number of columns | 17 |
_______________________ | |
Column type frequency: | |
character | 9 |
factor | 1 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
artist_name | 0 | 1.00 | 4 | 99 | 0 | 413 | 0 |
nationality1 | 23 | 0.99 | 4 | 17 | 0 | 39 | 0 |
nationality2 | 3006 | 0.05 | 6 | 8 | 0 | 3 | 0 |
artist_nationality_other | 0 | 1.00 | 5 | 8 | 0 | 6 | 0 |
artist_gender | 58 | 0.98 | 4 | 6 | 0 | 2 | 0 |
artist_race | 29 | 0.99 | 5 | 41 | 0 | 5 | 0 |
artist_ethnicity | 58 | 0.98 | 25 | 29 | 0 | 2 | 0 |
book | 0 | 1.00 | 6 | 7 | 0 | 2 | 0 |
artist_race_nwi | 0 | 1.00 | 5 | 9 | 0 | 2 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
year_cat | 0 | 1 | FALSE | 5 | 200: 1541, 198: 906, 196: 497, 194: 147 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
...1 | 0 | 1 | 1581.50 | 912.94 | 1.00 | 791.25 | 1581.50 | 2371.75 | 3162.0 | ▇▇▇▇▇ |
edition_number | 0 | 1 | 8.22 | 4.40 | 1.00 | 5.00 | 8.00 | 12.00 | 16.0 | ▇▇▆▅▆ |
year | 0 | 1 | 1994.24 | 19.20 | 1926.00 | 1986.00 | 1996.00 | 2009.00 | 2020.0 | ▁▁▃▇▇ |
space_ratio_per_page_total | 0 | 1 | 0.53 | 0.39 | 0.09 | 0.31 | 0.41 | 0.59 | 3.8 | ▇▁▁▁▁ |
artist_unique_id | 0 | 1 | 201.76 | 114.18 | 1.00 | 108.00 | 189.00 | 305.75 | 413.0 | ▆▇▇▆▆ |
moma_count_to_year | 0 | 1 | 4.31 | 7.79 | 0.00 | 0.00 | 1.00 | 5.00 | 64.0 | ▇▁▁▁▁ |
whitney_count_to_year | 0 | 1 | 1.96 | 5.19 | 0.00 | 0.00 | 0.00 | 0.00 | 40.0 | ▇▁▁▁▁ |
Let's look at our newly made categorical variable “year_cat”
artists1 %>% tabyl(year_cat) %>%
  gt::gt() %>%
  tab_header(
    title = "Number of times various artists are mentioned in the textbooks")
Number of times various artists are mentioned in the textbooks | ||
year_cat | n | percent |
---|---|---|
1921-1940 | 71 | 0.02245414 |
1941-1960 | 147 | 0.04648956 |
1961-1980 | 497 | 0.15717900 |
1981-2000 | 906 | 0.28652751 |
2001-2020 | 1541 | 0.48734978 |
Are the values what you expected for the variables? Why or Why not?
The transformations I performed appear to have worked correctly. The categorical variable year_cat, which groups year into 20-year intervals, is coded correctly, and the nationality variable was split into two columns. The values of year_cat also make sense: the number of artist entries grows in each successive period, which suggests that the textbooks featured more artists over time and that more editions were printed in later years.
Visualizing and Summarizing the Data (15 points)
Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question.
We will look at the mean of the variable “space_ratio_per_page_total” and the maximum of the variable “moma_count_to_year” to answer our research question.
# Grouping by artist: mean space ratio occupied in the books
# and maximum number of MoMA exhibitions
artists1 %>% group_by(artist_unique_id, artist_name) %>%
  summarize(mean_space_ratio_per_page_total = mean(space_ratio_per_page_total),
            max_moma_exhibitions = max(moma_count_to_year)) %>%
  arrange(desc(mean_space_ratio_per_page_total))
## `summarise()` has grouped output by 'artist_unique_id'. You can override using
## the `.groups` argument.
## # A tibble: 413 × 4
## # Groups: artist_unique_id [413]
## artist_unique_id artist_name mean_space_ratio_per…¹ max_m…²
## <dbl> <chr> <dbl> <dbl>
## 1 317 Pablo Picasso 2.54 36
## 2 111 Eugène Delacroix 1.65 1
## 3 318 Paul Cézanne 1.60 39
## 4 90 Édouard Manet 1.55 0
## 5 121 Francisco Goya 1.52 3
## 6 187 Jacques-Louis David 1.43 0
## 7 361 Sigmar Polke 1.17 2
## 8 389 Vincent Van Gogh 1.13 12
## 9 195 Jean Auguste Dominique Ingres 1.06 0
## 10 372 Théodore Géricault 1.06 0
## # … with 403 more rows, and abbreviated variable names
## # ¹mean_space_ratio_per_page_total, ²max_moma_exhibitions
# To find out if the correlation differs by time, include year_cat.
# year_cat is passed through unsummarised, so the result keeps one row
# per original row, with the per-artist mean and max repeated on each.
artists1 %>% group_by(artist_unique_id, artist_name) %>%
  summarize(mean_space_ratio_per_page_total =
              mean(space_ratio_per_page_total),
            max_moma_exhibitions = max(moma_count_to_year), year_cat) %>%
  arrange(desc(mean_space_ratio_per_page_total)) %>%
  print(n = 35)
## `summarise()` has grouped output by 'artist_unique_id', 'artist_name'. You can
## override using the `.groups` argument.
## # A tibble: 3,162 × 5
## # Groups: artist_unique_id, artist_name [413]
## artist_unique_id artist_name mean_space_ratio_per_page…¹ max_m…² year_…³
## <dbl> <chr> <dbl> <dbl> <fct>
## 1 317 Pablo Picasso 2.54 36 1921-1…
## 2 317 Pablo Picasso 2.54 36 1921-1…
## 3 317 Pablo Picasso 2.54 36 1941-1…
## 4 317 Pablo Picasso 2.54 36 1941-1…
## 5 317 Pablo Picasso 2.54 36 1961-1…
## 6 317 Pablo Picasso 2.54 36 1961-1…
## 7 317 Pablo Picasso 2.54 36 1961-1…
## 8 317 Pablo Picasso 2.54 36 1981-2…
## 9 317 Pablo Picasso 2.54 36 1981-2…
## 10 317 Pablo Picasso 2.54 36 1981-2…
## 11 317 Pablo Picasso 2.54 36 2001-2…
## 12 317 Pablo Picasso 2.54 36 2001-2…
## 13 317 Pablo Picasso 2.54 36 2001-2…
## 14 317 Pablo Picasso 2.54 36 2001-2…
## 15 317 Pablo Picasso 2.54 36 2001-2…
## 16 317 Pablo Picasso 2.54 36 2001-2…
## 17 317 Pablo Picasso 2.54 36 1961-1…
## 18 317 Pablo Picasso 2.54 36 1961-1…
## 19 317 Pablo Picasso 2.54 36 1961-1…
## 20 317 Pablo Picasso 2.54 36 1981-2…
## 21 317 Pablo Picasso 2.54 36 1981-2…
## 22 317 Pablo Picasso 2.54 36 1981-2…
## 23 317 Pablo Picasso 2.54 36 2001-2…
## 24 317 Pablo Picasso 2.54 36 2001-2…
## 25 317 Pablo Picasso 2.54 36 2001-2…
## 26 111 Eugène Delacroix 1.65 1 1921-1…
## 27 111 Eugène Delacroix 1.65 1 1921-1…
## 28 111 Eugène Delacroix 1.65 1 1941-1…
## 29 111 Eugène Delacroix 1.65 1 1941-1…
## 30 111 Eugène Delacroix 1.65 1 1961-1…
## 31 111 Eugène Delacroix 1.65 1 1961-1…
## 32 111 Eugène Delacroix 1.65 1 1961-1…
## 33 111 Eugène Delacroix 1.65 1 1981-2…
## 34 111 Eugène Delacroix 1.65 1 1981-2…
## 35 111 Eugène Delacroix 1.65 1 1981-2…
## # … with 3,127 more rows, and abbreviated variable names
## # ¹mean_space_ratio_per_page_total, ²max_moma_exhibitions, ³year_cat
What are your findings about the summary? Are they what you expected?
My expectation was that there would be a correlation between being featured in the textbooks and being featured in exhibitions at MoMA. I expected famous artists to be featured both in the books and in exhibitions at MoMA. From the summary above, however, my assumption might not be accurate: there is no apparent trend. For example, Pablo Picasso is the most featured artist in the books and has had a large number of exhibitions at MoMA (36), but Eugène Delacroix, who is also heavily featured in the books, has had only one. Scatter plots will help us answer this research question further.
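Reading a trend off a table is difficult, so a correlation coefficient can put a number on the relationship. The sketch below uses Spearman's rank correlation, which is my own choice here rather than something the assignment prescribes; it seems reasonable because both variables are strongly right-skewed in the skim() histograms. The `toy` data frame is invented for illustration and is not the real per-artist summary.

```r
# Made-up per-artist summary with the same column names as above.
toy <- data.frame(
  mean_space_ratio_per_page_total = c(2.1, 1.4, 1.2, 0.9, 0.5, 0.3),
  max_moma_exhibitions            = c(20, 2, 0, 9, 4, 0)
)

# Spearman's rho uses ranks, so a few extreme artists do not dominate it.
rho <- cor(
  toy$mean_space_ratio_per_page_total,
  toy$max_moma_exhibitions,
  method = "spearman"
)

rho
```

A value near 0 would be consistent with no monotonic association, while a value near 1 or -1 would indicate a strong one.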
Make at least two plots that help you answer your question on the transformed or summarized data.
# Make a data frame grouped by artist: mean space ratio
# occupied in the books and maximum number of MoMA exhibitions.
artists2 <- artists1 %>% group_by(artist_unique_id) %>%
  summarize(mean_space_ratio_per_page_total = mean(space_ratio_per_page_total),
            max_moma_exhibitions = max(moma_count_to_year)) %>%
  arrange(desc(mean_space_ratio_per_page_total))
# Making the scatter plot
plot <- ggplot(artists2, aes(x = mean_space_ratio_per_page_total, y = max_moma_exhibitions)) +
  geom_point(size = 1.5, color = "red", shape = 17) +
  stat_smooth(method = lm) +
  labs(title = "Scatter plot of max no of exhibitions VS mean space ratio
occupied in the textbooks",
       x = "Mean space ratio occupied in Janson and Gardner",
       y = "Max number of exhibitions at MoMA",
       caption = "Source: Art History data from Tidytuesday"
  )
plot + theme_bw() +
  theme(axis.text.x = element_text())
## `geom_smooth()` using formula = 'y ~ x'
# Make a data frame grouped by artist: mean space ratio occupied
# in the books and maximum number of MoMA exhibitions, keeping the
# categorized year_cat variable for faceting.
artists3 <- artists1 %>% group_by(artist_unique_id) %>%
  summarize(mean_space_ratio_per_page_total = mean(space_ratio_per_page_total),
            max_moma_exhibitions = max(moma_count_to_year), artist_name,
            year_cat) %>% arrange(desc(mean_space_ratio_per_page_total))
## `summarise()` has grouped output by 'artist_unique_id'. You can override using
## the `.groups` argument.
# Making the scatter plot
plot1 <- ggplot(artists3, aes(x = mean_space_ratio_per_page_total, y = max_moma_exhibitions)) +
  geom_point(size = 1, color = "red", shape = 17) +
  stat_smooth(method = lm) + facet_wrap("year_cat") +
  labs(title = "Scatter plot of max no of exhibitions VS mean space ratio
occupied in the textbooks ",
       x = "Mean space ratio occupied in Janson and Gardner",
       y = "Max number of exhibitions at MoMA",
       caption = "Source: Art History data from Tidytuesday"
  )
plot1 + theme_bw() +
  theme(axis.text.x = element_text(angle = 90))
## `geom_smooth()` using formula = 'y ~ x'
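One point worth noting about the faceted plot: artists3 is grouped by artist only, so the mean and maximum are computed across all editions and then repeated for every year_cat the artist appears in. If period-specific values are wanted, one option is to group by both the artist and year_cat. The sketch below shows that idea on a small made-up table (`toy` and `per_period` are hypothetical names, and the values are invented).

```r
library(dplyr)

# Toy data: two artists, each appearing in two periods (values are made up).
toy <- tibble(
  artist_unique_id           = c(1, 1, 1, 2, 2),
  year_cat                   = factor(c("1981-2000", "2001-2020", "2001-2020",
                                        "1981-2000", "2001-2020")),
  space_ratio_per_page_total = c(1.0, 2.0, 3.0, 0.4, 0.6),
  moma_count_to_year         = c(3, 5, 6, 0, 1)
)

# Group by artist AND period so each summary value is specific to a period.
per_period <- toy %>%
  group_by(artist_unique_id, year_cat) %>%
  summarize(
    mean_space_ratio_per_page_total = mean(space_ratio_per_page_total),
    max_moma_exhibitions            = max(moma_count_to_year),
    .groups = "drop"  # return an ungrouped tibble and silence the message
  )

per_period
```

A data frame shaped like `per_period` could be passed to the same ggplot() call with facet_wrap("year_cat"), so that each panel shows values computed only from that period's editions.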
Final Summary (10 points)
Summarize your research question and findings below. Are your findings what you expected? Why or Why not?
My research question was whether there is a correlation between the space artists occupy in Janson’s History of Art and Gardner’s Art Through the Ages and the number of times they have been exhibited at the Museum of Modern Art (MoMA). I was expecting to see a correlation, since it seems logical that famous artists would be featured heavily both in MoMA exhibitions and in the books. But the data does not support my hypothesis. Both the summarized data and the scatter plots show no clear relationship between being exhibited at MoMA and the space occupied in the Gardner and Janson textbooks. Similarly, no association between the two was seen when the data was split into time periods. One explanation might be that MoMA features lesser-known artists more often. Since the museum tends to feature the most progressive tendencies in modern art, this is perhaps the most logical explanation.