Midterm

Farzana Karim

2023-02-19

Define Your Research Question (10 points)


I have always loved art and history. Visiting the Louvre in Paris was a highlight of my life. This inspired me to work on the “Art History data” from TidyTuesday (https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-17/readme.md). The data were collected to assess the demographic representation of artists through editions of Janson’s History of Art and Gardner’s Art Through the Ages, two of the most popular art history textbooks used in the American education system.

After exploring the data, I became interested in the correlation between the space artists occupy in the textbooks and their having been exhibited at the Museum of Modern Art (MoMA). I am curious to see whether occupying more space in the two main textbooks corresponds to having more exhibitions at MoMA. I am also interested in how this correlation might differ across time periods. My expectation is that there will be a correlation: I expect famous artists both to be featured in the books and to have more exhibitions at MoMA.

Loading the Data (10 points)


Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# Data imported from TidyTuesday:
# artists <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-17/artists.csv")
# Save the data to a permanent location in the data subfolder of the project folder:
# write.csv(artists, file = here("data/artists.csv"))
# Load the data from the data folder:

artists <- read_csv("data/artists.csv")
## New names:
## Rows: 3162 Columns: 15
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (8): artist_name, artist_nationality, artist_nationality_other, artist_g... dbl
## (7): ...1, edition_number, year, space_ratio_per_page_total, artist_uniq...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
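
The `...1` column reported above is the row index that `write.csv()` stores by default; `read_csv()` then assigns the name `...1` to that unnamed first column. The following is a minimal sketch of one way to avoid it, using a small made-up data frame (`toy` and the temporary file path are illustrative assumptions, not part of the project):

```r
library(readr)

# Hypothetical stand-in for the artists data
toy <- data.frame(artist_name = c("A", "B"), year = c(1991, 1996))

# Temporary file used only for this illustration
path <- tempfile(fileext = ".csv")

# write_csv() never writes row names, so no extra index column is created
write_csv(toy, path)

# Read the file back and inspect the column names
roundtrip <- read_csv(path, show_col_types = FALSE)
names(roundtrip)
```

With base R, `write.csv(toy, path, row.names = FALSE)` has the same effect.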

Let’s use glimpse()

glimpse(artists)
## Rows: 3,162
## Columns: 15
## $ ...1                       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ artist_name                <chr> "Aaron Douglas", "Aaron Douglas", "Aaron Do…
## $ edition_number             <dbl> 9, 10, 11, 12, 13, 14, 15, 16, 14, 15, 16, …
## $ year                       <dbl> 1991, 1996, 2001, 2005, 2009, 2013, 2016, 2…
## $ artist_nationality         <chr> "American", "American", "American", "Americ…
## $ artist_nationality_other   <chr> "American", "American", "American", "Americ…
## $ artist_gender              <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ artist_race                <chr> "Black or African American", "Black or Afri…
## $ artist_ethnicity           <chr> "Not Hispanic or Latino origin", "Not Hispa…
## $ book                       <chr> "Gardner", "Gardner", "Gardner", "Gardner",…
## $ space_ratio_per_page_total <dbl> 0.3533658, 0.3739470, 0.3032593, 0.3770489,…
## $ artist_unique_id           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 6, 6, 6…
## $ moma_count_to_year         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ whitney_count_to_year      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ artist_race_nwi            <chr> "Non-White", "Non-White", "Non-White", "Non…

Let’s skim the data

skim(artists)
Data summary
Name artists
Number of rows 3162
Number of columns 15
_______________________
Column type frequency:
character 8
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
artist_name 0 1.00 4 99 0 413 0
artist_nationality 0 1.00 3 18 0 52 0
artist_nationality_other 0 1.00 5 8 0 6 0
artist_gender 0 1.00 3 6 0 3 0
artist_race 0 1.00 3 41 0 6 0
artist_ethnicity 58 0.98 25 29 0 2 0
book 0 1.00 6 7 0 2 0
artist_race_nwi 0 1.00 5 9 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
...1 0 1 1581.50 912.94 1.00 791.25 1581.50 2371.75 3162.0 ▇▇▇▇▇
edition_number 0 1 8.22 4.40 1.00 5.00 8.00 12.00 16.0 ▇▇▆▅▆
year 0 1 1994.24 19.20 1926.00 1986.00 1996.00 2009.00 2020.0 ▁▁▃▇▇
space_ratio_per_page_total 0 1 0.53 0.39 0.09 0.31 0.41 0.59 3.8 ▇▁▁▁▁
artist_unique_id 0 1 201.76 114.18 1.00 108.00 189.00 305.75 413.0 ▆▇▇▆▆
moma_count_to_year 0 1 4.31 7.79 0.00 0.00 1.00 5.00 64.0 ▇▁▁▁▁
whitney_count_to_year 0 1 1.96 5.19 0.00 0.00 0.00 0.00 40.0 ▇▁▁▁▁

This step shows that the data were loaded correctly. There are eight character variables (artist_name, artist_nationality, artist_nationality_other, artist_gender, artist_race, artist_ethnicity, book and artist_race_nwi) and seven numeric variables (...1, edition_number, year, space_ratio_per_page_total, artist_unique_id, moma_count_to_year and whitney_count_to_year). The ...1 column is the row index that write.csv() stored when the file was saved. There are 58 missing values in the “artist_ethnicity” variable; these are already coded as NA in the original dataset. Some nationalities are recorded as dual (e.g., German-American). We will split these into two columns. The variables artist_gender, artist_race and artist_nationality contain some values coded as the string “N/A”. We will change these to NA. We might not use these variables, but recoding them is good practice. We will also categorize the variable “year”. All other data types are already coded accurately. We will convert character vectors to factors as needed.

I want to mention that all of this code is my own, adapted from the class materials.

Transforming the data (15 points)


If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Converting N/A to NA

# Recode the string "N/A" as a true missing value in three columns
artists <- artists %>%
  mutate(
    artist_gender      = na_if(artist_gender, "N/A"),
    artist_race        = na_if(artist_race, "N/A"),
    artist_nationality = na_if(artist_nationality, "N/A")
  )
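
The same recoding can be written once for several columns with `across()`. This is a minimal sketch on made-up values (the `toy` tibble is an assumption for illustration; only the column names come from the dataset):

```r
library(dplyr)

# Hypothetical rows that mimic the "N/A" strings in the real data
toy <- tibble(
  artist_gender      = c("Male", "N/A", "Female"),
  artist_race        = c("White", "N/A", "N/A"),
  artist_nationality = c("French", "American", "N/A")
)

# Apply na_if() to each listed column in a single mutate() call
cleaned <- toy %>%
  mutate(across(
    c(artist_gender, artist_race, artist_nationality),
    ~ na_if(.x, "N/A")
  ))

# Count the missing values per column after recoding
colSums(is.na(cleaned))
```

This form is convenient when more columns need the same treatment later, since only the column list changes.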

Making two new variables called nationality1 and nationality2 by splitting the column “artist_nationality”.

# Split hyphenated nationalities (e.g. "German-American") into two columns
artists1 <- artists %>%
  separate(
    col  = artist_nationality,
    into = c("nationality1", "nationality2"),
    sep  = "-"
  )
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2983 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
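
The warning above is expected, because most artists have a single nationality and therefore only one piece. As a small sketch on made-up values (the `toy` data frame is hypothetical), the `fill` argument of `separate()` states this intent explicitly and suppresses the warning:

```r
library(tidyr)

# Hypothetical values: one single nationality, one dual, one missing
toy <- data.frame(artist_nationality = c("American", "German-American", NA))

toy_split <- separate(
  toy,
  col  = artist_nationality,
  into = c("nationality1", "nationality2"),
  sep  = "-",
  fill = "right"  # pad the missing second piece with NA, without a warning
)

toy_split
```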

Making a new variable called “year_cat” by categorizing the variable “year” into 20-year intervals.

artists1 <- artists1 %>%
  mutate(
    # Non-overlapping 20-year bins; each boundary year belongs to one bin only
    year_cat = case_when(
      year >= 1921 & year <= 1940 ~ "1921-1940",
      year >= 1941 & year <= 1960 ~ "1941-1960",
      year >= 1961 & year <= 1980 ~ "1961-1980",
      year >= 1981 & year <= 2000 ~ "1981-2000",
      year >= 2001                ~ "2001-2020"
    ),
    # Store the bins as a factor
    year_cat = factor(year_cat)
  )
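
As a cross-check on the bin boundaries, the same 20-year intervals can be sketched with base R's `cut()`, which uses right-closed intervals by default. The `years` vector below is made up to probe the boundaries and is not taken from the data:

```r
# Hypothetical years chosen to sit on the bin edges
years <- c(1926, 1940, 1941, 1960, 1961, 2000, 2001, 2020)

# Breaks define (1920, 1940], (1940, 1960], ..., (2000, 2020]
year_cat <- cut(
  years,
  breaks = c(1920, 1940, 1960, 1980, 2000, 2020),
  labels = c("1921-1940", "1941-1960", "1961-1980", "1981-2000", "2001-2020")
)

# Show each year next to the bin it falls into
data.frame(years, year_cat)
```

One difference to keep in mind: `cut()` returns NA for years outside the breaks, whereas the `case_when()` version labels every year from 2001 onward as "2001-2020".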

Make sure the data types are coded correctly!

The transformed table should include the new categorized “year_cat” variable as a factor. The nationality variable should be split into two columns. All the “N/A” strings should be converted to NA. We can use glimpse() and skim() to make sure everything is accurate.

glimpse(artists1)
## Rows: 3,162
## Columns: 17
## $ ...1                       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ artist_name                <chr> "Aaron Douglas", "Aaron Douglas", "Aaron Do…
## $ edition_number             <dbl> 9, 10, 11, 12, 13, 14, 15, 16, 14, 15, 16, …
## $ year                       <dbl> 1991, 1996, 2001, 2005, 2009, 2013, 2016, 2…
## $ nationality1               <chr> "American", "American", "American", "Americ…
## $ nationality2               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ artist_nationality_other   <chr> "American", "American", "American", "Americ…
## $ artist_gender              <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ artist_race                <chr> "Black or African American", "Black or Afri…
## $ artist_ethnicity           <chr> "Not Hispanic or Latino origin", "Not Hispa…
## $ book                       <chr> "Gardner", "Gardner", "Gardner", "Gardner",…
## $ space_ratio_per_page_total <dbl> 0.3533658, 0.3739470, 0.3032593, 0.3770489,…
## $ artist_unique_id           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 6, 6, 6…
## $ moma_count_to_year         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ whitney_count_to_year      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ artist_race_nwi            <chr> "Non-White", "Non-White", "Non-White", "Non…
## $ year_cat                   <fct> 1981-2000, 1981-2000, 2001-2020, 2001-2020,…
skim(artists1)
Data summary
Name artists1
Number of rows 3162
Number of columns 17
_______________________
Column type frequency:
character 9
factor 1
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
artist_name 0 1.00 4 99 0 413 0
nationality1 23 0.99 4 17 0 39 0
nationality2 3006 0.05 6 8 0 3 0
artist_nationality_other 0 1.00 5 8 0 6 0
artist_gender 58 0.98 4 6 0 2 0
artist_race 29 0.99 5 41 0 5 0
artist_ethnicity 58 0.98 25 29 0 2 0
book 0 1.00 6 7 0 2 0
artist_race_nwi 0 1.00 5 9 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
year_cat 0 1 FALSE 5 200: 1541, 198: 906, 196: 497, 194: 147

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
...1 0 1 1581.50 912.94 1.00 791.25 1581.50 2371.75 3162.0 ▇▇▇▇▇
edition_number 0 1 8.22 4.40 1.00 5.00 8.00 12.00 16.0 ▇▇▆▅▆
year 0 1 1994.24 19.20 1926.00 1986.00 1996.00 2009.00 2020.0 ▁▁▃▇▇
space_ratio_per_page_total 0 1 0.53 0.39 0.09 0.31 0.41 0.59 3.8 ▇▁▁▁▁
artist_unique_id 0 1 201.76 114.18 1.00 108.00 189.00 305.75 413.0 ▆▇▇▆▆
moma_count_to_year 0 1 4.31 7.79 0.00 0.00 1.00 5.00 64.0 ▇▁▁▁▁
whitney_count_to_year 0 1 1.96 5.19 0.00 0.00 0.00 0.00 40.0 ▇▁▁▁▁

Let’s look at our newly made categorical variable “year_cat”

# Frequency table of year_cat, rendered with gt
artists1 %>%
  tabyl(year_cat) %>%
  gt::gt() %>%
  tab_header(
    title = "Number of times various artists are mentioned in the textbooks"
  )
Number of times various artists are mentioned in the textbooks
year_cat n percent
1921-1940 71 0.02245414
1941-1960 147 0.04648956
1961-1980 497 0.15717900
1981-2000 906 0.28652751
2001-2020 1541 0.48734978

Are the values what you expected for the variables? Why or Why not?

The transformations I performed appear to have worked correctly. The categorical variable year_cat classifies the year into 20-year intervals as intended, and the nationality variable was split into two columns. The values of year_cat also make sense: the number of artist entries rises in each successive period, which suggests that the textbooks came to feature more artists over the years. More editions were also printed in the later periods.

Visualizing and Summarizing the Data (15 points)


Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question.

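To answer our research question, we will look at each artist’s mean of “space_ratio_per_page_total” across editions and the maximum of “moma_count_to_year”. Because moma_count_to_year counts exhibitions up to each edition’s publication year, its maximum is the artist’s exhibition count as of the latest edition in which they appear.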

# Group by artist, then compute the mean space ratio occupied in the books
# and the total number of MoMA exhibitions

artists1 %>%
  group_by(artist_unique_id, artist_name) %>%
  summarize(
    mean_space_ratio_per_page_total = mean(space_ratio_per_page_total),
    max_moma_exhibitions = max(moma_count_to_year)
  ) %>%
  arrange(desc(mean_space_ratio_per_page_total))
## `summarise()` has grouped output by 'artist_unique_id'. You can override using
## the `.groups` argument.
## # A tibble: 413 × 4
## # Groups:   artist_unique_id [413]
##    artist_unique_id artist_name                   mean_space_ratio_per…¹ max_m…²
##               <dbl> <chr>                                          <dbl>   <dbl>
##  1              317 Pablo Picasso                                   2.54      36
##  2              111 Eugène Delacroix                                1.65       1
##  3              318 Paul Cézanne                                    1.60      39
##  4               90 Édouard Manet                                   1.55       0
##  5              121 Francisco Goya                                  1.52       3
##  6              187 Jacques-Louis David                             1.43       0
##  7              361 Sigmar Polke                                    1.17       2
##  8              389 Vincent Van Gogh                                1.13      12
##  9              195 Jean Auguste Dominique Ingres                   1.06       0
## 10              372 Théodore Géricault                              1.06       0
## # … with 403 more rows, and abbreviated variable names
## #   ¹​mean_space_ratio_per_page_total, ²​max_moma_exhibitions
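# To see whether the pattern differs over time, year_cat is carried along.
# The mean and max are still computed per artist over all editions, so each
# artist's summary values repeat once per row of the original data.

artists1 %>%
  group_by(artist_unique_id, artist_name) %>%
  summarize(
    mean_space_ratio_per_page_total = mean(space_ratio_per_page_total),
    max_moma_exhibitions = max(moma_count_to_year),
    year_cat
  ) %>%
  arrange(desc(mean_space_ratio_per_page_total)) %>%
  print(n = 35)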
## `summarise()` has grouped output by 'artist_unique_id', 'artist_name'. You can
## override using the `.groups` argument.
## # A tibble: 3,162 × 5
## # Groups:   artist_unique_id, artist_name [413]
##    artist_unique_id artist_name      mean_space_ratio_per_page…¹ max_m…² year_…³
##               <dbl> <chr>                                  <dbl>   <dbl> <fct>  
##  1              317 Pablo Picasso                           2.54      36 1921-1…
##  2              317 Pablo Picasso                           2.54      36 1921-1…
##  3              317 Pablo Picasso                           2.54      36 1941-1…
##  4              317 Pablo Picasso                           2.54      36 1941-1…
##  5              317 Pablo Picasso                           2.54      36 1961-1…
##  6              317 Pablo Picasso                           2.54      36 1961-1…
##  7              317 Pablo Picasso                           2.54      36 1961-1…
##  8              317 Pablo Picasso                           2.54      36 1981-2…
##  9              317 Pablo Picasso                           2.54      36 1981-2…
## 10              317 Pablo Picasso                           2.54      36 1981-2…
## 11              317 Pablo Picasso                           2.54      36 2001-2…
## 12              317 Pablo Picasso                           2.54      36 2001-2…
## 13              317 Pablo Picasso                           2.54      36 2001-2…
## 14              317 Pablo Picasso                           2.54      36 2001-2…
## 15              317 Pablo Picasso                           2.54      36 2001-2…
## 16              317 Pablo Picasso                           2.54      36 2001-2…
## 17              317 Pablo Picasso                           2.54      36 1961-1…
## 18              317 Pablo Picasso                           2.54      36 1961-1…
## 19              317 Pablo Picasso                           2.54      36 1961-1…
## 20              317 Pablo Picasso                           2.54      36 1981-2…
## 21              317 Pablo Picasso                           2.54      36 1981-2…
## 22              317 Pablo Picasso                           2.54      36 1981-2…
## 23              317 Pablo Picasso                           2.54      36 2001-2…
## 24              317 Pablo Picasso                           2.54      36 2001-2…
## 25              317 Pablo Picasso                           2.54      36 2001-2…
## 26              111 Eugène Delacroix                        1.65       1 1921-1…
## 27              111 Eugène Delacroix                        1.65       1 1921-1…
## 28              111 Eugène Delacroix                        1.65       1 1941-1…
## 29              111 Eugène Delacroix                        1.65       1 1941-1…
## 30              111 Eugène Delacroix                        1.65       1 1961-1…
## 31              111 Eugène Delacroix                        1.65       1 1961-1…
## 32              111 Eugène Delacroix                        1.65       1 1961-1…
## 33              111 Eugène Delacroix                        1.65       1 1981-2…
## 34              111 Eugène Delacroix                        1.65       1 1981-2…
## 35              111 Eugène Delacroix                        1.65       1 1981-2…
## # … with 3,127 more rows, and abbreviated variable names
## #   ¹​mean_space_ratio_per_page_total, ²​max_moma_exhibitions, ³​year_cat
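
In the table above, each artist's mean and maximum are identical in every period because they are computed over all editions. If period-specific values are wanted, one option is to add year_cat to the grouping. The sketch below uses a made-up tibble (`toy` and its values are assumptions; only the column names match the dataset) and is not the summary used in this report:

```r
library(dplyr)

# Hypothetical artist-edition rows
toy <- tibble(
  artist_unique_id           = c(1, 1, 1, 2, 2),
  artist_name                = c("A", "A", "A", "B", "B"),
  year_cat                   = c("1981-2000", "1981-2000", "2001-2020",
                                 "2001-2020", "2001-2020"),
  space_ratio_per_page_total = c(0.2, 0.4, 0.6, 1.0, 2.0),
  moma_count_to_year         = c(0, 1, 3, 5, 5)
)

# One row per artist and period, with values computed within that period
by_period <- toy %>%
  group_by(artist_unique_id, artist_name, year_cat) %>%
  summarize(
    mean_space_ratio_per_page_total = mean(space_ratio_per_page_total),
    max_moma_exhibitions = max(moma_count_to_year),
    .groups = "drop"
  )

by_period
```

Setting `.groups = "drop"` returns an ungrouped tibble and avoids the grouping message.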

What are your findings about the summary? Are they what you expected?

My expectation was that there would be a correlation between being featured in the textbooks and having been featured in exhibitions at MoMA. I expected famous artists to be featured both in the books and in exhibitions at MoMA. From the data above, however, my assumption might not be accurate: there is no apparent trend among the most featured artists. For example, Pablo Picasso is the most featured artist in the books and has had 36 exhibitions at MoMA, but Eugène Delacroix, who is also heavily featured in the books, has had only one. Scatter plots will help us answer this research question further.

Make at least two plots that help you answer your question on the transformed or summarized data.
# Make a data frame with one row per artist: mean space ratio occupied
# in the books and total number of MoMA exhibitions.

artists2 <- artists1 %>%
  group_by(artist_unique_id) %>%
  summarize(
    mean_space_ratio_per_page_total = mean(space_ratio_per_page_total),
    max_moma_exhibitions = max(moma_count_to_year)
  ) %>%
  arrange(desc(mean_space_ratio_per_page_total))


# Making the  scatter plot

plot <- ggplot(
  artists2,
  aes(x = mean_space_ratio_per_page_total, y = max_moma_exhibitions)
) +
  geom_point(size = 1.5, color = "red", shape = 17) +
  stat_smooth(method = "lm") +  # linear fit with a confidence band
  labs(
    title = "Scatter plot of max no of exhibitions VS mean space ratio\noccupied in the textbooks",
    x = "Mean space ratio occupied in Janson and Gardner",
    y = "Max number of exhibitions at MoMA",
    caption = "Source: Art History data from Tidytuesday"
  )

plot +
  theme_bw() +
  theme(axis.text.x = element_text())
## `geom_smooth()` using formula = 'y ~ x'

# Make a data frame of per-artist summaries (mean space ratio occupied in
# the books and total number of MoMA exhibitions) with artist_name and
# year_cat carried along, so the points can be faceted by period.
# The mean and max are computed over all of an artist's editions.

artists3 <- artists1 %>%
  group_by(artist_unique_id) %>%
  summarize(
    mean_space_ratio_per_page_total = mean(space_ratio_per_page_total),
    max_moma_exhibitions = max(moma_count_to_year),
    artist_name,
    year_cat
  ) %>%
  arrange(desc(mean_space_ratio_per_page_total))
## `summarise()` has grouped output by 'artist_unique_id'. You can override using
## the `.groups` argument.
# Making the  scatter plot

plot1 <- ggplot(
  artists3,
  aes(x = mean_space_ratio_per_page_total, y = max_moma_exhibitions)
) +
  geom_point(size = 1, color = "red", shape = 17) +
  stat_smooth(method = "lm") +
  facet_wrap("year_cat") +  # one panel per 20-year period
  labs(
    title = "Scatter plot of max no of exhibitions VS mean space ratio\noccupied in the textbooks",
    x = "Mean space ratio occupied in Janson and Gardner",
    y = "Max number of exhibitions at MoMA",
    caption = "Source: Art History data from Tidytuesday"
  )

plot1 +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))
## `geom_smooth()` using formula = 'y ~ x'
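
A scatter plot shows the relationship visually; a correlation coefficient would put a number on it. The sketch below computes a Spearman rank correlation per period on a made-up tibble. The choice of Spearman (because the exhibition counts are heavily skewed) is an assumption for illustration, and `toy` and its values are hypothetical rather than results from the artists data:

```r
library(dplyr)

# Hypothetical per-artist summaries in two periods
toy <- tibble(
  year_cat = rep(c("1981-2000", "2001-2020"), each = 4),
  mean_space_ratio_per_page_total = c(0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8),
  max_moma_exhibitions = c(0, 1, 3, 10, 9, 4, 2, 0)
)

# One rank correlation per period, with the number of points behind it
rho_by_period <- toy %>%
  group_by(year_cat) %>%
  summarize(
    spearman_rho = cor(
      mean_space_ratio_per_page_total,
      max_moma_exhibitions,
      method = "spearman"
    ),
    n_artists = n()
  )

rho_by_period
```

For a significance test on a single period, `cor.test()` with `method = "spearman"` could be applied to the corresponding subset.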

Final Summary (10 points)


Summarize your research question and findings below. Are your findings what you expected? Why or Why not?

My research question was whether there is a correlation between the space artists occupy in Janson’s History of Art and Gardner’s Art Through the Ages and their having been exhibited at the Museum of Modern Art (MoMA). I was expecting to see a correlation. It seems logical to assume that famous artists will be featured heavily in both MoMA exhibitions and the books. But the data do not support my hypothesis. Both the summarized data and the scatter plots show no consistent association between having been exhibited at MoMA and the space occupied in the Gardner and Janson textbooks. Similarly, no association between the two was seen when the data were split by time period. One explanation might be that MoMA features lesser-known artists more often. Since the museum tends to feature the most progressive tendencies in modern art, this is perhaps the most logical explanation.