Midterm

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Research Question:

For this project, I used a dataset from “Tidy Tuesday” called “Art History”. The data is interesting to me first and foremost because I love art. Second, as per “Tidy Tuesday”¹: “The data…assess(es) the demographic representation of artists through editions of Janson’s History of Art and Gardner’s Art Through the Ages, two of the most popular art history textbooks used in the American education system”. Hence, this dataset is rich in artists’ demographic information, which can help answer questions about racism in art and the representation of non-white artists in art textbooks.

My primary research question is:

How are American artists of different races represented in two of the most popular art history textbooks used in the American education system?

My specific question is:

Has the representation of non-white races changed between the period before the year 2000 and the period from 2000 onward? My working assumption is that what I call “reverse racism” in this report has been increasing since the turn of the new millennium. In 2009, the USA inaugurated its first Black president. One might therefore expect to see more representation of other races in art books, as in the nation as a whole. I will explore whether the representation of non-white artists has increased from 2000 onward or not.

I will use a numeric measure of the space that artists of each race take up across all editions of both textbooks. The measure, space_ratio_per_page_total, is the area of both the text and the figure devoted to a particular artist divided by the area of a single page of the respective edition.

Given your question, what is your expectation about the data?

I am expecting that more White American artists are represented in American art books compared to Americans from other races. There is, however, an expectation that the representation of non-white artists increased after the year 2000.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

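# Packages used below (assumed to have been loaded in a setup chunk not shown here):
library(tidyverse); library(janitor); library(skimr); library(visdat)
# Reading in the data manually (direct URL to the Tidy Tuesday repository):
artists <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-17/artists.csv', na = "NA")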

## Rows: 3162 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): artist_name, artist_nationality, artist_nationality_other, artist_g...
## dbl (6): edition_number, year, space_ratio_per_page_total, artist_unique_id,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Save dataset:
artists %>% write_csv(file = "art_history_data.csv")
# OR
# write_csv(artists, file = "art_history_data2023.csv")

#Exploring data:
# View (artists)
glimpse(artists)

## Rows: 3,162
## Columns: 14
## $ artist_name                <chr> "Aaron Douglas", "Aaron Douglas", "Aaron Do…
## $ edition_number             <dbl> 9, 10, 11, 12, 13, 14, 15, 16, 14, 15, 16, …
## $ year                       <dbl> 1991, 1996, 2001, 2005, 2009, 2013, 2016, 2…
## $ artist_nationality         <chr> "American", "American", "American", "Americ…
## $ artist_nationality_other   <chr> "American", "American", "American", "Americ…
## $ artist_gender              <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ artist_race                <chr> "Black or African American", "Black or Afri…
## $ artist_ethnicity           <chr> "Not Hispanic or Latino origin", "Not Hispa…
## $ book                       <chr> "Gardner", "Gardner", "Gardner", "Gardner",…
## $ space_ratio_per_page_total <dbl> 0.3533658, 0.3739470, 0.3032593, 0.3770489,…
## $ artist_unique_id           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 6, 6, 6…
## $ moma_count_to_year         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ whitney_count_to_year      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ artist_race_nwi            <chr> "Non-White", "Non-White", "Non-White", "Non…

skim(artists)

Data summary
Name	artists
Number of rows	3162
Number of columns	14
_______________________
Column type frequency:
character	8
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
artist_name	0	1.00	4	99	413
artist_nationality	0	1.00	3	18	52
artist_nationality_other	0	1.00	5	8	6
artist_gender	0	1.00	3	6	3
artist_race	0	1.00	3	41	6
artist_ethnicity	58	0.98	25	29	2
book	0	1.00	6	7	2
artist_race_nwi	0	1.00	5	9	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
edition_number	1	8.22	4.40	1.00	5.00	8.00	12.00	16.0	▇▇▆▅▆
year	1	1994.24	19.20	1926.00	1986.00	1996.00	2009.00	2020.0	▁▁▃▇▇
space_ratio_per_page_total	1	0.53	0.39	0.09	0.31	0.41	0.59	3.8	▇▁▁▁▁
artist_unique_id	1	201.76	114.18	1.00	108.00	189.00	305.75	413.0	▆▇▇▆▆
moma_count_to_year	1	4.31	7.79	0.00	0.00	1.00	5.00	64.0	▇▁▁▁▁
whitney_count_to_year	1	1.96	5.19	0.00	0.00	0.00	0.00	40.0	▇▁▁▁▁

artists %>% tabyl(artist_nationality)

##  artist_nationality   n      percent
##            American 908 0.2871600253
##           Argentine   1 0.0003162555
##   Armenian-American  10 0.0031625553
##          Australian   7 0.0022137887
##            Austrian  36 0.0113851992
##   Austrian-American   5 0.0015812777
##             Belgian  30 0.0094876660
##           Brazilian   1 0.0003162555
##             British 317 0.1002530044
##            Canadian  14 0.0044275775
##             Chinese   5 0.0015812777
##           Columbian   2 0.0006325111
##           Congolese   5 0.0015812777
##               Cuban   3 0.0009487666
##      Cuban-American   5 0.0015812777
##               Czech   3 0.0009487666
##     Danish-American   6 0.0018975332
##       Danish-French  16 0.0050600886
##               Dutch  50 0.0158127767
##      Dutch-American  18 0.0056925996
##              French 870 0.2751423150
##   French Polynesian   6 0.0018975332
##              German 256 0.0809614168
##     German-American  11 0.0034788109
##       German-French  13 0.0041113219
##           Hungarian  10 0.0031625553
##  Hungarian-American   4 0.0012650221
##    Hungarian-French  10 0.0031625553
##              Indian  13 0.0041113219
##             Iranian   3 0.0009487666
##             Italian  74 0.0234029096
##    Italian-American  10 0.0031625553
##            Japanese  56 0.0177103099
##              Korean   3 0.0009487666
##             Latvian   2 0.0006325111
##             Mexican  52 0.0164452878
##                 N/A  23 0.0072738773
##       New Zealander   4 0.0012650221
##           Norwegian  21 0.0066413662
##  Pakistani-American   3 0.0009487666
##            Peruvian   2 0.0006325111
##          Polynesian   6 0.0018975332
##             Russian  62 0.0196078431
##      Russian-French  16 0.0050600886
##            Scottish  16 0.0050600886
##             Spanish  94 0.0297280202
##             Swedish   5 0.0015812777
##               Swiss  44 0.0139152435
##        Swiss-French   7 0.0022137887
##        Swiss-German  22 0.0069576218
##                Thai   1 0.0003162555
##           Uruguayan   1 0.0003162555

artists %>% tabyl(artist_race)

##                                artist_race    n     percent
##           American Indian or Alaska Native   12 0.003795066
##                                      Asian   79 0.024984187
##                  Black or African American   83 0.026249209
##                                        N/A   29 0.009171410
##  Native Hawaiian or Other Pacific Islander   23 0.007273877
##                                      White 2936 0.928526249

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

vis_dat(artists)

## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the visdat package.
##   Please report the issue at <https://github.com/ropensci/visdat/issues>.

# Values coded as the string "N/A" must be converted to real `NA` before dropping them:

artists <- artists %>%
  mutate(artist_race = na_if(artist_race, "N/A")) %>%
  drop_na(artist_race)

There is some missingness in artist ethnicity, as shown by “skim” and “vis_dat” (n=58). However, it is not a problem because ethnicity is not part of the analysis.

There is also missingness in artist race, which I detected with the “View” and “tabyl” functions. These missing values do not show up in either “skim” or “vis_dat” because they are coded as the string “N/A” rather than as NA. I converted “N/A” to NA, i.e. missing. Since only about 0.9% of race values were missing (n=29, leaving 3,133 of 3,162 rows), I decided to remove them from the analysis.
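The recoding step described above can be checked on a small made-up example. The sketch below uses a hypothetical `toy` tibble (not the real data) to show that `na_if()` turns the literal string "N/A" into a true `NA`, and that counting the missing values before `drop_na()` tells you how many rows will be lost.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for `artists`; the values are made up.
toy <- tibble::tibble(
  artist_name = c("A", "B", "C", "D"),
  artist_race = c("White", "N/A", "Asian", "N/A")
)

# Recode the literal string "N/A" as a real missing value.
toy_na <- toy %>%
  mutate(artist_race = na_if(artist_race, "N/A"))

# Count and share of missing values, checked before deciding to drop them.
n_missing <- sum(is.na(toy_na$artist_race))
share_missing <- mean(is.na(toy_na$artist_race))

# Drop the rows with a missing race.
toy_clean <- toy_na %>% drop_na(artist_race)
```

On the real data, the same two lines for `n_missing` and `share_missing` would give the count and proportion to report alongside the decision to drop rows.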

Make sure your data types are correct!

glimpse(artists)

## Rows: 3,133
## Columns: 14
## $ artist_name                <chr> "Aaron Douglas", "Aaron Douglas", "Aaron Do…
## $ edition_number             <dbl> 9, 10, 11, 12, 13, 14, 15, 16, 14, 15, 16, …
## $ year                       <dbl> 1991, 1996, 2001, 2005, 2009, 2013, 2016, 2…
## $ artist_nationality         <chr> "American", "American", "American", "Americ…
## $ artist_nationality_other   <chr> "American", "American", "American", "Americ…
## $ artist_gender              <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ artist_race                <chr> "Black or African American", "Black or Afri…
## $ artist_ethnicity           <chr> "Not Hispanic or Latino origin", "Not Hispa…
## $ book                       <chr> "Gardner", "Gardner", "Gardner", "Gardner",…
## $ space_ratio_per_page_total <dbl> 0.3533658, 0.3739470, 0.3032593, 0.3770489,…
## $ artist_unique_id           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 6, 6, 6…
## $ moma_count_to_year         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ whitney_count_to_year      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ artist_race_nwi            <chr> "Non-White", "Non-White", "Non-White", "Non…

Data types seem to be correct: string values appear as character and numerical values are doubles.

Character variables include: artist name, artist nationality, artist nationality other, artist gender, artist race, artist ethnicity, book, artist race non-white.
Numerical variables include: edition number, year, space ratio per page total, artist unique id, moma count to year, whitney count to year

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

## Creating a new dataset called "artists_Amer2" in which I am:
# Selecting only artists whose nationality is exactly "American",
# Multiplying the space ratio by 100 (stored as bookrep_mm), and
# Categorizing year as "before_2000" or "2020_andafter" (the latter label covers 2000 to 2020).

artists_Amer2 <- artists %>%
  filter(artist_nationality == "American") %>%
  mutate(bookrep_mm = space_ratio_per_page_total * 100) %>%
  mutate(reverse_racism = case_when(
    year >= 1920 & year < 2000 ~ "before_2000",
    year >= 2000 & year <= 2020 ~ "2020_andafter"
  ))
class(artists_Amer2$reverse_racism)

## [1] "character"

# reverse_racism is character. Change reverse_racism to factor:
artists_Amer2 <- artists_Amer2 %>% mutate(reverse_racism = factor(reverse_racism))
class(artists_Amer2$reverse_racism)

## [1] "factor"

levels(artists_Amer2$reverse_racism)

## [1] "2020_andafter" "before_2000"

#need to re-order levels of reverse_racism: bring "before_2000" first
artists_Amer2 <- artists_Amer2%>% mutate(reverse_racism = reverse_racism %>%
           fct_relevel("before_2000"))
levels(artists_Amer2$reverse_racism)

## [1] "before_2000"   "2020_andafter"

## Exploring variables of interest:
tabyl(artists$year) #the frequency of total artist representation per year (distribution)

##  artists$year   n     percent
##          1926  19 0.006064475
##          1936  47 0.015001596
##          1948  60 0.019150974
##          1959  86 0.027449729
##          1963  62 0.019789339
##          1969  76 0.024257900
##          1970  68 0.021704437
##          1975  84 0.026811363
##          1977  90 0.028726460
##          1980 114 0.036386850
##          1986 253 0.080753272
##          1991 311 0.099265879
##          1995 185 0.059048835
##          1996 156 0.049792531
##          2001 353 0.112671561
##          2005 162 0.051707628
##          2007 163 0.052026811
##          2009 160 0.051069263
##          2011 153 0.048834982
##          2013 173 0.055218640
##          2016 179 0.057133738
##          2020 179 0.057133738

tabyl(artists_Amer2$reverse_racism) # the sample is split fairly evenly between the two periods

##  artists_Amer2$reverse_racism   n   percent
##                   before_2000 417 0.4607735
##                 2020_andafter 488 0.5392265

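Since I am interested only in American artists, I filtered the data to rows whose nationality is recorded as exactly “American” (hyphenated nationalities such as German-American are excluded). I also multiplied space_ratio_per_page_total by 100 and stored the result as bookrep_mm; because the original variable is a ratio of areas, the rescaled value reads as a percentage of one page rather than as an area in mm2. Finally, I divided the timeline at the year 2000, which (softly) demarcates the start of the reverse racism period.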
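The transformation described above can also be written so that the period factor is created with its levels already in display order. This is only a sketch on a hypothetical `toy` tibble: the column `period` and the label "2000_andafter" are illustrative names, not the ones used in this report, and `if_else()` is swapped in for `case_when()` because there are only two categories.

```r
library(dplyr)

# Hypothetical toy data; the column names mimic the real ones, the values are made up.
toy <- tibble::tibble(
  year = c(1926, 1996, 2000, 2001, 2020),
  space_ratio_per_page_total = c(0.35, 0.40, 0.25, 0.30, 0.50)
)

toy_t <- toy %>%
  mutate(
    # Same arithmetic as bookrep_mm: the space ratio multiplied by 100.
    bookrep = space_ratio_per_page_total * 100,
    # Two-level period factor. The levels are listed in display order,
    # so no later fct_relevel() call is needed.
    period = factor(
      if_else(year < 2000, "before_2000", "2000_andafter"),
      levels = c("before_2000", "2000_andafter")
    )
  )

# Sanity check: every year should fall into one of the two periods.
n_unassigned <- sum(is.na(toy_t$period))
```

Checking `n_unassigned` guards against years that fall outside the stated cut points, which `case_when()` would silently turn into `NA`.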

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

artists_Amer2 %>% glimpse

## Rows: 905
## Columns: 16
## $ artist_name                <chr> "Aaron Douglas", "Aaron Douglas", "Aaron Do…
## $ edition_number             <dbl> 9, 10, 11, 12, 13, 14, 15, 16, 2, 3, 4, 7, …
## $ year                       <dbl> 1991, 1996, 2001, 2005, 2009, 2013, 2016, 2…
## $ artist_nationality         <chr> "American", "American", "American", "Americ…
## $ artist_nationality_other   <chr> "American", "American", "American", "Americ…
## $ artist_gender              <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ artist_race                <chr> "Black or African American", "Black or Afri…
## $ artist_ethnicity           <chr> "Not Hispanic or Latino origin", "Not Hispa…
## $ book                       <chr> "Gardner", "Gardner", "Gardner", "Gardner",…
## $ space_ratio_per_page_total <dbl> 0.3533658, 0.3739470, 0.3032593, 0.3770489,…
## $ artist_unique_id           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8…
## $ moma_count_to_year         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1…
## $ whitney_count_to_year      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ artist_race_nwi            <chr> "Non-White", "Non-White", "Non-White", "Non…
## $ bookrep_mm                 <dbl> 35.33658, 37.39470, 30.32593, 37.70489, 39.…
## $ reverse_racism             <fct> before_2000, before_2000, 2020_andafter, 20…

Artist nationality is only “American” now. A new column/variable called “bookrep_mm” has been added. A new column/variable called “reverse_racism” has been added.

Are the values what you expected for the variables? Why or Why not?

Yes. “bookrep_mm” is 100x the value of “space_ratio_per_page_total”, i.e. the space ratio expressed as a percentage of one page. “reverse_racism” is a factor with 2 levels created from the variable “year”.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

## Creating another dataset (artists_Amer2_Race) where American artists are grouped by race:
artists_Amer2_Race <- artists_Amer2 %>%
  group_by(artist_race) %>%
  summarize(mean_bookrep_mm = mean(bookrep_mm, na.rm = TRUE))

# Arrange races in descending order, starting with the highest mean space per page
artists_Amer2_Race %>% arrange(desc(mean_bookrep_mm)) %>% gt::gt()

artist_race	mean_bookrep_mm
American Indian or Alaska Native	50.61391
White	41.02466
Black or African American	39.75356
Native Hawaiian or Other Pacific Islander	35.88631
Asian	25.91803

What are your findings about the summary? Are they what you expected?

The summary shows that American Indian or Alaska Native (AIAN) artists have the highest mean space per page in the art books, followed by White, Black or African American, Native Hawaiian or Other Pacific Islander, and finally Asian American artists. This is not what I expected: I expected white artists to rank first in representation in the 2 art textbooks.
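A group mean is easier to interpret when the number of rows and distinct artists behind it is shown next to it. The sketch below is an illustration on a hypothetical `toy` tibble with made-up values; it extends the same `group_by()` and `summarize()` pattern with `n()` and `n_distinct()`.

```r
library(dplyr)

# Hypothetical toy data; the column names mimic artists_Amer2, the values are made up.
toy <- tibble::tibble(
  artist_unique_id = c(1, 1, 2, 3, 3, 3),
  artist_race = c("White", "White", "White", "Asian", "Asian", "Asian"),
  bookrep_mm = c(40, 50, 30, 20, 25, 30)
)

toy_summary <- toy %>%
  group_by(artist_race) %>%
  summarize(
    n_rows = n(),                              # artist-by-edition rows in the group
    n_artists = n_distinct(artist_unique_id),  # distinct artists in the group
    mean_bookrep_mm = mean(bookrep_mm, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_bookrep_mm))
```

A mean that rests on only a handful of rows or a single artist deserves more caution than one based on hundreds of rows.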

## Creating another dataset (artists_Amer3_Race) where American artists are grouped by race and reverse_racism:
artists_Amer3_Race <- artists_Amer2 %>%
  group_by(artist_race, reverse_racism) %>%
  summarize(mean_bookrep_mm = mean(bookrep_mm, na.rm = TRUE))

## `summarise()` has grouped output by 'artist_race'. You can override using the
## `.groups` argument.

# Arrange in descending order of mean space per page, split by period (before 2000 vs. 2000 and after)
artists_Amer3_Race %>% arrange(desc(mean_bookrep_mm)) %>% gt::gt()

artist_race	reverse_racism	mean_bookrep_mm
American Indian or Alaska Native	2020_andafter	51.70879
American Indian or Alaska Native	before_2000	48.69787
White	before_2000	42.03978
White	2020_andafter	40.07465
Black or African American	2020_andafter	40.01563
Black or African American	before_2000	39.02251
Native Hawaiian or Other Pacific Islander	2020_andafter	35.88631
Asian	2020_andafter	25.91803

What are your findings about the summary? Are they what you expected?

More information is revealed when the data are stratified by period (before 2000 vs. 2000 and after).

Mean space per page for American Indian or Alaska Native artists increased after the year 2000 (48.7 to 51.7). This group still has the highest mean in the art books.
American artists of Asian race do not appear at all before 2000. They appear after 2000, but with the lowest mean space per page of any group (25.9).
Black or African American artists also have slightly more space per page after the year 2000 (39.0 to 40.0).
Native Hawaiian or Other Pacific Islander artists are represented in the art books only after 2000.
Mean space per page for White artists has decreased since 2000 (42.0 to 40.1).

These findings are what I expected. With reverse racism happening since the turn of the millennium, more racial rights have been acquired and racism is becoming less pronounced. I am glad to see that happen in art books too!
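The comparison above is based on mean space per page. A complementary way to look at representation, sketched here as an assumption rather than as part of the original analysis, is the share of textbook entries in each period that belong to non-white artists. The `toy` tibble below is hypothetical and its values are made up; only the column names `reverse_racism` and `artist_race_nwi` are taken from the data.

```r
library(dplyr)

# Hypothetical toy data; one row per artist-by-edition entry.
toy <- tibble::tibble(
  reverse_racism = c(rep("before_2000", 4), rep("2020_andafter", 4)),
  artist_race_nwi = c("White", "White", "White", "Non-White",
                      "White", "White", "Non-White", "Non-White")
)

toy_share <- toy %>%
  count(reverse_racism, artist_race_nwi) %>%  # entries per period and group
  group_by(reverse_racism) %>%
  mutate(share = n / sum(n)) %>%              # share within each period
  ungroup()
```

In this toy example the non-white share rises from 0.25 to 0.50 between the two periods; the real data would need to be run through the same steps to see whether it shows a similar pattern.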

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

# Figure 1: Scatter plot: mean space per page, by artist's race and period
ggplot(artists_Amer3_Race) +
  aes(x = artist_race, y = mean_bookrep_mm, color = reverse_racism) +
  geom_point(alpha = 0.5) +
  labs(title = "American Artists: Mean Space per Page by Artist's Race",
       y = "Mean Space per Page (ratio x 100)",
       x = "Artist Race", color = "Reverse Racism") +
  theme(axis.text.x = element_text(angle = 90))

ggsave("Figure1.Midterm.jpg")

## Saving 7 x 6 in image

# Figure 2: Boxplot: space per page by race, faceted by reverse racism period
ggplot(artists_Amer2) +
  aes(x = artist_race, y = bookrep_mm, fill = artist_race) +
  geom_boxplot(alpha = 0.2) +
  facet_wrap(vars(reverse_racism)) +
  labs(title = "American Artists: Space per Page by Race",
       y = "Space per Page (ratio x 100)", x = "Artist Race", fill = "Artist Race") +
  theme(axis.text.x = element_blank())  # races are identified in the legend instead

ggsave("Figure2.Midterm.jpg")

## Saving 7 x 5 in image
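Because the factor level `2020_andafter` appears verbatim as a facet title, the facets can be given reader-friendly labels without changing the data. The sketch below assumes a hypothetical `toy` data frame with made-up values and uses `as_labeller()` with a named lookup vector; `period_labels` is an illustrative name.

```r
library(ggplot2)

# Hypothetical toy data; the column names mimic artists_Amer2, the values are made up.
toy <- data.frame(
  artist_race = c("White", "White", "Asian", "White", "Asian", "Asian"),
  bookrep_mm = c(40, 45, 25, 38, 27, 30),
  reverse_racism = factor(
    c("before_2000", "before_2000", "2020_andafter",
      "2020_andafter", "2020_andafter", "2020_andafter"),
    levels = c("before_2000", "2020_andafter")
  )
)

# Lookup table from factor level to display label (illustrative wording).
period_labels <- c(before_2000 = "Before 2000", `2020_andafter` = "2000 and after")

p <- ggplot(toy) +
  aes(x = artist_race, y = bookrep_mm, fill = artist_race) +
  geom_boxplot(alpha = 0.2) +
  facet_wrap(vars(reverse_racism), labeller = as_labeller(period_labels)) +
  labs(x = "Artist Race", y = "Space per Page (ratio x 100)", fill = "Artist Race")
```

Only the displayed facet titles change; the underlying factor levels, and any summaries computed from them, stay as they are.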

Final Summary (10 points)

Summarize your research question and findings below.

This analysis was conducted to explore the representation of non-white races in Janson’s History of Art and Gardner’s Art Through the Ages, two of the most popular art history textbooks used in the American education system. Representation is measured as the area of both the text and the figure devoted to a particular artist divided by the area of a single page of the respective edition (multiplied by 100 in this analysis). I stratified the findings by 2 periods (before 2000, and 2000 and after) to see whether racial representation changed over the years, especially with reverse racism pronounced in the 2000s.

Overall, mean space per page was highest for American Indian or Alaska Native artists, followed by White Americans, Black Americans, and Native Hawaiian or Other Pacific Islander artists, with the least representation for American artists of Asian race.

When artist representation was further stratified by period, I found that 2 races were not represented at all before 2000 (Native Hawaiian or Other Pacific Islander and Asian). There was an increase in representation of American Indian or Alaska Native and Black artists after 2000 and a decrease in representation of white artists.

Are your findings what you expected? Why or Why not?

The first part of the results (AIANs represented more than white artists) is not exactly what I expected. I must also say that I used data only on artists whose nationality is recorded as American. American artists of another origin (e.g. German-American, Italian-American, etc.) were not included in this analysis. Had they been included, we might have seen other results (but that is another research question). However, the second part of my findings is exciting. It comes as a nice surprise to see more racial diversity after the year 2000 in the two most important art textbooks used in the American education system.

¹ https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-17/readme.md

Midterm

Hoda Mohammed

2023-02-17
