Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

The Data

This data set was downloaded from kaggle.com [https://www.kaggle.com/datasets/rtatman/animal-bites]. It records over 9,000 bites that occurred near Louisville, Kentucky between 1985 and 2017, and consists of 15 variables. The following are the names and descriptions of the variables from the original data set that I will use in this project:

bite_date: The date the bite occurred
SpeciesIDDesc: The species of animal that did the biting
GenderIDDesc: Gender (of the animal)

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Getting bitten by an animal can lead to exposure to rabies and serious injury. This data set of animal bites can be used to learn which animals bite people most often and how many bites occur each year. That information can encourage caution around animals that are known to bite often. For this project, I aim to address the following two research questions:

  1. How does the number of bites differ based on the gender and species of the animal?

  2. Is there a trend in the number of bites over time?

Given your question, what is your expectation about the data?

There are several species and many observations recorded in this data set. In terms of gender, it is difficult to say whether there will be a difference in the number of bites. For species, since many people own dogs and cats, I expect to see a larger share of bites recorded for these animals. For the second question, because of the long period (1985-2017) over which the data were collected, I expect the number of bites to vary over time.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

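# Load the packages used in this analysis
library(dplyr)   # glimpse(), mutate(), filter(), group_by(), summarize()
library(tidyr)   # separate()
library(ggplot2) # plots
library(readxl)  # read_excel()
library(skimr)   # skim()
library(janitor) # clean_names(), tabyl()

# Import the data
Health_AnimalBites <- read_excel("data/Health_AnimalBites.xlsx")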

# Glimpse and skim the data to explore the data set composition and distribution
glimpse(Health_AnimalBites)
## Rows: 9,003
## Columns: 15
## $ bite_date         <chr> "1985-05-05", "1986-02-12", "1987-05-07", "1988-10-0…
## $ SpeciesIDDesc     <chr> "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "DO…
## $ BreedIDDesc       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ GenderIDDesc      <chr> "FEMALE", "UNKNOWN", "UNKNOWN", "MALE", "FEMALE", "U…
## $ color             <chr> "LIG. BROWN", "BRO & BLA", NA, "BLA & BRO", "BLK-WHT…
## $ vaccination_yrs   <dbl> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA…
## $ vaccination_date  <chr> "1985-06-20", NA, NA, NA, NA, NA, "1990-02-13", NA, …
## $ victim_zip        <chr> "40229", "40218", "40219", NA, NA, "40211", "40203",…
## $ AdvIssuedYNDesc   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO"…
## $ WhereBittenIDDesc <chr> "BODY", "BODY", "BODY", "BODY", "BODY", "BODY", "BOD…
## $ quarantine_date   <chr> "1985-05-05", "1986-02-12", "1990-05-07", "1990-10-0…
## $ DispositionIDDesc <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN…
## $ head_sent_date    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ release_date      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ResultsIDDesc     <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN…
      # there are 9,003 rows, and 15 columns
skim(Health_AnimalBites)
Data summary
Name Health_AnimalBites
Number of rows 9003
Number of columns 15
_______________________
Column type frequency:
character 12
logical 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
bite_date                317           0.96   10   10      0      2702           0
SpeciesIDDesc            118           0.99    3    7      0         9           0
GenderIDDesc            2526           0.72    4    7      0         3           0
color                   2576           0.71    2   10      0       713           0
vaccination_date        4888           0.46   10   10      0      2107           0
victim_zip              1838           0.80    4   10      0       233           0
AdvIssuedYNDesc         6438           0.28    2    3      0         2           0
WhereBittenIDDesc        616           0.93    4    7      0         3           0
quarantine_date         6983           0.22   10   10      0       602           0
DispositionIDDesc       7468           0.17    4    8      0         4           0
head_sent_date          8608           0.04   10   10      0       325           0
ResultsIDDesc           7460           0.17    7    8      0         3           0

Variable type: logical

skim_variable  n_missing  complete_rate  mean  count
BreedIDDesc         9003              0   NaN  :
release_date        9003              0   NaN  :

Variable type: numeric

skim_variable    n_missing  complete_rate  mean    sd  p0  p25  p50  p75  p100  hist
vaccination_yrs       5265           0.42  1.45  0.85   1    1    1    1    11  ▇▁▁▁▁
      # there are 12 character variables, 2 logical variables, and 1 numeric variable

If there are any quirks that you have to deal with, such as NA coded as something else or data spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

The cells with missing data are blank, so no additional options are needed when importing the data. The names of the variables should be cleaned. Additionally, the bite_date variable should be split into year, month, and day variables, which are then converted into numeric variables.

We can change these in two steps before transforming the variables:

  1. The names of the variables should be cleaned.
# Clean the names of all columns in the data set
bites_cleaned <- clean_names(Health_AnimalBites)  %>% 
  glimpse()
## Rows: 9,003
## Columns: 15
## $ bite_date            <chr> "1985-05-05", "1986-02-12", "1987-05-07", "1988-1…
## $ species_id_desc      <chr> "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", …
## $ breed_id_desc        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ gender_id_desc       <chr> "FEMALE", "UNKNOWN", "UNKNOWN", "MALE", "FEMALE",…
## $ color                <chr> "LIG. BROWN", "BRO & BLA", NA, "BLA & BRO", "BLK-…
## $ vaccination_yrs      <dbl> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA,…
## $ vaccination_date     <chr> "1985-06-20", NA, NA, NA, NA, NA, "1990-02-13", N…
## $ victim_zip           <chr> "40229", "40218", "40219", NA, NA, "40211", "4020…
## $ adv_issued_yn_desc   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ where_bitten_id_desc <chr> "BODY", "BODY", "BODY", "BODY", "BODY", "BODY", "…
## $ quarantine_date      <chr> "1985-05-05", "1986-02-12", "1990-05-07", "1990-1…
## $ disposition_id_desc  <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKN…
## $ head_sent_date       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ release_date         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ results_id_desc      <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKN…
  2. Split the variable bite_date into year, month, and day variables. Then convert them into numeric variables.
# Separate the date variable needed for analysis: bite_date
# The new columns are converted to numeric afterwards if they aren't already

bites_cleaned <- bites_cleaned %>% 
  
  separate(col = bite_date,
           into = c("bite_year", "bite_month", "bite_day"),
           sep = "-",
           remove = FALSE)

  # View a few observations to confirm that the variable was properly split
bites_cleaned  %>% 
    select(bite_date, bite_year, bite_month, bite_day) %>% # just show these columns
  slice(1:20) # show first 20 rows
## # A tibble: 20 × 4
##    bite_date  bite_year bite_month bite_day
##    <chr>      <chr>     <chr>      <chr>   
##  1 1985-05-05 1985      05         05      
##  2 1986-02-12 1986      02         12      
##  3 1987-05-07 1987      05         07      
##  4 1988-10-02 1988      10         02      
##  5 1989-08-29 1989      08         29      
##  6 1989-11-24 1989      11         24      
##  7 1990-02-08 1990      02         08      
##  8 1990-02-22 1990      02         22      
##  9 1990-08-02 1990      08         02      
## 10 1990-08-19 1990      08         19      
## 11 1990-08-31 1990      08         31      
## 12 1990-10-20 1990      10         20      
## 13 1991-02-09 1991      02         09      
## 14 1991-07-05 1991      07         05      
## 15 1991-09-14 1991      09         14      
## 16 1991-10-09 1991      10         09      
## 17 1991-11-07 1991      11         07      
## 18 1992-02-08 1992      02         08      
## 19 1992-02-27 1992      02         27      
## 20 1992-03-06 1992      03         06
  # Check the class of the variable
class(bites_cleaned$bite_year)
## [1] "character"
class(bites_cleaned$bite_month)
## [1] "character"
class(bites_cleaned$bite_day)
## [1] "character"
  # Since it is a character, we convert it to the desired numeric format
bites_cleaned$bite_year <- as.numeric(bites_cleaned$bite_year)
bites_cleaned$bite_month <- as.numeric(bites_cleaned$bite_month)
bites_cleaned$bite_day <- as.numeric(bites_cleaned$bite_day)

  # Confirm that it was successfully changed to numeric
class(bites_cleaned$bite_year)
## [1] "numeric"
class(bites_cleaned$bite_month)
## [1] "numeric"
class(bites_cleaned$bite_day)
## [1] "numeric"
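
As an aside, the same year, month, and day columns could be derived in a single step. The sketch below is not the method used above; it runs on a small made-up tibble (`toy_bites` is a hypothetical name) and shows two options: `separate()` with `convert = TRUE` from tidyr, or parsing the date once with lubridate and extracting the parts.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy data standing in for bites_cleaned (values are made up for illustration)
toy_bites <- tibble(bite_date = c("1985-05-05", "2015-10-02", NA))

# Option 1: let separate() convert the new columns to integers
split_dates <- toy_bites %>%
  separate(col = bite_date,
           into = c("bite_year", "bite_month", "bite_day"),
           sep = "-",
           remove = FALSE,
           convert = TRUE)

# Option 2: parse the date once, then extract the parts
parsed_dates <- toy_bites %>%
  mutate(bite_date  = ymd(bite_date),
         bite_year  = year(bite_date),
         bite_month = month(bite_date),
         bite_day   = day(bite_date))
```

Option 2 also turns bite_date itself into a Date, which makes later date comparisons straightforward.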

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Due to the long period of time for which the data was collected, creating a decade variable will facilitate analysis. The variable will be based on the range of the data and the bite_date variable to answer the research questions.

  1. First determine the range of the data using summarize().
# determine the range of bite_year
bites_cleaned %>%
  arrange(bite_year) %>% # use this to confirm that the output is correct by viewing the table
    summarize(min(bite_year, na.rm = TRUE), max(bite_year, na.rm = TRUE)) 
## # A tibble: 1 × 2
##   `min(bite_year, na.rm = TRUE)` `max(bite_year, na.rm = TRUE)`
##                            <dbl>                          <dbl>
## 1                           1952                           5013
# the range identified the max as "5013", which is beyond the present day, so we filter to identify possible typos or misread dates
          # load lubridate for the ymd() date parser
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
          # mutate and filter
bites_cleaned %>% mutate(bite_date = ymd(bite_date)) %>% filter(bite_date > ymd("2023-01-01"))
## # A tibble: 5 × 18
##   bite_date  bite_year bite_month bite_day speci…¹ breed…² gende…³ color vacci…⁴
##   <date>         <dbl>      <dbl>    <dbl> <chr>   <lgl>   <chr>   <chr>   <dbl>
## 1 2101-02-18      2101          2       18 CAT     NA      FEMALE  BLACK      NA
## 2 5013-07-15      5013          7       15 DOG     NA      FEMALE  WHITE       1
## 3 2201-01-21      2201          1       21 CAT     NA      MALE    GRAY       NA
## 4 2201-02-21      2201          2       21 DOG     NA      MALE    TAN …       1
## 5 2201-05-01      2201          5        1 DOG     NA      MALE    BROWN       1
## # … with 9 more variables: vaccination_date <chr>, victim_zip <chr>,
## #   adv_issued_yn_desc <chr>, where_bitten_id_desc <chr>,
## #   quarantine_date <chr>, disposition_id_desc <chr>, head_sent_date <chr>,
## #   release_date <lgl>, results_id_desc <chr>, and abbreviated variable names
## #   ¹​species_id_desc, ²​breed_id_desc, ³​gender_id_desc, ⁴​vaccination_yrs
# remove all rows/observations with bite years of 2023 or later
bites_cleaned <- bites_cleaned %>%
  filter(bite_year < 2023) # compare with a number, not the string '2023'
          # remaining observations = 8681; filter() also drops the 317 rows with a missing bite_date

# confirm that the rows were removed (check bite_date, parsed as a date)
bites_cleaned %>% 
  filter(ymd(bite_date) > ymd("2023-01-01")) 
## # A tibble: 0 × 18
## # … with 18 variables: bite_date <chr>, bite_year <dbl>, bite_month <dbl>,
## #   bite_day <dbl>, species_id_desc <chr>, breed_id_desc <lgl>,
## #   gender_id_desc <chr>, color <chr>, vaccination_yrs <dbl>,
## #   vaccination_date <chr>, victim_zip <chr>, adv_issued_yn_desc <chr>,
## #   where_bitten_id_desc <chr>, quarantine_date <chr>,
## #   disposition_id_desc <chr>, head_sent_date <chr>, release_date <lgl>,
## #   results_id_desc <chr>
bites_cleaned%>% 
  summarize(min(bite_year, na.rm = TRUE), max(bite_year, na.rm = TRUE)) 
## # A tibble: 1 × 2
##   `min(bite_year, na.rm = TRUE)` `max(bite_year, na.rm = TRUE)`
##                            <dbl>                          <dbl>
## 1                           1952                           2021

The data for bite_date ranges from 1952 through 2021.
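
Two details of this kind of filtering are easy to miss: `filter()` drops rows where the condition evaluates to NA, and comparing a number with a quoted string compares them as text. A minimal sketch on made-up values (the tibble `toy_years` is hypothetical):

```r
library(dplyr)

# Made-up years: two plausible, one typo, one missing
toy_years <- tibble(bite_year = c(1985, 2015, 5013, NA))

# filter() keeps only rows where the condition is TRUE,
# so the row with a missing year is dropped along with the typo
plausible <- toy_years %>%
  filter(bite_year < 2023)

# To keep rows with a missing year, say so explicitly
plausible_or_missing <- toy_years %>%
  filter(is.na(bite_year) | bite_year < 2023)

# A number compared with a string is compared as text, not as a number
numeric_cmp <- 999 < 2023    # TRUE
text_cmp    <- 999 < "2023"  # FALSE: the text "999" sorts after "2023"
```

For four-digit years the text comparison happens to give the same answer as the numeric one, but comparing against a number avoids relying on that.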

  2. Create the variable “decade” using mutate().
# using case_when() within mutate(), we can create a categorical variable for each decade within the data set using the bite_year variable
bites_cleaned <- bites_cleaned %>%
  mutate(
    decade = case_when(
      bite_year <= 1959 ~ "50's",
      bite_year >= 1960 & bite_year <= 1969 ~ "60's",
      bite_year >= 1970 & bite_year <= 1979 ~ "70's",
      bite_year >= 1980 & bite_year <= 1989 ~ "80's",
      bite_year >= 1990 & bite_year <= 1999 ~ "90's",
      bite_year >= 2000 & bite_year <= 2009 ~ "2000's",
      bite_year >= 2010 & bite_year <= 2019 ~ "2010's",
      bite_year >= 2020 ~ "2020's")
    )

# convert decade to a factor with levels in chronological order (the result is only printed here, not saved)
bites_cleaned %>%
  mutate(decade = 
           factor(decade, 
                  levels = c("50's", "60's", "70's", "80's", "90's", "2000's", "2010's", "2020's")
                  )
         ) %>%
    # view to confirm order of the categories is correct
  tabyl(decade)
##  decade    n      percent
##    50's    2 0.0002303882
##    60's    0 0.0000000000
##    70's    0 0.0000000000
##    80's    6 0.0006911646
##    90's   36 0.0041469877
##  2000's   17 0.0019582997
##  2010's 8618 0.9927427716
##  2020's    2 0.0002303882
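
The decade bins could also be computed arithmetically instead of being listed one by one, which avoids gaps at the boundaries between decades. A sketch on made-up years, assuming hypothetical column names `decade_start` and `decade_label`:

```r
library(dplyr)

# Made-up years, including values that sit on decade boundaries
toy_years <- tibble(bite_year = c(1952, 1959, 1960, 1999, 2010, 2021))

toy_decades <- toy_years %>%
  mutate(decade_start = (bite_year %/% 10) * 10,      # integer division drops the last digit
         decade_label = paste0(decade_start, "'s"))   # e.g. "1950's", "2010's"
```

The labels produced this way use four-digit decades ("1950's" rather than "50's"), so they sort chronologically without setting factor levels by hand.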

Since the 2010’s account for about 99% of the recorded bites (8,618 of 8,681), I will use only this decade for analysis.

I will create a data set for only the 2010’s.

# Use filter() to subset the bites_cleaned data to the object bites_2010
bites_2010 <- bites_cleaned %>%
     filter(decade == "2010's")
  3. Create a variable to be used to show the number of bites per year between 2010 and 2019.
# use group_by() and mutate() to create the count variable, `num_bites`
bites_2010 <- bites_2010 %>%
    group_by(bite_year) %>% 
  mutate(num_bites = n())
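
Note that `group_by()` followed by `mutate()` repeats the yearly count on every row and leaves the tibble grouped by bite_year. When only one row per year is needed, `count()` is one possible alternative. A sketch on made-up rows (`toy_bites`, `per_row`, and `per_year` are illustrative names):

```r
library(dplyr)

# Made-up rows: three bites in 2010, two in 2011
toy_bites <- tibble(bite_year = c(2010, 2010, 2010, 2011, 2011))

# Same approach as above: the yearly count is attached to every row,
# then ungroup() removes the grouping so later steps are not affected
per_row <- toy_bites %>%
  group_by(bite_year) %>%
  mutate(num_bites = n()) %>%
  ungroup()

# Alternative: one row per year
per_year <- toy_bites %>%
  count(bite_year, name = "num_bites")
```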

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

There were no tables to merge for this data set.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

# use glimpse() to check whether the data is ready for analysis
bites_2010 %>% 
  glimpse() %>% 
  select(1:5, 7, 19, 20) # view only the variables of interest
## Rows: 8,618
## Columns: 20
## Groups: bite_year [9]
## $ bite_date            <chr> "2010-01-01", "2010-01-02", "2010-01-02", "2010-0…
## $ bite_year            <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2…
## $ bite_month           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ bite_day             <dbl> 1, 2, 2, 2, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7, 7, 8, 8…
## $ species_id_desc      <chr> "DOG", "DOG", "DOG", "CAT", "DOG", "DOG", "CAT", …
## $ breed_id_desc        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ gender_id_desc       <chr> "FEMALE", "MALE", "UNKNOWN", "FEMALE", "UNKNOWN",…
## $ color                <chr> "WHT", "BLK-BRN", NA, NA, "BLK", "BRN-WHT", "BLK-…
## $ vaccination_yrs      <dbl> 1, 3, NA, NA, NA, 1, NA, 1, 1, 1, 1, 1, 3, 3, 1, …
## $ vaccination_date     <chr> "2009-10-22", "2008-02-07", NA, NA, NA, "2010-01-…
## $ victim_zip           <chr> "40228", "40291", "40219", "40291", "40216", "400…
## $ adv_issued_yn_desc   <chr> "NO", "NO", "YES", "NO", "NO", "NO", "YES", "NO",…
## $ where_bitten_id_desc <chr> "BODY", "HEAD", "BODY", "BODY", "BODY", "HEAD", "…
## $ quarantine_date      <chr> "2010-01-04", "2010-01-04", "2010-01-04", "2010-0…
## $ disposition_id_desc  <chr> "RELEASED", "RELEASED", "UNKNOWN", "RELEASED", "U…
## $ head_sent_date       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ release_date         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ results_id_desc      <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKN…
## $ decade               <chr> "2010's", "2010's", "2010's", "2010's", "2010's",…
## $ num_bites            <int> 1131, 1131, 1131, 1131, 1131, 1131, 1131, 1131, 1…
## # A tibble: 8,618 × 8
## # Groups:   bite_year [9]
##    bite_date  bite_year bite_month bite_day species_id_…¹ gende…² decade num_b…³
##    <chr>          <dbl>      <dbl>    <dbl> <chr>         <chr>   <chr>    <int>
##  1 2010-01-01      2010          1        1 DOG           FEMALE  2010's    1131
##  2 2010-01-02      2010          1        2 DOG           MALE    2010's    1131
##  3 2010-01-02      2010          1        2 DOG           UNKNOWN 2010's    1131
##  4 2010-01-02      2010          1        2 CAT           FEMALE  2010's    1131
##  5 2010-01-02      2010          1        2 DOG           UNKNOWN 2010's    1131
##  6 2010-01-02      2010          1        2 DOG           FEMALE  2010's    1131
##  7 2010-01-03      2010          1        3 CAT           UNKNOWN 2010's    1131
##  8 2010-01-04      2010          1        4 DOG           FEMALE  2010's    1131
##  9 2010-01-04      2010          1        4 DOG           MALE    2010's    1131
## 10 2010-01-05      2010          1        5 DOG           MALE    2010's    1131
## # … with 8,608 more rows, and abbreviated variable names ¹​species_id_desc,
## #   ²​gender_id_desc, ³​num_bites
# View a table of the number of bites by year to see how the observations are distributed
bites_2010 %>% 
  tabyl(bite_year, num_bites)
##  bite_year 1 1051 1131 1145 1148 1176 1180 801 985
##       2010 0    0 1131    0    0    0    0   0   0
##       2011 0    0    0    0 1148    0    0   0   0
##       2012 0    0    0    0    0    0 1180   0   0
##       2013 0    0    0 1145    0    0    0   0   0
##       2014 0    0    0    0    0 1176    0   0   0
##       2015 0    0    0    0    0    0    0   0 985
##       2016 0 1051    0    0    0    0    0   0   0
##       2017 0    0    0    0    0    0    0 801   0
##       2018 1    0    0    0    0    0    0   0   0

The categories of the decade variable correspond to the values of bite_year, and the table of num_bites by bite_year shows a sensible distribution: each year maps to a single count. Note that 2018 contains only one recorded bite and 2019 contains none. Thus, the data is ready for analysis.

Are the values what you expected for the variables? Why or Why not?

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

To answer the first question, we can use the group_by() and summarize() functions on the gender and species variables to get the number of bites in each category.

# make gender_id_desc a factor variable named "gender"; values not listed in levels (such as UNKNOWN) become NA
bites_2010 <- bites_2010 %>%
  mutate(gender = factor(gender_id_desc, levels = c("FEMALE", "MALE"))) 

# use group_by() and summarize() to answer question # 1
bites_2010 %>%
  group_by(gender) %>%
  summarize(num_bites = n())
## # A tibble: 3 × 2
##   gender num_bites
##   <fct>      <int>
## 1 FEMALE      1979
## 2 MALE        3763
## 3 <NA>        2876
# make species_id_desc a factor variable named "species" (levels stay in default alphabetical order)
bites_2010 <- bites_2010 %>%
  mutate(species = factor(species_id_desc))

# use group_by() and summarize() to count bites by species (second part of question # 1)
bites_2010 %>%
  group_by(species_id_desc) %>%
  summarize(num_bites = n())
## # A tibble: 10 × 2
##    species_id_desc num_bites
##    <chr>               <int>
##  1 BAT                    76
##  2 CAT                  1527
##  3 DOG                  6872
##  4 FERRET                  4
##  5 HORSE                   5
##  6 OTHER                   8
##  7 RABBIT                  3
##  8 RACCOON                21
##  9 SKUNK                   1
## 10 <NA>                  101
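
Because only FEMALE and MALE are listed as factor levels, any other value is converted to NA, so the NA row in the gender table combines UNKNOWN with truly missing values. A minimal base R sketch with made-up values (the object names are illustrative):

```r
# Made-up values, including UNKNOWN and a truly missing entry
gender_raw <- c("FEMALE", "MALE", "UNKNOWN", NA)

# Values not named in `levels` become NA
gender_two <- factor(gender_raw, levels = c("FEMALE", "MALE"))

# Keeping UNKNOWN as its own level separates it from truly missing values
gender_three <- factor(gender_raw, levels = c("FEMALE", "MALE", "UNKNOWN"))

table(gender_two, useNA = "ifany")
table(gender_three, useNA = "ifany")
```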

What are your findings about the summary? Are they what you expected?

After grouping the data by gender (female/male), 1,979 bites were recorded for female animals and 3,763 for male animals. After grouping the data by species (bat/cat/dog/ferret/horse/other/rabbit/raccoon/skunk), the number of bites was highest among dogs (6,872), followed by cats (1,527), bats (76), raccoons (21), others (8), horses (5), ferrets (4), rabbits (3), and skunks (1).

It was surprising to find that male animals were recorded in 1,784 more bites than female animals. However, gender is missing or unknown for 2,876 observations, a large proportion of the data, which could change this distribution. For species, it is not surprising that many more bites were recorded for dogs and cats, since they are very common household pets. Although the proportion is far smaller, it is surprising to see so many recorded bites from bats.
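
To put these counts on a common scale, the shares can be computed directly. The sketch below re-enters the counts from the summary tables above by hand; the object names are illustrative, and the calculation is only a rough check, not part of the original analysis.

```r
# Counts copied from the gender summary table above
gender_counts <- c(FEMALE = 1979, MALE = 3763, MISSING = 2876)

# Share of female and male animals among bites with a known gender
known <- gender_counts[c("FEMALE", "MALE")]
known_share <- round(known / sum(known), 3)

# Share of all 2010's bites with no usable gender
missing_share <- round(gender_counts[["MISSING"]] / sum(gender_counts), 3)

# Dogs and cats as a share of all 2010's bites (counts from the species table)
species_counts <- c(DOG = 6872, CAT = 1527)
species_share <- round(species_counts / sum(gender_counts), 3)

known_share
missing_share
species_share
```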

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

The following plots show the number of bites over the chosen period (the 2010’s), as well as the distribution of bites in each year, first by gender and then by species.

# create a histogram of bite_year filled to gender
gender_bites <- ggplot(bites_2010) +
  
  aes(x = bite_year,
      fill = gender) +

  geom_histogram() +
  
  scale_fill_manual(values = c("orange", "yellow")) +
  
 labs(title = "Number of Bites by Year and Gender",
      x = "Year of Bite Occurrence",
      y = "Number of Bites")

    #output
gender_bites
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# create a histogram of bite_year filled to species
species_bites <- ggplot(bites_2010) +

  aes(x = bite_year,
      fill = species) +
  
  geom_histogram() +
  
  scale_fill_brewer(palette = "Paired") +
  
 labs(title = "Number of Bites by Year and Species",
      x = "Year of Bite Occurrence",
      y = "Number of Bites")

    #output
species_bites
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
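
The `stat_bin()` message appears because `geom_histogram()` defaults to 30 bins, which is more than the number of distinct years in the data. One possible alternative, sketched here on made-up rows, is to treat the year as a discrete variable and use `geom_bar()`, which counts the rows in each year. Setting `binwidth = 1` in `geom_histogram()` would be another option.

```r
library(ggplot2)

# Made-up rows standing in for bites_2010
toy_bites <- data.frame(
  bite_year = c(2010, 2010, 2011, 2011, 2011),
  gender    = c("FEMALE", "MALE", "MALE", "MALE", NA)
)

# One bar per year, stacked by gender; factor() makes the x axis discrete
toy_plot <- ggplot(toy_bites) +
  aes(x = factor(bite_year), fill = gender) +
  geom_bar() +
  labs(title = "Number of Bites by Year and Gender",
       x = "Year of Bite Occurrence",
       y = "Number of Bites")

toy_plot
```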

# create a line graph object for bite_year by num_bites
line_graph <- ggplot(bites_2010) +
  
  aes(x = bite_year,
      y = num_bites) +

  geom_line() +
  
 labs(title = "Number of Bites by Year",
      x = "Year of Bite Occurrence",
      y = "Number of Bites")

    #output
line_graph
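
Because num_bites is repeated on every row of bites_2010, the line above is drawn through thousands of overlapping points. A line graph can also be drawn from a table with one row per year. The sketch below re-enters the yearly counts from the bite_year table above; `yearly` and `yearly_plot` are illustrative names.

```r
library(ggplot2)

# Yearly counts copied from the bite_year table above
yearly <- data.frame(
  bite_year = 2010:2018,
  num_bites = c(1131, 1148, 1180, 1145, 1176, 985, 1051, 801, 1)
)

yearly_plot <- ggplot(yearly) +
  aes(x = bite_year, y = num_bites) +
  geom_line() +
  geom_point() +                             # mark each yearly value
  scale_x_continuous(breaks = 2010:2018) +   # one axis label per year
  labs(title = "Number of Bites by Year",
       x = "Year of Bite Occurrence",
       y = "Number of Bites")

yearly_plot
```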

Final Summary (10 points)

Summarize your research question and findings below.

The first research question concerned how the number of bites differs across the gender and species of the animals in the data set. The histogram of bites per year by gender shows that the pattern across years is similar for female and male animals, while male animals have a noticeably higher number of recorded bites than females overall. The histogram of bites per year by species shows that dogs have an overwhelmingly higher number of recorded bites than other animals. Both observations are confirmed by the summary tables explored earlier in the analysis. Compared with dogs and cats, the rest of the species had very few recorded bites.

To answer the second research question and see the trend more clearly, I plotted another graph focusing solely on the number of bites per year in the 2010’s. The line graph shows that the number of recorded bites was high (roughly 1,100 to 1,200 per year) with little variation between 2010 and 2014. After 2014, the number of bites declined, falling to 801 in 2017. In 2018, there was only one recorded bite, which explains the steep drop at the end of the line. Given that very few bites are recorded in other decades in this data set, it is also possible that other bites occurring in 2018 were simply not recorded.
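
The year-to-year changes behind this trend can be checked with a short calculation. This is a sketch that re-enters the yearly counts from the bite_year table above; the object names are illustrative.

```r
# Yearly counts copied from the bite_year table above
num_bites <- c(`2010` = 1131, `2011` = 1148, `2012` = 1180, `2013` = 1145,
               `2014` = 1176, `2015` = 985,  `2016` = 1051, `2017` = 801,
               `2018` = 1)

# Change from the previous year (named by the later year)
change <- diff(num_bites)

# The same change as a percentage of the previous year's count
pct_change <- round(100 * unname(change) / unname(head(num_bites, -1)), 1)
names(pct_change) <- names(change)

change
pct_change
```

The changes are small through 2014 and become large and mostly negative afterwards, with 2018 standing out because only a single bite is recorded.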

Overall, the information I extracted from this data set helps show which animals bite people most frequently and which other animals people are prone to being bitten by. For the years with many recorded observations (2010-2016), it is worth noting that the number of bites remains relatively consistent, without any surprising spikes.

Are your findings what you expected? Why or Why not?

The number of bites varied greatly across gender and species, as well as over time, as I expected. However, I observed a lot of missing data in the gender variable and some dates that were entered incorrectly, which was surprising. There may be other misclassified observations that I did not come across and that may have skewed the results. Additionally, I did not expect that the bulk of the recorded bites would have occurred between 2010 and 2017.