Midterm (Due Sunday 2/19/2023 at 11:55 pm)

I am using the “Pet Cats UK” dataset, available for download from Tidy Tuesday (https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-31/readme.md), originally from “Movebank for Animal Tracking Data.” The following is a description of the data from the Tidy Tuesday article:

Between 2013 and 2017, Roland Kays et al. convinced hundreds of volunteers in the U.S., U.K., Australia, and New Zealand to strap GPS sensors on their pet cats. The aforelinked datasets include each cat’s characteristics (such as age, sex, neuter status, hunting habits) and time-stamped GPS pings.

We are focusing on the data from the UK since that was featured in the Tidy Tuesday package.

Citations for the original article: Kays R, Dunn RR, Parsons AW, Mcdonald B, Perkins T, Powers S, Shell L, McDonald JL, Cole H, Kikillus H, Woods L, Tindle H, Roetman P (2020) The small home ranges and large local ecological impacts of pet cats. Animal Conservation. doi:10.1111/acv.12563

… and the Movebank data package: McDonald JL, Cole H (2020) Data from: The small home ranges and large local ecological impacts of pet cats [United Kingdom]. Movebank Data Repository. doi:10.5441/001/1.pf315732

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

This dataset, which is actually in two parts, gives both “demographic” information about each cat (cats_uk_reference) and the GPS log data for those same cats (cats_uk). The reference dataset has many interesting variables for each cat, including age, sex, hours spent indoors, type of food, etc. The most interesting variable to me is prey_p_month, which is each owner’s report of how many prey their cats bring home each month.

As a pet parent myself, I am naturally interested in this type of data. My cat, Archimedes, only goes outside when supervised on a leash, so he is not allowed to hunt. He lives indoors not only for his own safety but also to minimize the impact on local wildlife/birds. I’ve heard that cats allowed to go outside do indeed kill many birds (and apparently it’s not always for sustenance, but for sport… eep!). I am also interested in cat tracking in general, as I recently was debating between a GPS and RF device to locate my cat in case he were to get lost (I ultimately chose the RF tracker for unrelated reasons).

This being said, I want to explore what factors might be related to the number of prey a cat brings home each month. Is it random, or does it seem to be associated with another factor? Since I am exploring the dataset, I’d like to look at how a few different variables are each (separately) related to prey_p_month:

  1. Is the typical time of day each cat hunts associated with how many prey they get per month? (modified from timestamp in cats_uk)

  2. Do cats who get wet food bring home fewer prey per month? (from food_wet in cats_uk_reference)

  3. Is there a difference in prey per month between male and female cats? (from animal_sex in cats_uk_reference)

  4. Is there a relationship between prey per month and number of hours the cat spends indoors? (from hrs_indoors in cats_uk_reference)

Given your question, what is your expectation about the data?

  1. I expect that there could be some association between time of day for hunting and how many prey are brought home per month, but I have no idea what time of day would be associated with more or fewer prey.

  2. I suspect that if a cat is given wet food they might be less inclined to bring home as many prey (unless, of course, it’s just for fun…).

  3. I think perhaps male cats would bring home more prey, but that’s just based on hearsay and stereotypes.

  4. I would think that cats who spend more time indoors would bring home fewer prey overall (unless they get super efficient? haha).

(I also realize that it’s quite likely there will be no discernible associations between some of these variables. I’m just curious and having fun!)

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

(Please note: Since my data is accessible through the tidytuesdayR package, I am not directly downloading the dataset and therefore am not able to upload it into the data directory. I also used the code listed here (https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-31/readme.md#get-the-data-here) to read in these datasets.)

#install.packages("tidytuesdayR")
library(tidyverse) #provides dplyr (glimpse(), %>%, joins) and ggplot2, which are used throughout
tuesdata <- tidytuesdayR::tt_load('2023-01-31')
## --- Compiling #TidyTuesday Information for 2023-01-31 ----
## --- There are 2 files available ---
## --- Starting Download ---
## 
##  Downloading file 1 of 2: `cats_uk.csv`
##  Downloading file 2 of 2: `cats_uk_reference.csv`
## --- Download complete ---
cats_uk <- tuesdata$cats_uk
cats_uk_reference <- tuesdata$cats_uk_reference

glimpse(cats_uk)
## Rows: 18,215
## Columns: 11
## $ tag_id                   <chr> "Ares", "Ares", "Ares", "Ares", "Ares", "Ares…
## $ event_id                 <dbl> 3395610551, 3395610552, 3395610553, 339561055…
## $ visible                  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ timestamp                <dttm> 2017-06-24 01:03:57, 2017-06-24 01:11:20, 20…
## $ location_long            <dbl> -5.113851, -5.113851, -5.113730, -5.113774, -…
## $ location_lat             <dbl> 50.17032, 50.17032, 50.16988, 50.16983, 50.17…
## $ ground_speed             <dbl> 684, 936, 2340, 0, 4896, 504, 108, 504, 252, …
## $ height_above_ellipsoid   <dbl> 154.67, 154.67, 81.35, 67.82, 118.03, 123.07,…
## $ algorithm_marked_outlier <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ manually_marked_outlier  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ study_name               <chr> "Pet Cats United Kingdom", "Pet Cats United K…
glimpse(cats_uk_reference)
## Rows: 101
## Columns: 16
## $ tag_id                        <chr> "Tommy-Tag", "Athena", "Ares", "Lola", "…
## $ animal_id                     <chr> "Tommy", "Athena", "Ares", "Lola", "Mave…
## $ animal_taxon                  <chr> "Felis catus", "Felis catus", "Felis cat…
## $ deploy_on_date                <dttm> 2017-06-03 01:02:09, 2017-06-24 01:02:1…
## $ deploy_off_date               <dttm> 2017-06-10 02:10:52, 2017-06-30 23:59:3…
## $ hunt                          <lgl> TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, …
## $ prey_p_month                  <dbl> 12.5, 3.0, 0.0, 3.0, 3.0, 3.0, 3.0, 17.5…
## $ animal_reproductive_condition <chr> "Neutered", "Spayed", "Neutered", "Spaye…
## $ animal_sex                    <chr> "m", "f", "m", "f", "m", "f", "m", "m", …
## $ hrs_indoors                   <dbl> 12.5, 7.5, 7.5, 17.5, 12.5, 12.5, 12.5, …
## $ n_cats                        <dbl> 2, 2, 2, 1, 1, 2, 3, 4, 2, 2, 1, 1, 1, 2…
## $ food_dry                      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ food_wet                      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ food_other                    <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, …
## $ study_site                    <chr> "UK", "UK", "UK", "UK", "UK", "UK", "UK"…
## $ age_years                     <dbl> 11, 3, 3, 10, 7, 7, 6, 2, 4, 4, 8, 1, 8,…

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Fortunately, the datasets provided on Tidy Tuesday were already cleaned, so the data types are correct and NAs are coded correctly (yay!). However, the data comes in two tables, and I want to use the information contained in timestamp in the cats_uk dataset. This one is tricky, since essentially I just want to figure out the time of day each cat was hunting. Each cat has MANY timestamps, each of which also includes the date. Since this is exploratory (and the cats were all tracked during roughly the same time period, June - November), all I really want is the approximate time of day the cat is typically hunting (i.e., morning, afternoon, evening, night). I will have to figure out how to average the timestamps for each cat, then split apart the averaged timestamps so that I only get the time, then change the time into a format that I can categorize into “morning, afternoon, evening, night”! And then, I will need to join the two tables so that I can compare the typical time of day each cat hunts with the number of prey each cat brings home per month…
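A caveat about averaging full timestamps is worth illustrating. Because each timestamp includes the date, the mean is a midpoint in calendar time, and its clock time does not have to resemble the clock times of the individual pings. The sketch below uses two made-up pings (not taken from cats_uk) to show the effect, along with one possible alternative: extract the hour of day from each ping first, then average.

```r
# Toy example (made-up pings, not from cats_uk):
# one cat pinged at 08:00 on two consecutive days
pings <- as.POSIXct(c("2017-06-24 08:00:00", "2017-06-25 08:00:00"), tz = "UTC")

# Averaging the full timestamps gives the midpoint in calendar time
avg_full <- mean(pings)
format(avg_full, "%H:%M")  # "20:00", even though both pings were at 08:00

# Alternative: extract the hour of day from each ping first, then average
hour_of_day <- as.numeric(format(pings, "%H")) + as.numeric(format(pings, "%M")) / 60
mean(hour_of_day)          # 8
```

A plain mean of hours still mishandles activity that straddles midnight (23:00 and 01:00 average to 12:00), so it is only a rough improvement.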

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

cats_avg_time <- cats_uk %>% 
  group_by(tag_id) %>% #group by the tag_id, then create a new variable which gives the mean of each tag's list of timestamps
  mutate(avg_time = mean(timestamp)) %>% #mean() dispatches to the POSIXct method, so there is no need to call mean.POSIXct() directly
  select(tag_id, avg_time) %>% #we only want the tag_id and the mean timestamp, so select only those variables
  distinct(tag_id, avg_time) #now there are a bunch of repeats for each tag, so we only want one iteration of each tag and its accompanying average timestamp

#adapted from: https://stackoverflow.com/questions/35911966/taking-the-average-of-posix-times-in-y-m-d-hms-format

cat_times <- cats_avg_time %>% 
  mutate(time_simple = format(avg_time, format = "%H%M")) %>% #keep only the hour and minute (e.g., "0845") from the average timestamp
  mutate(time_simple = as.numeric(time_simple)) #change the data type from character (default) to numeric (e.g., 845), so we can work with it

#adapted from: https://www.geeksforgeeks.org/how-to-separate-date-and-time-in-r/

cat_times #check to make sure our time_simple variable is numeric!
## # A tibble: 101 × 3
## # Groups:   tag_id [101]
##    tag_id       avg_time            time_simple
##    <chr>        <dttm>                    <dbl>
##  1 Ares         2017-06-27 08:45:12         845
##  2 Athena       2017-06-26 11:19:07        1119
##  3 Lola         2017-06-28 13:40:17        1340
##  4 Jago         2017-07-01 08:24:54         824
##  5 Maverick     2017-06-30 07:20:17         720
##  6 Charlie      2017-07-02 03:55:51         355
##  7 Coco         2017-07-02 04:34:45         434
##  8 Friday       2017-07-04 19:54:28        1954
##  9 Meg-Tag      2017-07-05 12:01:20        1201
## 10 Morpheus-Tag 2017-07-05 11:05:12        1105
## # … with 91 more rows
cat_times <- cat_times %>% 
  mutate(part_of_day = case_when( #split the time_simple into categories for parts of the day
    (time_simple < 500) | (time_simple >= 2100) ~ "Night",
    (time_simple >= 500) & (time_simple < 1200) ~ "Morning", #upper bound is 1200 (not 1159) so that 11:59 is not left as NA
    (time_simple >= 1200) & (time_simple < 1700) ~ "Afternoon",
    (time_simple >= 1700) & (time_simple < 2100) ~ "Evening"
  ))

cat_egories <- cat_times %>% 
  select(tag_id, part_of_day) #before joining, we only want the tag_id and the new part_of_day variable

cat_egories #make sure the variables we want are showing up correctly
## # A tibble: 101 × 2
## # Groups:   tag_id [101]
##    tag_id       part_of_day
##    <chr>        <chr>      
##  1 Ares         Morning    
##  2 Athena       Morning    
##  3 Lola         Afternoon  
##  4 Jago         Morning    
##  5 Maverick     Morning    
##  6 Charlie      Night      
##  7 Coco         Night      
##  8 Friday       Evening    
##  9 Meg-Tag      Afternoon  
## 10 Morpheus-Tag Morning    
## # … with 91 more rows
full_cats <- cats_uk_reference %>% 
  left_join(y = cat_egories,
            by = "tag_id") #join the two tables together! (rationale to follow)

study_cats <- full_cats %>% 
  select(animal_id, hunt, prey_p_month, animal_sex, hrs_indoors, food_wet, age_years, part_of_day) %>% #create new data frame that has only our variables of interest
  filter(hunt == TRUE) #keep only the cats that hunt; this also drops cats where hunt is NA

study_cats #take a look at our pretty new dataset!
## # A tibble: 81 × 8
##    animal_id hunt  prey_p_month animal_sex hrs_indoors food_wet age_ye…¹ part_…²
##    <chr>     <lgl>        <dbl> <chr>            <dbl> <lgl>       <dbl> <chr>  
##  1 Tommy     TRUE          12.5 m                 12.5 TRUE           11 Morning
##  2 Athena    TRUE           3   f                  7.5 TRUE            3 Morning
##  3 Lola      TRUE           3   f                 17.5 TRUE           10 Aftern…
##  4 Maverick  TRUE           3   m                 12.5 TRUE            7 Morning
##  5 Coco      TRUE           3   f                 12.5 TRUE            7 Night  
##  6 Charlie   TRUE           3   m                 12.5 TRUE            6 Night  
##  7 Jago      TRUE          17.5 m                  7.5 TRUE            2 Morning
##  8 Morpheus  TRUE           3   m                  2.5 FALSE           4 Morning
##  9 Nettle    TRUE           7.5 f                 12.5 FALSE           4 Evening
## 10 Meg       TRUE           3   f                 12.5 TRUE            8 Aftern…
## # … with 71 more rows, and abbreviated variable names ¹​age_years, ²​part_of_day

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

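For this data, we use left_join, with cats_uk_reference as the left table. The reference table holds prey_p_month and the other variables of interest for each tagged cat, so it should decide which rows end up in the result: left_join keeps every row of cats_uk_reference and attaches part_of_day by matching on tag_id, filling in NA for any cat that has no GPS data. An inner_join would silently drop any such cats, and a right_join would instead keep every tag from the GPS table, even one with no reference information. Both tables contain 101 tag_id values, so if every tag matches, all three joins return the same rows; left_join is simply the safest choice here.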
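To make the difference between the join types concrete, here is a minimal sketch on two hypothetical miniature tables (the tag names and values are made up, and `reference` and `categories` merely stand in for cats_uk_reference and cat_egories). It also uses anti_join() to list rows in the left table that have no match.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables (made-up values)
reference  <- tibble(tag_id = c("A", "B", "C"), prey_p_month = c(3, 0.5, 12.5))
categories <- tibble(tag_id = c("A", "B", "D"), part_of_day = c("Morning", "Night", "Evening"))

left  <- left_join(reference, categories, by = "tag_id")   # keeps A, B, C; C gets NA for part_of_day
inner <- inner_join(reference, categories, by = "tag_id")  # keeps only A and B
right <- right_join(reference, categories, by = "tag_id")  # keeps A, B, D; D gets NA for prey_p_month

# anti_join() returns the rows of the left table with no match in the right table
unmatched <- anti_join(reference, categories, by = "tag_id")

nrow(left)
nrow(inner)
nrow(right)
unmatched$tag_id
```

Running the same anti_join() check on the real tables would show whether any cat in the reference table lacks a part_of_day value.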

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(study_cats)
## Rows: 81
## Columns: 8
## $ animal_id    <chr> "Tommy", "Athena", "Lola", "Maverick", "Coco", "Charlie",…
## $ hunt         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ prey_p_month <dbl> 12.5, 3.0, 3.0, 3.0, 3.0, 3.0, 17.5, 3.0, 7.5, 3.0, 0.5, …
## $ animal_sex   <chr> "m", "f", "f", "m", "f", "m", "m", "m", "f", "f", "m", "m…
## $ hrs_indoors  <dbl> 12.5, 7.5, 17.5, 12.5, 12.5, 12.5, 7.5, 2.5, 12.5, 12.5, …
## $ food_wet     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, T…
## $ age_years    <dbl> 11, 3, 10, 7, 7, 6, 2, 4, 4, 8, 1, 10, 11, 11, 8, 3, 5, 4…
## $ part_of_day  <chr> "Morning", "Morning", "Afternoon", "Morning", "Night", "N…

Are the values what you expected for the variables? Why or Why not?

(I’m not entirely sure what this question is asking.) The variables are within the range that I expected given the initial datasets, and they are all of the correct data types. There are fewer rows than the initial dataset, since we filtered to only show the cats that are allowed to hunt (as is pertinent to our research question).

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

study_cats %>%
  group_by(part_of_day) %>% 
  summarize(prey_p_month) #based on part of day that cat hunts, show the prey per month
## `summarise()` has grouped output by 'part_of_day'. You can override using the
## `.groups` argument.
## # A tibble: 81 × 2
## # Groups:   part_of_day [4]
##    part_of_day prey_p_month
##    <chr>              <dbl>
##  1 Afternoon            3  
##  2 Afternoon            3  
##  3 Afternoon            0.5
##  4 Afternoon            0.5
##  5 Afternoon            0.5
##  6 Afternoon           12.5
##  7 Afternoon            3  
##  8 Afternoon            3  
##  9 Afternoon           17.5
## 10 Afternoon            0.5
## # … with 71 more rows
study_cats %>% 
  group_by(food_wet) %>% 
  summarize(prey_p_month) #based on whether the cat eats wet food, show the prey per month
## `summarise()` has grouped output by 'food_wet'. You can override using the
## `.groups` argument.
## # A tibble: 81 × 2
## # Groups:   food_wet [2]
##    food_wet prey_p_month
##    <lgl>           <dbl>
##  1 FALSE             3  
##  2 FALSE             7.5
##  3 FALSE             0.5
##  4 FALSE             0.5
##  5 FALSE             3  
##  6 FALSE             3  
##  7 FALSE             7.5
##  8 FALSE            17.5
##  9 FALSE             3  
## 10 FALSE             3  
## # … with 71 more rows
study_cats %>% 
  group_by(animal_sex) %>% 
  summarize(prey_p_month) #based on the cat's sex, show the prey per month
## `summarise()` has grouped output by 'animal_sex'. You can override using the
## `.groups` argument.
## # A tibble: 81 × 2
## # Groups:   animal_sex [2]
##    animal_sex prey_p_month
##    <chr>             <dbl>
##  1 f                   3  
##  2 f                   3  
##  3 f                   3  
##  4 f                   7.5
##  5 f                   3  
##  6 f                   3  
##  7 f                   0.5
##  8 f                  12.5
##  9 f                   3  
## 10 f                  12.5
## # … with 71 more rows
study_cats %>% 
  group_by(hrs_indoors) %>% 
  summarize(prey_p_month) #based on how many hours the cat spends indoors, show the prey per month
## `summarise()` has grouped output by 'hrs_indoors'. You can override using the
## `.groups` argument.
## # A tibble: 81 × 2
## # Groups:   hrs_indoors [5]
##    hrs_indoors prey_p_month
##          <dbl>        <dbl>
##  1         2.5          3  
##  2         2.5          3  
##  3         2.5          3  
##  4         2.5          3  
##  5         2.5          7.5
##  6         2.5         17.5
##  7         2.5         17.5
##  8         7.5          3  
##  9         7.5         17.5
## 10         7.5          0.5
## # … with 71 more rows

What are your findings about the summary? Are they what you expected?

It is nearly impossible to gauge how the variables are related based solely on looking at the numbers in a table. We really need visualization (and statistical analysis) for this! (I also didn’t expect to glean much information from looking at just the tables of numbers; so in a way, yes, this is what I expected.)
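Part of the difficulty is that summarize(prey_p_month) returns every row rather than one row per group. A minimal sketch of a more compact summary follows, using a hypothetical stand-in for study_cats (the values in `toy_cats` are made up): each group is collapsed to a count, a mean, and a median.

```r
library(dplyr)

# Hypothetical stand-in for study_cats (made-up values for illustration)
toy_cats <- tibble(
  food_wet     = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  prey_p_month = c(0.5, 3, 3, 7.5, 17.5)
)

prey_by_food <- toy_cats %>%
  group_by(food_wet) %>%
  summarize(
    n_cats      = n(),                                  # number of cats in the group
    mean_prey   = mean(prey_p_month, na.rm = TRUE),     # average prey per month
    median_prey = median(prey_p_month, na.rm = TRUE),   # less sensitive to a few large values
    .groups = "drop"
  )

prey_by_food
```

The same pattern should work for part_of_day, animal_sex, or hrs_indoors by changing the grouping variable.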

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ggplot(study_cats,
       aes(x = part_of_day,
           y = prey_p_month)
       ) +
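  geom_boxplot(outlier.shape = NA) + #hide the boxplot's own outlier points, since geom_jitter() already draws every cat
  geom_jitter(width = 0.2, height = 0) + #jitter horizontally only, so the prey counts are not shifted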
  labs(x = "Typical Time of Day for Hunting",
       y = "Number of Prey Brought Home per Month",
       title = "Number of Prey vs. Hunting Schedule")

ggplot(study_cats,
       aes(x = food_wet,
           y = prey_p_month)
       ) +
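  geom_boxplot(outlier.shape = NA) + #hide the boxplot's own outlier points, since geom_jitter() already draws every cat
  geom_jitter(width = 0.2, height = 0) + #jitter horizontally only, so the prey counts are not shifted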
  labs(x = "Given Wet Food",
       y = "Number of Prey Brought Home per Month",
       title = "Number of Prey vs. Type of Food Provided")

ggplot(study_cats,
       aes(x = animal_sex,
           y = prey_p_month)
       ) +
  geom_violin() +
  geom_jitter(width = 0.2, height = 0) + #jitter horizontally only, so the prey counts are not shifted
  labs(x = "Sex of Cat",
       y = "Number of Prey Brought Home per Month",
       title = "Number of Prey vs. Sex")

ggplot(study_cats,
       aes(x = hrs_indoors,
           y = prey_p_month,
           color = food_wet)
       ) +
  geom_jitter() +
  labs(x = "Hours per Day Spent Indoors",
       y = "Number of Prey Brought Home per Month",
       title = "Number of Prey vs. Time Indoors, by Food Type")

Number of Prey vs. Hunting Schedule: From visual inspection, it appears that cats whose typical hunting time falls in the morning bring home the most prey, although there is considerable spread within the Afternoon, Morning, and Night groups.

Number of Prey vs. Type of Food Provided: From visual inspection, it appears that on average, cats that do not receive wet food bring home more prey per month. This is interesting, especially since there are many more cats that are given wet food.

Number of Prey vs. Sex: From visual inspection, it is difficult to tell whether prey per month differs between male and female cats. Both distributions are skewed toward lower prey counts, but both also have quite a bit of spread and outliers with very high prey counts.

Number of Prey vs. Time Indoors, by Food Type: From visual inspection, it appears that in general the cats that spend more time indoors bring home fewer prey per month (which makes sense). It is difficult to tell from visuals alone whether this relationship differs depending on whether the cats are given wet food.

Final Summary (10 points)

Summarize your research question and findings below.

1. Is the typical time of day each cat hunts associated with how many prey they get per month?

From our preliminary, exploratory data analysis, it does seem that there could be an association between a cat’s typical hunting schedule (time of day) and how many prey they bring home per month. From the boxplot/jitterplot alone, it does look like cats who hunt primarily in the morning tend to bring home more prey, although this distribution is also pretty close to that of the nighttime hunters. Additionally, it should be noted that my categories of morning/afternoon/evening/night were rough estimates and somewhat arbitrary. Since each category was based on a single average per cat, it doesn’t take into account the fact that cats could have bi- or multi-modal hunting schedules.
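One possible way to look for bi- or multi-modal schedules is to categorize every ping rather than one average per cat, then compute each cat's share of pings in each part of the day. The sketch below is only an illustration: `toy_pings` holds made-up timestamps, the cutoffs mirror the ones used earlier, and GPS pings record activity in general rather than hunting specifically.

```r
library(dplyr)

# Hypothetical pings for two cats (made-up values, not taken from cats_uk)
toy_pings <- tibble(
  tag_id = c("A", "A", "A", "A", "B", "B", "B"),
  timestamp = as.POSIXct(
    c("2017-06-24 05:30:00", "2017-06-24 06:10:00",
      "2017-06-24 22:15:00", "2017-06-25 23:40:00",
      "2017-06-24 13:00:00", "2017-06-24 14:30:00", "2017-06-25 15:45:00"),
    tz = "UTC"
  )
)

ping_shares <- toy_pings %>%
  mutate(
    hour = as.numeric(format(timestamp, "%H")),  # hour of day for each individual ping
    part_of_day = case_when(
      hour < 5 | hour >= 21 ~ "Night",
      hour < 12             ~ "Morning",
      hour < 17             ~ "Afternoon",
      TRUE                  ~ "Evening"
    )
  ) %>%
  count(tag_id, part_of_day, name = "n_pings") %>%  # pings per cat per part of day
  group_by(tag_id) %>%
  mutate(share = n_pings / sum(n_pings)) %>%        # proportion of that cat's pings
  ungroup()

ping_shares
```

In this toy data, cat A splits its pings evenly between Morning and Night, a pattern that a single average timestamp would hide.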

2. Do cats who get wet food bring home fewer prey per month?

Again, based on our preliminary findings, it does seem like the type of food a cat is given is associated with how many prey they bring home per month. It appears that cats who are not fed wet food tend to have higher prey counts (although, of course, we are unable to determine causality). Also, it is worth noting that the majority of cats are fed wet food, so the sample size of cats without wet food is small to begin with.

3. Is there a difference in prey per month between male and female cats?

From our plot above, there does not appear to be an immediately obvious difference in prey counts between male and female cats. However, this doesn’t take into account the reproductive status (intact, spayed/neutered), which may affect the sex differences.

4. Is there a relationship between prey per month and number of hours the cat spends indoors?

From our plot, it appears that there could be a relationship between monthly prey count and time the cats spend indoors. In general, it looks like the prey counts decrease as the time spent indoors increases, which makes sense. However, there is a cluster of possible outliers with both low prey counts and low amounts of time spent indoors (so they are outside a lot but not bringing home much prey). Perhaps this is a coincidence, or perhaps the cats spend so much time outside that their owners are unaware of all the prey caught during the day? Additionally, we looked at this relationship with the additional variable of food type, but the pattern doesn’t seem to differ much between cats that are and are not given wet food.
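As a possible next step beyond visual inspection, a rank-based correlation could quantify this relationship. The sketch below uses made-up vectors shaped like hrs_indoors and prey_p_month (they are not the study data), and Spearman's correlation is my assumption of a reasonable choice here because both variables are binned and skewed.

```r
# Hypothetical values shaped like hrs_indoors and prey_p_month (made up for illustration)
hrs_indoors  <- c(2.5, 2.5, 7.5, 7.5, 12.5, 12.5, 17.5, 22.5)
prey_p_month <- c(17.5, 7.5, 12.5, 3, 3, 3, 0.5, 0.5)

# Spearman's rank correlation; exact = FALSE because the binned values contain ties
rho_test <- cor.test(hrs_indoors, prey_p_month, method = "spearman", exact = FALSE)

rho_test$estimate  # negative for this toy data: prey counts fall as hours indoors rise
rho_test$p.value
```

With the real study_cats data, the same call would take study_cats$hrs_indoors and study_cats$prey_p_month, and the result would still describe association only, not causation.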

Are your findings what you expected? Why or Why not?

1. Number of Prey vs. Hunting Schedule

The findings are somewhat what I expected, since I expected some kind of difference but didn’t know what time of day would be more/less associated with prey counts.

2. Number of Prey vs. Type of Food Provided

These findings are what I expected: that cats receiving wet food would in general bring home fewer prey. This was just based on a hunch that cats, as obligate carnivores, crave the taste/texture of meat (even if they also like and eat dry kibble).

3. Number of Prey vs. Sex

I didn’t strongly expect a difference between the sexes, although I thought perhaps that male cats would bring home more prey. From the plots, it appears that there’s no obvious difference between male and female cats, which is somewhat unexpected.

4. Number of Prey vs. Time Indoors

I expected that cats spending more time indoors would have less time to catch prey, and therefore have lower prey counts. It appears this may be the case since there seems to be an inverse relationship between hours spent indoors and prey per month.