Midterm

#loading packages
pacman::p_load(
  skimr,        
  tidyverse,    
  readxl,       
  visdat,      
  gtsummary,     
  janitor, 
  readr, 
  here,
  ggplot2,
  ghibli,       
  paletteer,
  forcats,      
  gt 
  )

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

This dataset has to do with UFO sightings around the world. I’m interested in looking at what area of the U.S. has had the most recorded UFO sightings, and I hypothesize that California is the top contributor. However, I’m also interested in looking at whether the shapes of the UFOs differ based on where the UFO was sighted.

I have always been interested in UFOs and, following recent news reports, have found that my interest in this topic has been piqued! Thinking about UFOs and the existence of aliens is terrifying and amazing at the same time, so it will be fun to look at a dataset that explores this. The data can be found here:

https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-06-25

Given your question, what is your expectation about the data?

I expect that California will have the highest number of recorded UFO sightings to date. I also expect that the most commonly seen UFO shape in all areas of the U.S. is a round shape, especially in the western part of the U.S.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

#Loading the data
ufo_sightings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-25/ufo_sightings.csv")

## Rows: 80332 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): date_time, city_area, state, country, ufo_shape, described_encounte...
## dbl (3): encounter_length, latitude, longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Looking at the data with glimpse() and skim()
glimpse(ufo_sightings)

## Rows: 80,332
## Columns: 11
## $ date_time                  <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10…
## $ city_area                  <chr> "san marcos", "lackland afb", "chester (uk/…
## $ state                      <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct",…
## $ country                    <chr> "us", NA, "gb", "us", "us", "us", "gb", "us…
## $ ufo_shape                  <chr> "cylinder", "light", "circle", "circle", "l…
## $ encounter_length           <dbl> 2700, 7200, 20, 20, 900, 300, 180, 1200, 18…
## $ described_encounter_length <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2…
## $ description                <chr> "This event took place in early fall around…
## $ date_documented            <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/…
## $ latitude                   <dbl> 29.88306, 29.38421, 53.20000, 28.97833, 21.…
## $ longitude                  <dbl> -97.941111, -98.581082, -2.916667, -96.6458…

skim(ufo_sightings)

Data summary
Name	ufo_sightings
Number of rows	80332
Number of columns	11
_______________________
Column type frequency:
character	8
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
date_time	0	1.00	14	16	69586
city_area	0	1.00	1	69	19900
state	5797	0.93	2	2	67
country	9670	0.88	2	2	5
ufo_shape	1932	0.98	3	9	29
described_encounter_length	0	1.00	2	31	8349
description	15	1.00	1	246	79996
date_documented	0	1.00	8	10	317

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
encounter_length	3	1	9017.23	620228.37	0.00	30.00	180.00	600.00	9.7836e+07	▇▁▁▁▁
latitude	1	1	38.12	10.47	-82.86	34.13	39.41	42.79	7.2700e+01	▁▁▁▇▅
longitude	0	1	-86.77	39.70	-176.66	-112.07	-87.90	-78.75	1.7844e+02	▃▇▁▁▁

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

First, there are some NAs in this dataset, which I will remove below. Missing data is marked only as “NA”. I will remove every row that contains an NA, as this will make it easier to analyze the rest of the data for this specific project. Second, I would like to capitalize the states and countries. Third, I would like to create a new variable that groups the encounter lengths into a small number of categories. There is a variable in the original dataset that does this, but it is very messy, so I wanted a cleaner version. Fourth, since there are nearly 30 UFO shapes (skim() reports 29 unique values), I would like to combine them into a few broader categories. Lastly, I would like to group the states into a few regions of the U.S. These last two transformations will occur closer to the end of this analysis, when I’m creating my last graph.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

#creating a new dataset with just the variables I'm interested in (cleaning the data set)
ufo_sightings_clean<-ufo_sightings%>%select(date_time, city_area, state, country, ufo_shape, encounter_length)

I wanted to select only these variables for my dataset: date_time, city_area, state, country, ufo_shape, and encounter_length (in seconds). I also wanted information about the U.S. only, so I will need to remove all other countries from the country variable.

#checking how many NAs are in each variable and dividing by number of total rows
(is.na(ufo_sightings_clean$date_time)%>% sum())/nrow(ufo_sightings_clean)

## [1] 0

(is.na(ufo_sightings_clean$city_area)%>% sum())/nrow(ufo_sightings_clean)

## [1] 0

(is.na(ufo_sightings_clean$state)%>% sum())/nrow(ufo_sightings_clean)

## [1] 0.07216302

(is.na(ufo_sightings_clean$country)%>% sum())/nrow(ufo_sightings_clean)

## [1] 0.1203754

(is.na(ufo_sightings_clean$ufo_shape)%>% sum())/nrow(ufo_sightings_clean)

## [1] 0.02405019

(is.na(ufo_sightings_clean$encounter_length)%>% sum())/nrow(ufo_sightings_clean)

## [1] 3.734502e-05

#There is only one variable in my dataset that has >10% NAs (the country variable at 12%). However, for this project I am still choosing to keep the variable in my dataset. 
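
The six per-column checks above can also be done in one call. A minimal sketch, using a small made-up data frame (`toy` is hypothetical, not the real data): `is.na()` on a data frame returns a logical matrix, and `colMeans()` of that matrix gives the share of missing values in each column.

```r
# Toy data standing in for ufo_sightings_clean (values are made up)
toy <- data.frame(
  state     = c("tx", NA, "hi", "tn"),
  country   = c("us", NA, NA, "us"),
  ufo_shape = c("light", "circle", "disk", "disk"),
  stringsAsFactors = FALSE
)

# Proportion of missing values in every column at once
na_share <- colMeans(is.na(toy))
na_share

# Columns where more than 10% of the values are missing
names(na_share)[na_share > 0.10]
```

On the real data, the same call would be `colMeans(is.na(ufo_sightings_clean))`.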


#removing all rows with NA or missing values
ufo_sightings_clean_noNA<-na.omit(ufo_sightings_clean)

#removing all other countries besides the US from the country variable
ufo_sightings_clean_noNA_US<-ufo_sightings_clean_noNA %>%
  filter(country == "us")
ufo_sightings_clean_noNA_US

## # A tibble: 63,561 × 6
##    date_time        city_area  state country ufo_shape encounter_length
##    <chr>            <chr>      <chr> <chr>   <chr>                <dbl>
##  1 10/10/1949 20:30 san marcos tx    us      cylinder              2700
##  2 10/10/1956 21:00 edna       tx    us      circle                  20
##  3 10/10/1960 20:00 kaneohe    hi    us      light                  900
##  4 10/10/1961 19:00 bristol    tn    us      sphere                 300
##  5 10/10/1965 23:45 norwalk    ct    us      disk                  1200
##  6 10/10/1966 20:00 pell city  al    us      disk                   180
##  7 10/10/1966 21:00 live oak   fl    us      disk                   120
##  8 10/10/1968 13:00 hawthorne  ca    us      circle                 300
##  9 10/10/1968 19:00 brevard    nc    us      fireball               180
## 10 10/10/1970 16:00 bellmore   ny    us      disk                  1800
## # … with 63,551 more rows

#The filter function removed 2,963 observations that were outside of the US, leaving 63,561 observations
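
To see how many rows each cleaning step drops, the row counts can be compared before and after each step. This is a sketch on a made-up data frame (`toy` and its values are hypothetical). One thing it illustrates: `na.omit()` also drops rows whose country is missing, even if the sighting may have been in the U.S. (in the glimpse() output above, the second row has state "tx" but country NA).

```r
# Toy data: five sightings, some with missing fields (values are made up)
toy <- data.frame(
  country   = c("us", "gb", "us", NA, "ca"),
  ufo_shape = c("light", "disk", NA, "oval", "light"),
  stringsAsFactors = FALSE
)

n_start <- nrow(toy)

# Step 1: drop every row that contains at least one NA
no_na <- na.omit(toy)

# Step 2: keep only U.S. sightings
us_only <- no_na[no_na$country == "us", ]

dropped_by_na     <- n_start - nrow(no_na)
dropped_by_filter <- nrow(no_na) - nrow(us_only)

c(start = n_start, dropped_by_na = dropped_by_na,
  dropped_by_filter = dropped_by_filter, kept = nrow(us_only))
```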

#mutating the encounter_length variable 
ufo_sightings_clean_noNA_US<-ufo_sightings_clean_noNA_US%>%
  mutate(
    encounter_range = factor(case_when(
      encounter_length < 60 ~ "0-59 seconds",
      encounter_length >= 60 & encounter_length < 300 ~ "1-4.9 minutes", 
      encounter_length >= 300 & encounter_length < 600 ~ "5-9.9 minutes", 
      encounter_length >= 600 & encounter_length < 1800 ~ "10-29.9 minutes", 
      encounter_length >= 1800 & encounter_length < 3600 ~ "30-59.9 minutes",
      encounter_length >= 3600 ~ "+60 minutes"
    ),levels = c("0-59 seconds", 
                 "1-4.9 minutes", 
                 "5-9.9 minutes", 
                 "10-29.9 minutes", 
                 "30-59.9 minutes", 
                 "+60 minutes"))
  )
ufo_sightings_clean_noNA_US

## # A tibble: 63,561 × 7
##    date_time        city_area  state country ufo_shape encounter_length encoun…¹
##    <chr>            <chr>      <chr> <chr>   <chr>                <dbl> <fct>   
##  1 10/10/1949 20:30 san marcos tx    us      cylinder              2700 30-59.9…
##  2 10/10/1956 21:00 edna       tx    us      circle                  20 0-59 se…
##  3 10/10/1960 20:00 kaneohe    hi    us      light                  900 10-29.9…
##  4 10/10/1961 19:00 bristol    tn    us      sphere                 300 5-9.9 m…
##  5 10/10/1965 23:45 norwalk    ct    us      disk                  1200 10-29.9…
##  6 10/10/1966 20:00 pell city  al    us      disk                   180 1-4.9 m…
##  7 10/10/1966 21:00 live oak   fl    us      disk                   120 1-4.9 m…
##  8 10/10/1968 13:00 hawthorne  ca    us      circle                 300 5-9.9 m…
##  9 10/10/1968 19:00 brevard    nc    us      fireball               180 1-4.9 m…
## 10 10/10/1970 16:00 bellmore   ny    us      disk                  1800 30-59.9…
## # … with 63,551 more rows, and abbreviated variable name ¹encounter_range

#checking the encounter range
ufo_sightings_clean_noNA_US%>%
  tabyl(encounter_range)

##  encounter_range     n    percent
##     0-59 seconds 18399 0.28946996
##    1-4.9 minutes 16942 0.26654710
##    5-9.9 minutes  8777 0.13808782
##  10-29.9 minutes 11674 0.18366608
##  30-59.9 minutes  3621 0.05696890
##      +60 minutes  4148 0.06526014
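
As an alternative to case_when(), the same bins could be built with base R's `cut()`. This is only a sketch on a few made-up durations (`secs` is hypothetical); `right = FALSE` makes each interval closed on the left, so 60 seconds falls in "1-4.9 minutes", matching the `>=` and `<` conditions used above.

```r
# A few example durations in seconds (made up for illustration)
secs <- c(20, 59, 60, 300, 1800, 3600, 7200)

range_labels <- c("0-59 seconds", "1-4.9 minutes", "5-9.9 minutes",
                  "10-29.9 minutes", "30-59.9 minutes", "+60 minutes")

# Intervals are [0, 60), [60, 300), ..., [3600, Inf)
encounter_range <- cut(
  secs,
  breaks = c(0, 60, 300, 600, 1800, 3600, Inf),
  labels = range_labels,
  right  = FALSE
)

table(encounter_range)
```

`cut()` returns a factor whose levels follow the order of `labels`, so no separate `levels =` argument is needed.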

#capitalizing states and countries
ufo_sightings_clean_noNA_US <- ufo_sightings_clean_noNA_US %>%
    mutate(
      across(.cols = c(state, country),
             .fns = str_to_upper)
      ) %>%
  glimpse()

## Rows: 63,561
## Columns: 7
## $ date_time        <chr> "10/10/1949 20:30", "10/10/1956 21:00", "10/10/1960 2…
## $ city_area        <chr> "san marcos", "edna", "kaneohe", "bristol", "norwalk"…
## $ state            <chr> "TX", "TX", "HI", "TN", "CT", "AL", "FL", "CA", "NC",…
## $ country          <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US",…
## $ ufo_shape        <chr> "cylinder", "circle", "light", "sphere", "disk", "dis…
## $ encounter_length <dbl> 2700, 20, 900, 300, 1200, 180, 120, 300, 180, 1800, 1…
## $ encounter_range  <fct> 30-59.9 minutes, 0-59 seconds, 10-29.9 minutes, 5-9.9…

#capitalizing cities
ufo_sightings_clean_noNA_US <- ufo_sightings_clean_noNA_US %>%
    mutate(
      across(.cols = c(city_area),
             .fns = str_to_title)
      ) %>%
  glimpse()

## Rows: 63,561
## Columns: 7
## $ date_time        <chr> "10/10/1949 20:30", "10/10/1956 21:00", "10/10/1960 2…
## $ city_area        <chr> "San Marcos", "Edna", "Kaneohe", "Bristol", "Norwalk"…
## $ state            <chr> "TX", "TX", "HI", "TN", "CT", "AL", "FL", "CA", "NC",…
## $ country          <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US",…
## $ ufo_shape        <chr> "cylinder", "circle", "light", "sphere", "disk", "dis…
## $ encounter_length <dbl> 2700, 20, 900, 300, 1200, 180, 120, 300, 180, 1800, 1…
## $ encounter_range  <fct> 30-59.9 minutes, 0-59 seconds, 10-29.9 minutes, 5-9.9…

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(ufo_sightings_clean_noNA_US)

## Rows: 63,561
## Columns: 7
## $ date_time        <chr> "10/10/1949 20:30", "10/10/1956 21:00", "10/10/1960 2…
## $ city_area        <chr> "San Marcos", "Edna", "Kaneohe", "Bristol", "Norwalk"…
## $ state            <chr> "TX", "TX", "HI", "TN", "CT", "AL", "FL", "CA", "NC",…
## $ country          <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US",…
## $ ufo_shape        <chr> "cylinder", "circle", "light", "sphere", "disk", "dis…
## $ encounter_length <dbl> 2700, 20, 900, 300, 1200, 180, 120, 300, 180, 1800, 1…
## $ encounter_range  <fct> 30-59.9 minutes, 0-59 seconds, 10-29.9 minutes, 5-9.9…

Are the values what you expected for the variables? Why or Why not?

Yes, they are what I expected. I think the data looks cleaner this way, especially with the new encounter range variable and the state and country variables capitalized.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#Counting the number of sightings in each encounter-length range across all states
range_counts<-ufo_sightings_clean_noNA_US%>%
  group_by(encounter_range)%>%
  summarize(count = n())
range_counts

## # A tibble: 6 × 2
##   encounter_range count
##   <fct>           <int>
## 1 0-59 seconds    18399
## 2 1-4.9 minutes   16942
## 3 5-9.9 minutes    8777
## 4 10-29.9 minutes 11674
## 5 30-59.9 minutes  3621
## 6 +60 minutes      4148

states<-ufo_sightings_clean_noNA_US %>%
  group_by(state) %>%
    summarize(count = n())

#California has had the largest number of recorded UFO encounters
(states%>%
  arrange(desc(count)))

## # A tibble: 52 × 2
##    state count
##    <chr> <int>
##  1 CA     8684
##  2 FL     3754
##  3 WA     3708
##  4 TX     3399
##  5 NY     2915
##  6 IL     2447
##  7 AZ     2362
##  8 PA     2319
##  9 OH     2252
## 10 MI     1781
## # … with 42 more rows

#DC and Puerto Rico are part of the U.S. but are not states; I included them in this analysis anyway. They have had the fewest recorded UFO encounters, followed by North Dakota and Delaware.
(states%>%
  arrange(count))

## # A tibble: 52 × 2
##    state count
##    <chr> <int>
##  1 DC        7
##  2 PR       24
##  3 ND      123
##  4 DE      165
##  5 WY      169
##  6 SD      177
##  7 RI      224
##  8 VT      254
##  9 HI      257
## 10 AK      311
## # … with 42 more rows

What are your findings about the summary? Are they what you expected?

Yes, my hypothesis was correct that California has had the most reported UFO encounters. However, I am surprised at how many more encounters California has had compared to all the other states and areas of the U.S.
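
To put a number on how far ahead California is, each state's count can be expressed as a share of all sightings and as a ratio to the runner-up. The sketch below uses the top three counts from the table above and the 63,561 rows in the cleaned data; the object names (`top_states`, `total_sightings`) are made up for illustration.

```r
# Top three states, counts copied from the summary table above
top_states <- data.frame(
  state = c("CA", "FL", "WA"),
  count = c(8684, 3754, 3708)
)

# Total number of rows in the cleaned U.S. dataset
total_sightings <- 63561

# Share of all U.S. sightings contributed by each state
top_states$share <- top_states$count / total_sightings

# How many times larger California's count is than second-place Florida's
ca_vs_fl <- top_states$count[1] / top_states$count[2]

top_states
ca_vs_fl
```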

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

#Graph displaying total UFO Exposures based on State
ggplot(data = states) +
  aes(y = reorder(state, -count), x = count) +
  geom_col(fill = "darkslategrey") +
  theme_classic() +
  xlab("UFO Exposures") +
  ylab("States")

#combining UFO shapes into smaller categories (Other, Triangle, Round, Light)
ufo_shapes<-ufo_sightings_clean_noNA_US%>%
  mutate(
      shape = factor(case_when(
      ufo_shape %in% c("unknown", "other", "formation", "changed", "changing", "cross", "hexagon", "cylinder", "cigar", "rectangle") ~ "Other",
      ufo_shape %in% c("cone", "chevron", "delta", "triangle", "teardrop",  "pyramid", "diamond")  ~ "Triangle", 
      ufo_shape %in% c("sphere", "oval", "round", "egg", "circle", "disk", "crescent") ~ "Round",
      ufo_shape %in% c("flash", "light", "flare", "fireball") ~ "Light",
    )))
ufo_shapes

## # A tibble: 63,561 × 8
##    date_time        city_area  state country ufo_shape encounter…¹ encou…² shape
##    <chr>            <chr>      <chr> <chr>   <chr>           <dbl> <fct>   <fct>
##  1 10/10/1949 20:30 San Marcos TX    US      cylinder         2700 30-59.… Other
##  2 10/10/1956 21:00 Edna       TX    US      circle             20 0-59 s… Round
##  3 10/10/1960 20:00 Kaneohe    HI    US      light             900 10-29.… Light
##  4 10/10/1961 19:00 Bristol    TN    US      sphere            300 5-9.9 … Round
##  5 10/10/1965 23:45 Norwalk    CT    US      disk             1200 10-29.… Round
##  6 10/10/1966 20:00 Pell City  AL    US      disk              180 1-4.9 … Round
##  7 10/10/1966 21:00 Live Oak   FL    US      disk              120 1-4.9 … Round
##  8 10/10/1968 13:00 Hawthorne  CA    US      circle            300 5-9.9 … Round
##  9 10/10/1968 19:00 Brevard    NC    US      fireball          180 1-4.9 … Light
## 10 10/10/1970 16:00 Bellmore   NY    US      disk             1800 30-59.… Round
## # … with 63,551 more rows, and abbreviated variable names ¹encounter_length,
## #   ²encounter_range

#counting sightings of each shape category within each state
shapes<-ufo_shapes %>%
  group_by(state, shape) %>%
    summarize(count = n())

## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.

#combining states into larger areas of the U.S. (Northeast, Southwest, West, Southeast, Midwest)
shapes1<-ufo_shapes%>%
  mutate(
      Area = factor(case_when(
      state %in% c("ME", "MA", "RI", "CT", "NH", "VT", "NY", "PA", "NJ", "DE", "MD", "DC") ~ "Northeast",
      state %in% c("TX", "OK", "NM", "AZ")  ~ "Southwest", 
      state %in% c("CO", "WY", "MT", "ID", "WA", "OR", "UT", "NV", "AK", "CA", "HI") ~ "West", 
      state %in% c("WV", "VA", "KY", "TN", "NC", "SC", "GA", "AL", "MS", "AR", "LA", "FL", "PR") ~ "Southeast",
      state %in% c("OH", "IN", "MI", "IL", "MO", "WI", "MN", "IA", "KS", "NE", "SD", "ND") ~ "Midwest"
    )))
shapes1

## # A tibble: 63,561 × 9
##    date_time        city_area  state country ufo_s…¹ encou…² encou…³ shape Area 
##    <chr>            <chr>      <chr> <chr>   <chr>     <dbl> <fct>   <fct> <fct>
##  1 10/10/1949 20:30 San Marcos TX    US      cylind…    2700 30-59.… Other Sout…
##  2 10/10/1956 21:00 Edna       TX    US      circle       20 0-59 s… Round Sout…
##  3 10/10/1960 20:00 Kaneohe    HI    US      light       900 10-29.… Light West 
##  4 10/10/1961 19:00 Bristol    TN    US      sphere      300 5-9.9 … Round Sout…
##  5 10/10/1965 23:45 Norwalk    CT    US      disk       1200 10-29.… Round Nort…
##  6 10/10/1966 20:00 Pell City  AL    US      disk        180 1-4.9 … Round Sout…
##  7 10/10/1966 21:00 Live Oak   FL    US      disk        120 1-4.9 … Round Sout…
##  8 10/10/1968 13:00 Hawthorne  CA    US      circle      300 5-9.9 … Round West 
##  9 10/10/1968 19:00 Brevard    NC    US      fireba…     180 1-4.9 … Light Sout…
## 10 10/10/1970 16:00 Bellmore   NY    US      disk       1800 30-59.… Round Nort…
## # … with 63,551 more rows, and abbreviated variable names ¹ufo_shape,
## #   ²encounter_length, ³encounter_range
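
Because case_when() has no catch-all branch here, any state code missing from the five lists would silently become NA in Area. A quick way to check, sketched below in base R, is to compare the codes in the lists against the codes that appear in the data. The region lists are copied from the code above; `observed` is a made-up vector that includes a deliberately invalid code ("XX") to show what an unassigned value looks like.

```r
# Region lists copied from the case_when() call above
regions <- list(
  Northeast = c("ME", "MA", "RI", "CT", "NH", "VT", "NY", "PA", "NJ", "DE", "MD", "DC"),
  Southwest = c("TX", "OK", "NM", "AZ"),
  West      = c("CO", "WY", "MT", "ID", "WA", "OR", "UT", "NV", "AK", "CA", "HI"),
  Southeast = c("WV", "VA", "KY", "TN", "NC", "SC", "GA", "AL", "MS", "AR", "LA", "FL", "PR"),
  Midwest   = c("OH", "IN", "MI", "IL", "MO", "WI", "MN", "IA", "KS", "NE", "SD", "ND")
)
all_codes <- unlist(regions, use.names = FALSE)

# Hypothetical observed codes; "XX" is not in any region list
observed <- c("TX", "CA", "PR", "DC", "XX")

# Codes in the data that no region claims (these would get Area = NA)
unassigned <- setdiff(observed, all_codes)

# Codes listed in more than one region (only the first match would be used)
listed_twice <- all_codes[duplicated(all_codes)]

length(all_codes)
unassigned
listed_twice
```

On the real data, `observed` would be `unique(ufo_shapes$state)`. The lists contain 52 codes, which matches the 52 rows in the state summary above.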

#graph displaying UFO encounters based on shape given U.S. area in which the encounter was recorded
ggplot(shapes1, aes(fill = Area, x = shape)) +
  geom_bar(position = "stack") +
  scale_fill_manual(values = c('deepskyblue2', 'cornflowerblue', 'cadetblue1', 'aquamarine3', 'deeppink4')) +
  theme(axis.text.y = element_blank())
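
A stacked bar chart shows totals well, but it can be hard to read which shape is most common within each area. One way to check this directly is a cross-tabulation of area by shape. The sketch below uses a tiny made-up data frame (`toy`, hypothetical values) in place of `shapes1`.

```r
# Toy data standing in for shapes1 (values are made up)
toy <- data.frame(
  Area  = c("West", "West", "West", "Midwest", "Midwest", "Northeast"),
  shape = c("Light", "Light", "Round", "Round", "Round", "Other")
)

# Counts of each shape category within each area
tab <- table(toy$Area, toy$shape)

# Row-wise proportions: share of each shape within an area
row_share <- prop.table(tab, margin = 1)

# Most frequent shape category in each area
top_shape <- colnames(tab)[apply(tab, 1, which.max)]
names(top_shape) <- rownames(tab)

tab
round(row_share, 2)
top_shape
```

On the real data, the first step would be `table(shapes1$Area, shapes1$shape)`.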

Final Summary (10 points)

Summarize your research question and findings below.

California has reported the most UFOs, while North Dakota has reported the fewest of any state (DC and Puerto Rico reported even fewer). Additionally, the most common reports across all states were of some sort of light, followed by round shapes.

Are your findings what you expected? Why or Why not?

The findings are what I expected in terms of California having the most UFO encounters of all time. However, contrary to my hypothesis, people report lights more often than round shapes.