#loading packages
pacman::p_load(
skimr,
tidyverse,
readxl,
visdat,
gtsummary,
janitor,
readr,
here,
ggplot2,
ghibli,
paletteer,
forcats,
gt
)
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
This dataset has to do with UFO sightings around the world. I’m interested in looking at what area of the U.S. has had the most recorded UFO sightings, and I hypothesize that California is the top contributor. However, I’m also interested in looking at whether the shapes of the UFOs differ based on where the UFO was sighted.
I have always been interested in UFOs and, based on recent news reports, have found that my interest in this topic has been piqued! Thinking about UFOs and the existence of aliens is terrifying and amazing at the same time, so it will be fun to look at a dataset that explores this. The data can be found here:
https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-06-25
Given your question, what is your expectation about the data?
I expect that California will have the highest number of recorded UFO sightings to date. I also expect that the UFO shape most often seen in all areas of the U.S. is a round shape, especially in the western part of the U.S.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
#Loading the data
ufo_sightings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-25/ufo_sightings.csv")
## Rows: 80332 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): date_time, city_area, state, country, ufo_shape, described_encounte...
## dbl (3): encounter_length, latitude, longitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Looking at the data with glimpse() and skim()
glimpse(ufo_sightings)
## Rows: 80,332
## Columns: 11
## $ date_time <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10…
## $ city_area <chr> "san marcos", "lackland afb", "chester (uk/…
## $ state <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct",…
## $ country <chr> "us", NA, "gb", "us", "us", "us", "gb", "us…
## $ ufo_shape <chr> "cylinder", "light", "circle", "circle", "l…
## $ encounter_length <dbl> 2700, 7200, 20, 20, 900, 300, 180, 1200, 18…
## $ described_encounter_length <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2…
## $ description <chr> "This event took place in early fall around…
## $ date_documented <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/…
## $ latitude <dbl> 29.88306, 29.38421, 53.20000, 28.97833, 21.…
## $ longitude <dbl> -97.941111, -98.581082, -2.916667, -96.6458…
skim(ufo_sightings)
Name | ufo_sightings |
Number of rows | 80332 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
date_time | 0 | 1.00 | 14 | 16 | 0 | 69586 | 0 |
city_area | 0 | 1.00 | 1 | 69 | 0 | 19900 | 0 |
state | 5797 | 0.93 | 2 | 2 | 0 | 67 | 0 |
country | 9670 | 0.88 | 2 | 2 | 0 | 5 | 0 |
ufo_shape | 1932 | 0.98 | 3 | 9 | 0 | 29 | 0 |
described_encounter_length | 0 | 1.00 | 2 | 31 | 0 | 8349 | 0 |
description | 15 | 1.00 | 1 | 246 | 0 | 79996 | 0 |
date_documented | 0 | 1.00 | 8 | 10 | 0 | 317 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
encounter_length | 3 | 1 | 9017.23 | 620228.37 | 0.00 | 30.00 | 180.00 | 600.00 | 9.7836e+07 | ▇▁▁▁▁ |
latitude | 1 | 1 | 38.12 | 10.47 | -82.86 | 34.13 | 39.41 | 42.79 | 7.2700e+01 | ▁▁▁▇▅ |
longitude | 0 | 1 | -86.77 | 39.70 | -176.66 | -112.07 | -87.90 | -78.75 | 1.7844e+02 | ▃▇▁▁▁ |
If there are any quirks that you have to deal with, such as NA coded as something else or the data being spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
First, there are some NAs in this dataset which I will remove below. Missing data is marked as “NA” only. I will remove all rows that have an NA listed, as this will make it easier to analyze the rest of the data for this specific project. Second, I would also like to capitalize the states and countries. Third, I would like to create a new variable that groups the encounter times into smaller categories. There was a variable in the original dataset that did this, but it was very messy and so I wanted a cleaner version. Fourth, since there are 28 UFO shapes, I would like to combine them into smaller categories. Lastly, I would also like to group the states into smaller areas of the U.S. Both of these transformations will occur closer to the end of this analysis when I’m creating my last graph.
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
#creating a new dataset with just the variables I'm interested in (cleaning the data set)
ufo_sightings_clean <- ufo_sightings %>%
  select(date_time, city_area, state, country, ufo_shape, encounter_length)
I wanted to select only these variables for my dataset: date_time, city_area, state, country, ufo_shape, and encounter_length (in seconds). I also only wanted information about the US, so I will need to remove all other countries from the country variable.
#checking how many NAs are in each variable and dividing by number of total rows
(is.na(ufo_sightings_clean$date_time)%>% sum())/nrow(ufo_sightings_clean)
## [1] 0
(is.na(ufo_sightings_clean$city_area)%>% sum())/nrow(ufo_sightings_clean)
## [1] 0
(is.na(ufo_sightings_clean$state)%>% sum())/nrow(ufo_sightings_clean)
## [1] 0.07216302
(is.na(ufo_sightings_clean$country)%>% sum())/nrow(ufo_sightings_clean)
## [1] 0.1203754
(is.na(ufo_sightings_clean$ufo_shape)%>% sum())/nrow(ufo_sightings_clean)
## [1] 0.02405019
(is.na(ufo_sightings_clean$encounter_length)%>% sum())/nrow(ufo_sightings_clean)
## [1] 3.734502e-05
#There is only one variable in my dataset that has >10% NAs (the country variable at 12%). However, for this project I am still choosing to keep the variable in my dataset.
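The six per-column checks above can also be done in a single call. The following is a minimal base R sketch on a made-up data frame (`toy` and its values are hypothetical stand-ins, not the UFO data); it relies on `is.na()` returning a logical matrix when given a data frame.

```r
# Toy data frame standing in for ufo_sightings_clean (values are made up)
toy <- data.frame(
  state = c("tx", NA, "ca", "ny"),
  country = c("us", NA, NA, "us"),
  encounter_length = c(2700, 20, NA, 300),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colMeans() averages each column,
# which is the share of missing values per variable
na_share <- colMeans(is.na(toy))
na_share
```

Applied to the real table, the same call should give one proportion per variable, which makes it easier to spot any column above a chosen threshold such as 10%.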
#removing all rows with NA or missing values
ufo_sightings_clean_noNA<-na.omit(ufo_sightings_clean)
#removing all other countries besides the US from the country variable
ufo_sightings_clean_noNA_US<-ufo_sightings_clean_noNA %>%
filter(country == "us")
ufo_sightings_clean_noNA_US
## # A tibble: 63,561 × 6
## date_time city_area state country ufo_shape encounter_length
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 10/10/1949 20:30 san marcos tx us cylinder 2700
## 2 10/10/1956 21:00 edna tx us circle 20
## 3 10/10/1960 20:00 kaneohe hi us light 900
## 4 10/10/1961 19:00 bristol tn us sphere 300
## 5 10/10/1965 23:45 norwalk ct us disk 1200
## 6 10/10/1966 20:00 pell city al us disk 180
## 7 10/10/1966 21:00 live oak fl us disk 120
## 8 10/10/1968 13:00 hawthorne ca us circle 300
## 9 10/10/1968 19:00 brevard nc us fireball 180
## 10 10/10/1970 16:00 bellmore ny us disk 1800
## # … with 63,551 more rows
#The filter function removed 2,963 observations that were outside of the US, leaving 63,561 observations
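One way to double-check how many rows each cleaning step drops is to record `nrow()` before and after it. This is a minimal base R sketch on a made-up data frame (`toy` and its values are hypothetical); it uses base subsetting in place of `filter()`.

```r
# Toy data: two rows contain an NA, one complete row is from outside the US
toy <- data.frame(
  state = c("tx", NA, "on", "ca", "ny"),
  country = c("us", "us", "ca", "us", NA),
  stringsAsFactors = FALSE
)

n_start <- nrow(toy)

# Step 1: drop every row that has at least one NA
toy_no_na <- na.omit(toy)
n_after_na <- nrow(toy_no_na)

# Step 2: keep only US rows (base R equivalent of filter(country == "us"))
toy_us <- toy_no_na[toy_no_na$country == "us", ]
n_after_us <- nrow(toy_us)

# Rows dropped at each step
dropped <- c(
  dropped_for_na = n_start - n_after_na,
  dropped_non_us = n_after_na - n_after_us
)
dropped
```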
#mutating the encounter_length variable
ufo_sightings_clean_noNA_US<-ufo_sightings_clean_noNA_US%>%
mutate(
encounter_range = factor(case_when(
encounter_length < 60 ~ "0-59 seconds",
encounter_length >= 60 & encounter_length < 300 ~ "1-4.9 minutes",
encounter_length >= 300 & encounter_length < 600 ~ "5-9.9 minutes",
encounter_length >= 600 & encounter_length < 1800 ~ "10-29.9 minutes",
encounter_length >= 1800 & encounter_length < 3600 ~ "30-59.9 minutes",
encounter_length >= 3600 ~ "+60 minutes"
),levels = c("0-59 seconds",
"1-4.9 minutes",
"5-9.9 minutes",
"10-29.9 minutes",
"30-59.9 minutes",
"+60 minutes"))
)
ufo_sightings_clean_noNA_US
## # A tibble: 63,561 × 7
## date_time city_area state country ufo_shape encounter_length encoun…¹
## <chr> <chr> <chr> <chr> <chr> <dbl> <fct>
## 1 10/10/1949 20:30 san marcos tx us cylinder 2700 30-59.9…
## 2 10/10/1956 21:00 edna tx us circle 20 0-59 se…
## 3 10/10/1960 20:00 kaneohe hi us light 900 10-29.9…
## 4 10/10/1961 19:00 bristol tn us sphere 300 5-9.9 m…
## 5 10/10/1965 23:45 norwalk ct us disk 1200 10-29.9…
## 6 10/10/1966 20:00 pell city al us disk 180 1-4.9 m…
## 7 10/10/1966 21:00 live oak fl us disk 120 1-4.9 m…
## 8 10/10/1968 13:00 hawthorne ca us circle 300 5-9.9 m…
## 9 10/10/1968 19:00 brevard nc us fireball 180 1-4.9 m…
## 10 10/10/1970 16:00 bellmore ny us disk 1800 30-59.9…
## # … with 63,551 more rows, and abbreviated variable name ¹encounter_range
#checking the encounter range
ufo_sightings_clean_noNA_US%>%
tabyl(encounter_range)
## encounter_range n percent
## 0-59 seconds 18399 0.28946996
## 1-4.9 minutes 16942 0.26654710
## 5-9.9 minutes 8777 0.13808782
## 10-29.9 minutes 11674 0.18366608
## 30-59.9 minutes 3621 0.05696890
## +60 minutes 4148 0.06526014
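As an alternative to the `case_when()` ladder, base R's `cut()` can produce the same kind of ordered bins. This is only a sketch under stated assumptions: the sample durations are made up, the breaks mirror the thresholds used above, and `right = FALSE` makes each interval closed on the left so that exactly 300 seconds falls into "5-9.9 minutes". Negative durations would become NA here, which differs from the `case_when()` version.

```r
# Made-up encounter lengths in seconds, chosen to sit on the bin edges
encounter_length <- c(20, 59, 60, 300, 900, 1800, 3600, 7200)

range_labels <- c("0-59 seconds", "1-4.9 minutes", "5-9.9 minutes",
                  "10-29.9 minutes", "30-59.9 minutes", "+60 minutes")

# right = FALSE gives intervals [0, 60), [60, 300), ..., [3600, Inf)
encounter_range <- cut(
  encounter_length,
  breaks = c(0, 60, 300, 600, 1800, 3600, Inf),
  labels = range_labels,
  right = FALSE
)

data.frame(encounter_length, encounter_range)
```

The result is a factor whose levels follow the order of `range_labels`, so no separate `levels =` argument is needed.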
#capitalizing states and countries
ufo_sightings_clean_noNA_US <- ufo_sightings_clean_noNA_US %>%
mutate(
across(.cols = c(state, country),
.fns = str_to_upper)
) %>%
glimpse()
## Rows: 63,561
## Columns: 7
## $ date_time <chr> "10/10/1949 20:30", "10/10/1956 21:00", "10/10/1960 2…
## $ city_area <chr> "san marcos", "edna", "kaneohe", "bristol", "norwalk"…
## $ state <chr> "TX", "TX", "HI", "TN", "CT", "AL", "FL", "CA", "NC",…
## $ country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US",…
## $ ufo_shape <chr> "cylinder", "circle", "light", "sphere", "disk", "dis…
## $ encounter_length <dbl> 2700, 20, 900, 300, 1200, 180, 120, 300, 180, 1800, 1…
## $ encounter_range <fct> 30-59.9 minutes, 0-59 seconds, 10-29.9 minutes, 5-9.9…
#capitalizing cities
ufo_sightings_clean_noNA_US <- ufo_sightings_clean_noNA_US %>%
mutate(
across(.cols = c(city_area),
.fns = str_to_title)
) %>%
glimpse()
## Rows: 63,561
## Columns: 7
## $ date_time <chr> "10/10/1949 20:30", "10/10/1956 21:00", "10/10/1960 2…
## $ city_area <chr> "San Marcos", "Edna", "Kaneohe", "Bristol", "Norwalk"…
## $ state <chr> "TX", "TX", "HI", "TN", "CT", "AL", "FL", "CA", "NC",…
## $ country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US",…
## $ ufo_shape <chr> "cylinder", "circle", "light", "sphere", "disk", "dis…
## $ encounter_length <dbl> 2700, 20, 900, 300, 1200, 180, 120, 300, 180, 1800, 1…
## $ encounter_range <fct> 30-59.9 minutes, 0-59 seconds, 10-29.9 minutes, 5-9.9…
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
glimpse(ufo_sightings_clean_noNA_US)
## Rows: 63,561
## Columns: 7
## $ date_time <chr> "10/10/1949 20:30", "10/10/1956 21:00", "10/10/1960 2…
## $ city_area <chr> "San Marcos", "Edna", "Kaneohe", "Bristol", "Norwalk"…
## $ state <chr> "TX", "TX", "HI", "TN", "CT", "AL", "FL", "CA", "NC",…
## $ country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US",…
## $ ufo_shape <chr> "cylinder", "circle", "light", "sphere", "disk", "dis…
## $ encounter_length <dbl> 2700, 20, 900, 300, 1200, 180, 120, 300, 180, 1800, 1…
## $ encounter_range <fct> 30-59.9 minutes, 0-59 seconds, 10-29.9 minutes, 5-9.9…
Are the values what you expected for the variables? Why or why not?
Yes, they are what I expected. I think the data looks cleaner this way, especially with the new encounter range variable and the states and country variables capitalized.
Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question.
#Counting the number of encounters in each encounter-length range across all states
range<-ufo_sightings_clean_noNA_US%>%
group_by(encounter_range)%>%
summarize(count = n())
range
## # A tibble: 6 × 2
## encounter_range count
## <fct> <int>
## 1 0-59 seconds 18399
## 2 1-4.9 minutes 16942
## 3 5-9.9 minutes 8777
## 4 10-29.9 minutes 11674
## 5 30-59.9 minutes 3621
## 6 +60 minutes 4148
states<-ufo_sightings_clean_noNA_US %>%
group_by(state) %>%
summarize(count = n())
#California has had the largest number of recorded UFO encounters
(states%>%
arrange(-count))
## # A tibble: 52 × 2
## state count
## <chr> <int>
## 1 CA 8684
## 2 FL 3754
## 3 WA 3708
## 4 TX 3399
## 5 NY 2915
## 6 IL 2447
## 7 AZ 2362
## 8 PA 2319
## 9 OH 2252
## 10 MI 1781
## # … with 42 more rows
#DC and Puerto Rico are part of the U.S. but are not states. I still included them in this analysis. They have the fewest recorded UFO encounters, followed by North Dakota and Delaware.
(states%>%
arrange(count))
## # A tibble: 52 × 2
## state count
## <chr> <int>
## 1 DC 7
## 2 PR 24
## 3 ND 123
## 4 DE 165
## 5 WY 169
## 6 SD 177
## 7 RI 224
## 8 VT 254
## 9 HI 257
## 10 AK 311
## # … with 42 more rows
What are your findings about the summary? Are they what you expected?
Yes, my hypothesis was correct that California has had the most reported UFO encounters. However, I am surprised at how many more encounters California has had compared to all the other states and areas of the U.S.
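To put a number on that gap, the counts printed in the summary table can be compared directly. The sketch below simply copies the top five counts and the total row count (63,561) from the output above; the variable names are illustrative.

```r
# Counts copied from the summary table above
top_counts <- c(CA = 8684, FL = 3754, WA = 3708, TX = 3399, NY = 2915)
total_rows <- 63561

# California relative to the runner-up, and as a share of all US rows
ca_vs_runner_up <- top_counts[["CA"]] / top_counts[["FL"]]
ca_share <- top_counts[["CA"]] / total_rows

round(ca_vs_runner_up, 2)  # 2.31
round(ca_share, 3)         # 0.137
```

In other words, California has roughly 2.3 times as many recorded encounters as Florida and accounts for about 14% of the US rows kept in this analysis.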
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
#Graph displaying total UFO Exposures based on State
ggplot(data = states) +
  aes(y = reorder(state, -count), x = count) +
  geom_col(fill = "darkslategrey") +
  theme_classic() +
  xlab("UFO Exposures") +
  ylab("States")
#combining UFO shapes into smaller categories (Other, Triangle, Round, Light)
ufo_shapes<-ufo_sightings_clean_noNA_US%>%
mutate(
shape = factor(case_when(
ufo_shape %in% c("unknown", "other", "formation", "changed", "changing", "cross", "hexagon", "cylinder", "cigar", "rectangle") ~ "Other",
ufo_shape %in% c("cone", "chevron", "delta", "triangle", "teardrop", "pyramid", "diamond") ~ "Triangle",
ufo_shape %in% c("sphere", "oval", "round", "egg", "circle", "disk", "crescent") ~ "Round",
ufo_shape %in% c("flash", "light", "flare", "fireball") ~ "Light",
)))
ufo_shapes
## # A tibble: 63,561 × 8
## date_time city_area state country ufo_shape encounter…¹ encou…² shape
## <chr> <chr> <chr> <chr> <chr> <dbl> <fct> <fct>
## 1 10/10/1949 20:30 San Marcos TX US cylinder 2700 30-59.… Other
## 2 10/10/1956 21:00 Edna TX US circle 20 0-59 s… Round
## 3 10/10/1960 20:00 Kaneohe HI US light 900 10-29.… Light
## 4 10/10/1961 19:00 Bristol TN US sphere 300 5-9.9 … Round
## 5 10/10/1965 23:45 Norwalk CT US disk 1200 10-29.… Round
## 6 10/10/1966 20:00 Pell City AL US disk 180 1-4.9 … Round
## 7 10/10/1966 21:00 Live Oak FL US disk 120 1-4.9 … Round
## 8 10/10/1968 13:00 Hawthorne CA US circle 300 5-9.9 … Round
## 9 10/10/1968 19:00 Brevard NC US fireball 180 1-4.9 … Light
## 10 10/10/1970 16:00 Bellmore NY US disk 1800 30-59.… Round
## # … with 63,551 more rows, and abbreviated variable names ¹encounter_length,
## # ²encounter_range
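Because `case_when()` returns NA for any value that matches none of its conditions, it can be worth checking whether some raw shapes fell through the recoding. The following base R sketch uses a named lookup vector on made-up values; "dome" is used purely as an example of a value missing from the lookup, and `shape_lookup` is a hypothetical, shortened mapping.

```r
# Made-up raw shapes; "dome" is deliberately absent from the lookup below
ufo_shape <- c("circle", "light", "triangle", "dome", "cigar", "dome")

# Shortened, illustrative mapping from raw shape to combined category
shape_lookup <- c(circle = "Round", light = "Light",
                  triangle = "Triangle", cigar = "Other")

# Indexing a named vector by a missing name yields NA
shape <- unname(shape_lookup[ufo_shape])

# Raw values that did not receive a category
unmatched <- sort(unique(ufo_shape[is.na(shape)]))
unmatched
```

On the real data, a similar check on the recoded column would show whether any shape needs to be added to one of the four categories.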
shapes<-ufo_shapes %>%
group_by(state) %>%
summarize(shape)
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
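Note that `summarize(shape)` keeps one row per sighting rather than collapsing each state to a summary. If the goal is the number of sightings per state and shape, a count is one option. Here is a minimal base R sketch on made-up data (`toy` is hypothetical); in dplyr, `count(state, shape)` would be the closest equivalent.

```r
# Toy data: five sightings across two states
toy <- data.frame(
  state = c("CA", "CA", "CA", "TX", "TX"),
  shape = c("Light", "Light", "Round", "Round", "Other"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then reshape to one row per state/shape combination
shape_counts <- as.data.frame(
  table(state = toy$state, shape = toy$shape),
  responseName = "count",
  stringsAsFactors = FALSE
)

# Drop combinations that never occur
shape_counts <- shape_counts[shape_counts$count > 0, ]
shape_counts
```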
#combining states into smaller areas of the U.S. (Northeast, Southwest, West, Southeast, Midwest)
shapes1<-ufo_shapes%>%
mutate(
Area = factor(case_when(
state %in% c("ME", "MA", "RI", "CT", "NH", "VT", "NY", "PA", "NJ", "DE", "MD", "DC") ~ "Northeast",
state %in% c("TX", "OK", "NM", "AZ") ~ "Southwest",
state %in% c("CO", "WY", "MT", "ID", "WA", "OR", "UT", "NV", "AK", "CA", "HI") ~ "West",
state %in% c("WV", "VA", "KY", "TN", "NC", "SC", "GA", "AL", "MS", "AR", "LA", "FL", "PR") ~ "Southeast",
state %in% c("OH", "IN", "MI", "IL", "MO", "WI", "MN", "IA", "KS", "NE", "SD", "ND") ~ "Midwest"
)))
shapes1
## # A tibble: 63,561 × 9
## date_time city_area state country ufo_s…¹ encou…² encou…³ shape Area
## <chr> <chr> <chr> <chr> <chr> <dbl> <fct> <fct> <fct>
## 1 10/10/1949 20:30 San Marcos TX US cylind… 2700 30-59.… Other Sout…
## 2 10/10/1956 21:00 Edna TX US circle 20 0-59 s… Round Sout…
## 3 10/10/1960 20:00 Kaneohe HI US light 900 10-29.… Light West
## 4 10/10/1961 19:00 Bristol TN US sphere 300 5-9.9 … Round Sout…
## 5 10/10/1965 23:45 Norwalk CT US disk 1200 10-29.… Round Nort…
## 6 10/10/1966 20:00 Pell City AL US disk 180 1-4.9 … Round Sout…
## 7 10/10/1966 21:00 Live Oak FL US disk 120 1-4.9 … Round Sout…
## 8 10/10/1968 13:00 Hawthorne CA US circle 300 5-9.9 … Round West
## 9 10/10/1968 19:00 Brevard NC US fireba… 180 1-4.9 … Light Sout…
## 10 10/10/1970 16:00 Bellmore NY US disk 1800 30-59.… Round Nort…
## # … with 63,551 more rows, and abbreviated variable names ¹ufo_shape,
## # ²encounter_length, ³encounter_range
#graph displaying UFO encounters based on shape given U.S. area in which the encounter was recorded
value<-shapes1$shape
ggplot(shapes1, aes(fill=Area, x=shape))+
geom_bar(position="stack")+
scale_fill_manual(values=c('deepskyblue2', 'cornflowerblue', 'cadetblue1', 'aquamarine3', 'deeppink4'))+ theme(axis.text.y = element_blank())
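Since the areas differ a lot in their total number of sightings, raw counts make it hard to see whether the mix of shapes differs by area. One option is to look at proportions within each area. This is a base R sketch on made-up data (`toy` is hypothetical, not the UFO table).

```r
# Toy data: four sightings in the West, two in the Midwest
toy <- data.frame(
  Area = c("West", "West", "West", "West", "Midwest", "Midwest"),
  shape = c("Light", "Light", "Round", "Other", "Round", "Round"),
  stringsAsFactors = FALSE
)

# Rows are areas, columns are shapes
counts <- table(toy$Area, toy$shape)

# margin = 1 divides each row by its total, so every area sums to 1
within_area <- prop.table(counts, margin = 1)
round(within_area, 2)
```

In ggplot2, putting Area on the x axis, filling by shape, and using `geom_bar(position = "fill")` should give a comparable proportional view.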
Summarize your research question and findings below.
California has reported the most UFOs, while North Dakota has reported the fewest of any state (only DC and Puerto Rico reported fewer). Additionally, most reports from all states were of some sort of lights, followed by round shapes.
Are your findings what you expected? Why or why not?
The findings are what I expected in terms of California having the most UFO encounters of all time. However, people report lights more often than the round shapes I hypothesized would be most common.