3 Polygon Data
Polygon Data Access, Preparation, and Exploratory Analysis in R
Date Modified: May 6, 2024
Authors: Mitchell Manware , Kyle P. Messier
Key Terms: Geospatial Data, Wildfire
Programming Language: R
3.1 Introduction
Wildfire smoke plume coverage data from the United States National Oceanic and Atmospheric Administration (NOAA) will be used to demonstrate using polygon data. This tutorial will cover polygon data with both the sf
(1) and terra
(9) packages separately, but the steps for accessing, downloading, and unzipping the data is the same for both packages.
This tutorial will demonstrate the following steps with polygon data in R
:
- Downloading data from a URL
- Importing data
- Checking data type, structure, and class
- Reclassifying data
- Computing summary and zonal statistics
- Plotting individual and multiple data sets
The exploratory analyses in this unit are designed for educational purposes only. The results of the following analyses are not peer-reviewed findings, nor are they based on any hypotheses.
3.2 Access, Download, and Unzip
The website URL where the NOAA wildfire smoke plume data exists is date-specific, meaning there is a unique URL for each daily data set. For the purpose of these exploratory analyses, wildfire smoke plume data from September 1, 2023 will be used.
Define three variables for day
, month
, and year
according to the date of interest.
The utils::download.file()
function downloads the file according to the defined URL and destination file.
url_noaa <- paste0(
"https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons",
"/Shapefile/",
year,
"/",
month,
"/hms_smoke",
year,
month,
day,
".zip"
)
destination_noaa <- paste0(
"/ YOUR FILE PATH /noaa_smoke",
year,
month,
day,
".zip"
)
download.file(
url_noaa,
destination_noaa
)
The file downloaded from the NOAA website is a .zip
file. Zip files need to be unzipped (decompressed) in order to access the data within. Unzip the NOAA wildfire smoke plume coverage file with utils::unzip()
.
Unzipping a .zip
file will decompress the contents within. Spatial data sets can be very large (i.e., > 1 GB), so check the size of the data before unzipping on your machine.
The numeric value size of each file is listed under Length
.
After inspecting the file sizes, unzip noaa_smoke20230901.zip
.
Inspecting the file with utils::unzip(list = TRUE)
returned the size of the file, but also the name of the data file of interest. The desired data file can also be identified with list.files()
.
Listing the contents of the unzipped file reveals four individual files. The data to be imported is stored in the hms_smoke20230901.shp
, but the other files contain important information for the .shp
file.
Deleting any of the supporting files (i.e., *.dbf
, *.prj
, or *.shx
) will disrupt the data import.
3.3 Data Preparation with sf
This section will focus on exploratory analyses with polygon data using the sf
package.
3.3.1 Import
Now that the contents of the zip file have been saved on your machine and the data file of interest has been identified, import the data with sf::st_read()
.
Although the supporting files are required to import a shapefile, only the file ending in .shp
needs to be imported
Importing hms_smoke20230901.shp
does not return a Warning:
message because the data set has native spatial features, and is therefore imported as an sf
object.
3.3.2 Inspect Structure
Inspect the structure of smoke_sf
to see its class, column names, column classes, and the first two (specified by vec.len = 2
) data points.
As mentioned previously, the smoke_sf
data set has native spatial features. These are reflected by the data set having classes of sf
and data.frame
, and the $geometry
feature.
3.3.3 Reclassify
The main parameter of interest in this data set is $Density
, which discretely categorizes each wildfire smoke plume as “Light”, “Medium”, or “Heavy”. Checking its class shows that $Density
is class character.
Nominal data, data without fixed order or rank system, can be stored as class character (i.e., State names). However, it is best to store ordinal data as class factor for conducting analyses in R.
Converting data from class character to class factor can be done with factor()
. The levels = c()
argument in the function specifies both the level names and the ranked order of the levels.
Check the class of $Density
again to ensure proper reclassification.
3.3.4 Coordinate Reference System and Projection
Check the coordinate reference system of an sf
object with sf::st_crs()
.
smoke_sf
has a native coordinate reference system which was imported during the sf::st_read()
step. The area of interest for these exploratory analyses is the conterminous United States, so we can transform smoke_sf
to the Albers Equal Area projected coordinate system (EPSG code: 5070).
3.4 Exploratory Analysis with sf
3.4.1 Plot (Single)
With the data prepared, plot the wildfire smoke plume polygons with ggplot2::ggplot()
.
Now that the parameters of interest and coordinate reference system have been prepared, create a plot with ggplot2::ggplot()
. Identifying the data set to be plotted within the geom_sf()
argument informs the function that the data is an sf
object.
ggplot() +
geom_sf(
data = smoke_sf,
aes(fill = Density)
) +
scale_fill_manual("Smoke Density",
values = c("lightgreen", "lightgoldenrod", "tomato")
) +
ggtitle("Wildfire Smoke Plumes") +
theme_pubr(legend = "bottom") +
theme(plot.title = element_text(hjust = 0.5)) +
grids()
The wildfire smoke plume polygons are clearly visible and colored according to their individual smoke density classification. This plot, however, is difficult to interpret for two reasons. First, there are multiple polygons for each smoke density classification. Multiple borders and overlapping polygons with the same smoke density type can be confusing. To make the polygons more clear, individual polygons for each smoke density classification can be combined.
For the purposes of these exploratory analyses, the satellite travelling direction and time of collection will be ignored.
3.4.2 Union
Individual polygons can be unioned (combined) into one multi-part polygon with sf::st_union
. The group_by(Density)
argument specifies that the polygons should be combined based on the value stored in $Density
. Adding the Date = paste0(...
argument within the dplyr::summarise()
function creates a parameter to store the date based on the year, month, and day of the data.
smoke_sf_density <-
smoke_sf %>%
group_by(Density) %>%
summarise(
geometry = st_union(geometry),
Date = paste0(
year,
month,
day
)
)
The resulting data set contains three multi-polygons, a column for the smoke plume classification, and a column for the date.
Creating a new plot with smoke_sf_density
.
ggplot() +
geom_sf(
data = smoke_sf_density,
aes(fill = Density)
) +
scale_fill_manual("Smoke Density",
values = c("lightgreen", "lightgoldenrod", "tomato")
) +
ggtitle("Wildfire Smoke Plumes (unioned)") +
theme_pubr(legend = "bottom") +
theme(plot.title = element_text(hjust = 0.5)) +
grids()
The plot is still difficult to interpret because it lacks geospatial context. The grid lines provide latitude and longitude references, but physical or geopolitical boundaries can help show where the wildfire smoke plumes are relative to the study area of interest. To provide geospatial context to the wildfire smoke plume polygons, we can add the United States state boundary polygons to the plot.
3.4.3 Plot (Multiple)
The steps taken to access, download, unzip, and import the United States state boundary data are the same as those taken for the wildfire smoke plume coverage data. Refer to Sections Access, Download, and Unzip and Import for detailed descriptions.
url_states <-
"https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_500k.zip"
destination_states <- "/ YOUR FILE PATH /states.zip"
download.file(
url_states,
destination_states
)
Inspect the structure of states_sf
.
For the purpose of these exploratory analyses, only the conterminous United States (CONUS) will be used. Subset states_sf
to remove Alaska, Hawaii, and United States territories.
remove <- c(
"Alaska",
"Hawaii",
"Puerto Rico",
"United States Virgin Islands",
"Commonwealth of the Northern Mariana Islands",
"Guam",
"American Samoa"
)
conus_sf <- subset(
states_sf,
!NAME %in% remove
)
Check the coordinate reference system.
When analyzing multiple spatial data sets together, all data sets must have the same coordinate reference system or projected coordinate system.
Transform conus_sf
to match the projected coordinate system of the smoke_sf_density
data set.
Plot the conterminous United States state boundaries.
ggplot() +
geom_sf(data = conus_sf) +
ggtitle("Coterminous United States' State Boundaries") +
theme_pubr() +
theme(plot.title = element_text(hjust = 0.5)) +
grids()
With the wildfire smoke plume and conterminous United States polygons imported and prepared, ensure that they have the same coordinate reference system.
Create a plot which shows the distribution of wildfire smoke plumes over the conterminous United States state boundaries.
ggplot() +
geom_sf(
data = smoke_sf_density,
aes(fill = Density)
) +
scale_fill_manual("Smoke Density",
values = c("lightgreen", "lightgoldenrod", "tomato")
) +
geom_sf(
data = conus_sf,
fill = "transparent"
) +
ggtitle("Wildfire Smoke Plumes (with states)") +
theme_pubr(legend = "bottom") +
theme(plot.title = element_text(hjust = 0.5)) +
grids()
This plot provides important geospatial context for understanding where the wildfire smoke plumes are in relation to the study area of interest.
3.4.4 Crop
The sf::st_crop()
function can be used to reduce the extent of a set of polygons to a specific rectangle, typically the bounding box of another spatial data set. In this example we can crop the smoke_sf_density
polygons to the bounding box surrounding the conterminous United States.
Plot the cropped wildfire smoke plume polygons and the conterminous United States state boundaries.
ggplot() +
geom_sf(
data = smoke_sf_crop,
aes(fill = Density)
) +
scale_fill_manual("Smoke Density",
values = c("lightgreen", "lightgoldenrod", "tomato")
) +
geom_sf(
data = conus_sf,
fill = "transparent"
) +
ggtitle("Wildfire Smoke Plumes (cropped)") +
theme_pubr(legend = "bottom") +
theme(plot.title = element_text(hjust = 0.5)) +
grids()
3.5 Data Preparation with terra
This section will focus on exploratory analyses with polygon data using the terra
package (9).
3.5.1 Import
Now that the contents of the zip files have been saved on your machine and the data files of interest have been identified, import both the wildfire smoke plume coverage data and the United States state boundary data with terra::vect()
.
See Sections Access, Download, and Unzip and Plot (Multiple) for obtaining the wildfire smoke plume coverage and United States state boundary data sets, respectively.
3.5.2 Inspect Structure
Inspect the structures of smoke_t
and states_t
to see their classes, column names, and column classes.
Both smoke_t
and states_t
have native spatial features. These are reflected by the type of spatial data in geometry:
, and the spatial attributes extent:
and coord. ref.:
3.5.3 Reclassify
The main parameter of interest in this data set is $Density
, which discretely categorizes each wildfire smoke plume as “Light”, “Medium”, or “Heavy”. Checking its class shows that $Density
is class character.
Nominal data, data without fixed order or rank system, can be stored as class character (i.e., State names). However, it is best to store ordinal data as class factor for conducting analyses in R.
Converting data from class character to class factor can be done with factor()
. The levels = c()
argument in the function specifies both the level names and the ranked order of the levels.
Check the class of $Density
again to ensure proper reclassification.
For the purpose of these exploratory analyses, only the conterminous United States (CONUS) will be used. Subset states_t
to remove Alaska, Hawaii, and the United States territories.
3.5.4 Coordinate Reference System and Projection
Check the coordinate reference systems of SpatVector
objects with terra::crs()
.
Both data sets have native coordinate reference systems which were imported during the terra::vect()
step. The two data sets, however, have different coordinate reference systems from each other. The area of interest for these exploratory analyses is the conterminous United States, so we can project smoke_t
and conus_t
to the Albers Equal Area projected coordinate system (EPSG code: 5070).
Although both data sets were transformed to the same projected coordinate system, it is important to ensure that all data sets have the same coordinate reference system or projected coordinate system.
3.6 Exploratory Analysis with terra
3.6.1 Plot (Multiple)
Plot both data sets together in one plot with ggplot2::ggplot()
.
Now that the parameters of interest and coordinate reference systems have been prepared, create a plot with ggplot2::ggplot()
. Identifying the data sets to be plotted within the tidyterra::geom_spatvector()
arguments informs the function that the data are SpatVector
objects (10).
ggplot() +
geom_spatvector(
data = smoke_t,
aes(fill = Density)
) +
scale_fill_manual("Smoke Density",
values = c("lightgreen", "lightgoldenrod", "tomato")
) +
geom_spatvector(
data = conus_t,
fill = "transparent"
) +
ggtitle("Wildfire Smoke Plumes (with states)") +
theme_pubr(legend = "bottom") +
theme(plot.title = element_text(hjust = 0.5)) +
grids()
The wildfire smoke plume polygons are clearly visible and colored according to their individual smoke density classification. The plot, however, is difficult to interpret because there are multiple polygons for each smoke density classification. Multiple borders and overlapping polygons with the same smoke density type can be confusing. To make the polygons more clear, individual polygons for each smoke density classification can be combined.
For the purposes of these exploratory analyses, the satellite travelling direction and time of collection will be ignored.
3.6.2 Aggregate
Individual polygons an be aggregated (combined) into one multi-part polygon with terra::aggregate()
. The by = "Density"
argument specifies that the polygons should be combined based on the value stored in $Density
.
Aggregating the polygons based on the values stored in the $Density
column can result in the other columns containing NA
values. To remove these columns, subset smoke_t_density
to remove $Satellite
, $Start
, and $End
.
The resulting data set contains three multi-polygons, a column for the smoke plume classification, and a count of the number of individual polygons that were aggregated to create the multi-polygon. This last column, $agg_n
is automatically calculated by the terra::aggregate()
function.
Create a new plot with smoke_t_density
.
ggplot() +
geom_spatvector(
data = smoke_t_density,
aes(fill = Density)
) +
scale_fill_manual("Smoke Density",
values = c("lightgreen", "lightgoldenrod", "tomato")
) +
geom_spatvector(
data = conus_t,
fill = "transparent"
) +
ggtitle("Wildfire Smoke Plumes (aggregated)") +
theme_pubr(legend = "bottom") +
theme(plot.title = element_text(hjust = 0.5)) +
grids()
3.6.3 Crop
The terra::crop()
function can be used to reduce SpatVector
to an area determined by another SpatVector
. In this example, we can crop the smoke_t_density
polygons to the conterminous United States state boundaries.
Plot the cropped wildfire smoke plume polygons and the conterminous United States state boundaries.
ggplot() +
geom_spatvector(
data = smoke_t_crop,
aes(fill = Density)
) +
scale_fill_manual("Smoke Density",
values = c("lightgreen", "lightgoldenrod", "tomato")
) +
geom_spatvector(
data = conus_t,
fill = "transparent"
) +
ggtitle("Wildfire Smoke Plumes (cropped)") +
theme_pubr(legend = "bottom") +
theme(
plot.title = element_text(hjust = 0.5),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank()
)
3.6.4 Zonal Statistics
Looking closely at the previous plot, it is clear that wildfire smoke plumes cover each state differently. The terra
package can be used to identify which states are covered by each classification of wildfire smoke plumes.
The terra::relate()
function can be used to identify spatial relationships between two SpatVector
objects. The relation = "intersects"
argument logically identifies if any portion of each state is or is not covered by each of the three wildfire smoke plume classification multi-polygons.
The output of terra::relate()
is a wide matrix. The nested data.frame()
and t()
wrappers convert the output from a wide matrix to a long data frame, which is required to combine the results with the conus_t
data set.
Set the column names of conus_smoke
to match the smoke density classifications.
The order of the columns in conus_smoke
are based on the ordered factor levels in smoke_t_density$Density
(see Section Reclassify).
Combine the wildfire smoke plume indicator data frame with the the conterminous United States state boundaries data.
The conus_t
data set now contains separate columns indicating the presence or absence of “Light”, “Medium”, and “Heavy” wildfire smoke plumes for each conterminous state.
3.6.5 Plot (for
Loop)
A for
loop can be used to create indicator plots for each wildfire smoke plume classification. The layout of a for
loop can look complicated, but it simply applies the same set of functions to a given list of inputs.
The list of inputs must first be created. As the goal is to plot each of the wildfire smoke plume density classifications, create a character vector of the three classification names.
This “list of inputs” must first be created. Store the three wildfire smoke plume classifications in a vector of class character.
Create a for
loop that creates a plot for each wildfire smoke plume density stored within dens_c
.
Code line 1 tells the for
loop to apply the following functions to each observation in dens_c
.
Code lines 3 through 9 define the plotting colors based on the wildfire smoke plume classification (dens_c[d]
). As in previous plots, “Light” smoke plumes will be colored green, “Medium” smoke plumes will be covered yellow, and “Heavy” smoke plumes will be colored red.
Code lines 12 through 32 create the plot based on the wildfire smoke plume classification (dens_c[d]
), and previously defined plotting colors (color_values
).
for (d in seq_along(dens_c)) {
# define color palette based on smoke density
if (dens_c[d] == "Light") {
color_values <- c("lightgrey", "lightgreen")
} else if (dens_c[d] == "Medium") {
color_values <- c("lightgrey", "lightgoldenrod")
} else if (dens_c[d] == "Heavy") {
color_values <- c("lightgrey", "tomato")
}
# create plot
print(
ggplot() +
geom_spatvector(
data = conus_t,
aes_string(fill = dens_c[d])
) +
scale_fill_manual(
paste0(
dens_c[d],
" Smoke Plume Coverage Present"
),
values = color_values
) +
theme_pubr(legend = "bottom") +
theme(
plot.title = element_text(hjust = 0.5),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank()
)
)
}