1 Getting Started

Getting Started with Geospatial Data Analysis in Environmental Health Using R

Date Modified: June 25, 2024

Authors: Mitchell Manware, Lara P. Clark, Kyle P. Messier

Key Terms: Environmental Health, Geospatial Data

Programming Language: R

1.1 Introduction

1.1.1 Motivation

Environmental health research relies on various types of data to accurately measure, model, and predict exposures. Environmental data are often spatial (related to the surface of the Earth), temporal (related to a specific time or period of time), or spatio-temporal (related to the surface of the Earth for a specific time or period of time). Here, the term geospatial will be used to refer to spatial and spatio-temporal data. These data are at the core of environmental health research, but the steps between identifying a geospatial data set or variable and using it to help answer a research question can be challenging.

1.1.2 Objectives

The objectives of this chapter are to:

  1. Introduce concepts and terminology used in the following tutorials for geospatial data and geospatial analysis methods.
  2. Describe the geospatial datasets and R packages used in the following tutorials.
  3. Provide a list of useful resources for getting started with R and for further exploration of geospatial data analysis methods in environmental health.

The following chapters in this unit will demonstrate how to use R to access, prepare, and analyze different types of geospatial data that are commonly used in environmental health applications. The tutorials will focus primarily on spatial data, but some aspects of temporal and spatio-temporal data will also be discussed.

1.2 Concepts and Terminology

1.2.1 Spatial Geometry

The spatial geometry of a geospatial dataset is an important consideration in data analysis pipelines. There are three main spatial geometry types: point, line, and area (i.e., polygon or grid). Points are represented by geographic coordinates (e.g., latitude and longitude pairs), lines by a series of connected points, and polygons by a series of connected points that completely enclose and define an area. In contrast to polygons, which can define irregular or non-uniform areas, grids define regular and uniform areas (e.g., such that each grid cell has the same area). Point, line, and polygon data are referred to as vector data, and grid data are referred to as raster data. For detailed descriptions of vector data, raster data, and the differences between them, see (1), (2), and (3), respectively.
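As a minimal sketch of these geometry types, the code below builds one small example of each using the sf and terra packages described later in this chapter. The coordinates, object names (monitor, route, plume, grid), and the choice of EPSG:4326 (longitude/latitude) are illustrative assumptions and are not taken from the tutorial datasets.

```r
library(sf)
library(terra)

# Point: a single longitude/latitude pair (hypothetical monitor location).
monitor <- st_sfc(st_point(c(-78.9, 36.0)), crs = 4326)

# Line: an ordered series of connected points (hypothetical route).
route <- st_sfc(
  st_linestring(rbind(c(-78.9, 36.0), c(-78.8, 36.1), c(-78.6, 36.1))),
  crs = 4326
)

# Polygon: connected points that enclose an area.
# The first and last coordinates must be identical to close the ring.
plume <- st_sfc(
  st_polygon(list(rbind(
    c(-79.0, 35.9), c(-78.5, 35.9), c(-78.5, 36.2),
    c(-79.0, 36.2), c(-79.0, 35.9)
  ))),
  crs = 4326
)

# Grid: a regular lattice of cells covering the same extent (raster data).
# Every cell here has the same size in degrees.
grid <- rast(
  nrows = 3, ncols = 5,
  xmin = -79.0, xmax = -78.5, ymin = 35.9, ymax = 36.2,
  crs = "EPSG:4326"
)
values(grid) <- seq_len(ncell(grid))

# Inspect the geometry type of each vector object and the grid resolution.
st_geometry_type(monitor)
st_geometry_type(route)
st_geometry_type(plume)
res(grid)
```

The three vector objects share a class (sfc) and differ only in geometry type, whereas the raster is defined by its extent and cell size rather than by explicit coordinates for each feature.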

The following table illustrates common examples of each spatial geometry type used in environmental health applications.

Spatial Geometry Types

| Type | Examples | Tutorials |
|---|---|---|
| Point (Vector) | Air pollution monitors, weather stations, patient geocoded addresses, healthcare facility coordinates | Point Data |
| Line (Vector) | Roads, commute routes | |
| Polygon (Vector) | Wildfire smoke plumes, census boundaries | Polygon Data |
| Grid (Raster) | Land cover imagery from satellites, meteorological model output, gridded population counts | Raster Data |

The tutorials linked in the table above demonstrate exploratory analyses with each spatial geometry data type.

1.2.2 Coordinate Reference Systems and Projections

Coordinate reference systems (CRS) are important for spatial analyses as they define how spatial data align with the Earth’s surface (4). Transforming (projecting) the data to a different CRS may be necessary when combining multiple datasets or creating visuals for particular areas of interest. It is important to note that transforming spatial data can cause distortions in its area, direction, distance, or shape (4). The direction and magnitude of these distortions vary depending on the chosen CRS, area of interest, and type of data (5). For guidance on selecting an appropriate CRS based on the data, area of interest, and analysis goals, see (6,7).
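The sketch below shows one way to inspect and transform a CRS with the sf package. The two points are hypothetical, and EPSG:5070 (NAD83 / Conus Albers, an equal-area projection for the contiguous United States) is only an illustrative target; the appropriate CRS depends on the data, area of interest, and analysis goals.

```r
library(sf)

# Two hypothetical points in longitude/latitude (WGS 84, EPSG:4326).
pts <- st_sfc(
  st_point(c(-78.9, 36.0)),
  st_point(c(-77.0, 38.9)),
  crs = 4326
)

# Inspect the CRS currently attached to the data.
st_crs(pts)$epsg

# Transform (project) the points to the illustrative projected CRS.
pts_albers <- st_transform(pts, 5070)

# Coordinates are now expressed in meters rather than degrees.
st_coordinates(pts_albers)

# Compare the distance between the points before and after transforming.
# The longitude/latitude distance is computed on the ellipsoid or sphere;
# the projected distance is planar, so the two values can differ.
d_geodesic <- st_distance(pts[1], pts[2])
d_planar <- st_distance(pts_albers[1], pts_albers[2])
d_geodesic
d_planar
```

Any gap between the two distances is an example of the distortion described above: an equal-area projection preserves area, not distance.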

1.3 Datasets

The tutorials in this unit use the following publicly available geospatial datasets:

| Data Provider | Dataset | Type |
|---|---|---|
| Environmental Protection Agency (EPA) | PM2.5 Daily Observations | Point |
| National Oceanic and Atmospheric Administration (NOAA) | Wildfire Smoke Plumes | Polygon |
| United States Census Bureau | United States Cartographic Boundary | Polygon |
| National Oceanic and Atmospheric Administration (NOAA) | Land Surface Temperature | Raster |

1.4 R Packages

The tutorials in this unit use the following R packages: dplyr (13), ggplot2 (11), ggpubr (12), sf (1), terra (9), tidyterra (10), and utils.

The following code installs and imports the packages used in this unit:

Installing and importing new packages may require R to restart.

vignette_packages <- c(
  "dplyr", "ggplot2", "ggpubr", "sf",
  "terra", "tidyterra", "utils"
)

# Install only the packages that are not already in the library.
# Package names are the row names of the installed.packages() matrix.
for (pkg in vignette_packages) {
  if (!(pkg %in% rownames(installed.packages()))) {
    install.packages(pkg)
  }
}
## Installing package into '/home/runner/work/_temp/Library'
## (as 'lib' is unspecified)
## also installing the dependencies 'rbibutils', 'Deriv', 'microbenchmark', 'Rdpack', 'numDeriv', 'doBy', 'SparseM', 'MatrixModels', 'minqa', 'nloptr', 'reformulas', 'carData', 'Formula', 'pbkrtest', 'quantreg', 'lme4', 'corrplot', 'car', 'ggrepel', 'ggsci', 'ggsignif', 'polynom', 'rstatix'
## Installing package into '/home/runner/work/_temp/Library'
## (as 'lib' is unspecified)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(ggpubr)
library(sf)
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
library(terra)
## terra 1.8.10
## 
## Attaching package: 'terra'
## The following object is masked from 'package:ggpubr':
## 
##     rotate
## The following object is masked from 'package:BiocGenerics':
## 
##     width
library(tidyterra)
## 
## Attaching package: 'tidyterra'
## The following object is masked from 'package:stats':
## 
##     filter
library(utils)

1.5 Resources

This section highlights resources for getting started with R, geospatial data analysis, and research methods for climate change and human health.

  • The BUSPH-HSPH Climate Change and Health Research Coordinating Center (CAFÉ) provides training and education materials for climate change and human health research in different formats for various types of users. The Climate CAFÉ Tutorials and Code Walkthroughs demonstrate geospatial data management and analysis in climate change and human health research using R. CAFÉ also provides a series of video tutorials demonstrating the use of geographic information systems (GIS) in environmental health and a list of educational materials on climate and health.

  • The inTelligence And Machine lEarning (TAME) Toolkit provides tutorials for data generation, management, and analysis in environmental health research using R. The TAME Toolkit Chapter 1 includes a guide for installing and getting started with R and an introduction to data science methods in R. The TAME Toolkit also includes tutorials with R code demonstrating geospatial data analysis methods in environmental health (e.g., Chapter 3.3).

  • The IPUMS DHS Climate Change and Health Research Hub provides tutorials with code in R demonstrating use of various climate and health datasets and analysis methods. IPUMS also provides a guide to installing and setting up R for use in climate change and health research.

  • The book Geocomputation with R provides resources for geospatial data analysis, visualization, and modeling with R. This book provides tutorials and examples from various disciplines that use geospatial data (e.g., transportation, ecology). This book covers introductory through advanced topics.

References

1.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009
2.
3.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. “Geographic Data in R.” In Geocomputation with R. Chapman and Hall/CRC. https://r.geocompx.org/spatial-class#spatial-class
4.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. “Coordinate Reference Systems.” In Geocomputation with R. Chapman and Hall/CRC. https://r.geocompx.org/spatial-class#crs-intro
5.
Steinwand, Daniel R, John A Hutchinson, and John P Snyder. 1995. “Map Projections for Global and Continental Data Sets and an Analysis of Pixel Distortion Caused by Reprojection.” Photogrammetric Engineering & Remote Sensing 61 (12): 1487–97
6.
Esri. 2023. Choose the Right Projection. Website. https://learn.arcgis.com/en/projects/choose-the-right-projection/
7.
United States Geological Survey (USGS). 2019. Map Projections. Website. https://pubs.usgs.gov/gip/70047422/report.pdf
8.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016
9.
Hijmans, Robert J. 2024. terra: Spatial Data Analysis. R Package Version 1.7-71. https://CRAN.R-project.org/package=terra
10.
Hernangómez, Diego. 2023. “Using the tidyverse with terra Objects: The tidyterra Package.” Journal of Open Source Software 8 (91): 5751. https://doi.org/10.21105/joss.05751
11.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org
12.
Kassambara, Alboukadel. 2023. ggpubr: ’ggplot2’ Based Publication Ready Plots. R Package Version 0.6.0. https://CRAN.R-project.org/package=ggpubr
13.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. dplyr: A Grammar of Data Manipulation. R Package Version 1.1.4. https://CRAN.R-project.org/package=dplyr