download_data
Mitchell Manware
2024-07-09
Source: vignettes/download_functions.Rmd
Motivation
The download_data
function was developed to improve
researchers’ access to publicly available environmental data. Although
the data are already available online, using a web browser to manually
download hundreds or thousands of data files is slow, arduous, and not
efficiently repeatable. Additionally, as users may not be familiar with
creating download recipes in Bash (Unix shell),
download_data
allows researchers to download data directly
with R
, a common coding language in the field of
environmental health research. Finally, function-izing data downloads is
useful for repeated code or automated analysis pipelines.
download_data
download_data
accesses and downloads environmental
datasets, collections, and variables from a variety of sources. This
wrapper function calls source-specific data download functions, each
utilizing a unique combination of input parameters, host URL, naming
convention, and data formats.
It is important to note that download_data
calls a
source-specific function based on the dataset_name
parameter. Using the source-specific function directly will return the
same data if the parameters are the same, but the error
messages produced by each differ slightly.
Parameters
User-defined parameters differ based on the data source. Required
parameters for each source can be checked with
names(formals())
.
names(formals(download_hms))
#> [1] "data_format" "date" "directory_to_save"
#> [4] "acknowledgement" "download" "remove_command"
#> [7] "unzip" "remove_zip" "hash"
names(formals(download_narr))
#> [1] "variables" "year" "directory_to_save"
#> [4] "acknowledgement" "download" "remove_command"
#> [7] "hash"
The two functions have different required parameters because
download_hms
uses a daily temporal resolution while
download_narr
uses yearly, but they share some common,
standard parameters.
Standard parameters
Four parameters are included in all of the data download functions.
Parameter | Type | Description |
---|---|---|
directory_to_save | Character | Directory in which to save the downloaded data. Default = ‘./input/DATASET_NAME/’. |
acknowledgement | Logical | User must acknowledge that downloaded geospatial data can be very large and may consume substantial machine storage and memory. |
download | Logical | Run or skip the data download. Utilized primarily for unit tests (see Unit Tests). |
remove_command | Logical | Remove or retain the text file containing the generated download commands. Utilized primarily for unit tests (see Unit Tests and 4. Initiate “…commands.txt”). |
Additionally, the dataset_name
parameter must be
specified when using download_data
, but is assumed when
using a source-specific download function.
Function Structure
Although each source-specific download function is unique, they all
follow the same general structure. The following chunks of code have
been adopted from download_hms
to demonstrate the
functions’ structure.
1. Clean parameters
2. Generate download URLs
3. Generate download file names
4. Initiate “…commands.txt”
5. Concatenate download commands
6. Finalize “…commands.txt”
7. Run commands in “…commands.txt”
8. Zip files (if applicable)
1. Clean parameters
Cleaning the user-defined parameters is highly dependent on the
parameters themselves and the desired URL to be created. A common
parameter cleaning step is creating a date-time sequence based on a
given temporal range and required format, in this case
YYYYMMDD
.
# user defined parameters
dates <- c("2023-12-28", "2024-01-02")
date_sequence <- seq(
as.Date(dates[1], format = "%Y-%m-%d"),
as.Date(dates[2], format = "%Y-%m-%d"),
"day"
)
date_sequence <- gsub("-", "", as.character(date_sequence))
date_sequence
#> [1] "20231228" "20231229" "20231230" "20231231" "20240101" "20240102"
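The same sequence can be produced more directly with format(), which applies the YYYYMMDD pattern without a separate gsub() step. This is an equivalent alternative, not the package's implementation.

```r
# equivalent alternative: format the Date sequence directly as YYYYMMDD
dates <- c("2023-12-28", "2024-01-02")
date_sequence <- format(
  seq(as.Date(dates[1]), as.Date(dates[2]), by = "day"),
  "%Y%m%d"
)
date_sequence
#> [1] "20231228" "20231229" "20231230" "20231231" "20240101" "20240102"
```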
2. Generate download URLs
The URL base and pattern are identified by manually inspecting the
download link on the source-specific web page. download_hms
utilizes the year, month, date, and data format to generate the download
URL.
# user defined parameters
data_format <- "Shapefile"
suffix <- ".zip"
urls <- NULL
for (d in seq_along(date_sequence)) {
year <- substr(date_sequence[d], 1, 4)
month <- substr(date_sequence[d], 5, 6)
base <- "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/"
url <- paste0(
base,
data_format,
"/",
year,
"/",
month,
"/hms_smoke",
date_sequence[d],
suffix
)
urls <- c(urls, url)
}
urls
#> [1] "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2023/12/hms_smoke20231228.zip"
#> [2] "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2023/12/hms_smoke20231229.zip"
#> [3] "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2023/12/hms_smoke20231230.zip"
#> [4] "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2023/12/hms_smoke20231231.zip"
#> [5] "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2024/01/hms_smoke20240101.zip"
#> [6] "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2024/01/hms_smoke20240102.zip"
A download URL is created for each date in date_sequence
based on the fixed pattern.
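Because paste0() and substr() are vectorized, the same URLs can also be built without an explicit loop. A sketch of this alternative, restating date_sequence so the chunk stands alone:

```r
date_sequence <- c("20231228", "20231229", "20231230",
                   "20231231", "20240101", "20240102")
base <- "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/"
# substr() extracts the year and month from every element at once
urls <- paste0(
  base, "Shapefile/",
  substr(date_sequence, 1, 4), "/",
  substr(date_sequence, 5, 6),
  "/hms_smoke", date_sequence, ".zip"
)
urls[1]
#> [1] "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2023/12/hms_smoke20231228.zip"
```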
3. Generate download file names
The generation of download file names also follows a fixed pattern, typically a combination of the user-defined download directory, dataset name, spatiotemporal characteristic, data type, and, if applicable, specific variable name. Unlike the download URLs, the download file names can be defined in any way by the writer of the function, but using the previously defined characteristics is useful for identification.
# user defined parameters
directory_to_download <- "./data/"
download_file_names <- NULL
for (d in seq_along(date_sequence)) {
download_file_name <- paste0(
directory_to_download,
"hms_smoke_",
data_format,
"_",
date_sequence[d],
suffix
)
download_file_names <- c(download_file_names, download_file_name)
}
download_file_names
#> [1] "./data/hms_smoke_Shapefile_20231228.zip"
#> [2] "./data/hms_smoke_Shapefile_20231229.zip"
#> [3] "./data/hms_smoke_Shapefile_20231230.zip"
#> [4] "./data/hms_smoke_Shapefile_20231231.zip"
#> [5] "./data/hms_smoke_Shapefile_20240101.zip"
#> [6] "./data/hms_smoke_Shapefile_20240102.zip"
A download file name is created for each date in date_sequence
based on the fixed pattern.
4. Initiate “…commands.txt”
An important aspect of the data download function is its
sink...cat...sink
structure. Rather than using the
utils::download.file
function, a text file is created to
store all of the download commands generated from the URLs and file
names.
This structure is utilized for several reasons:
1. Consistent structure for all of the source-specific download functions.
2. The download.file function cannot accept vectors of URLs and destination files for downloading, and an additional for loop to download data would increase function complexity and may reduce performance.
3. Writing the commands as Bash (Unix shell) script allows for specific arguments and flags.
4. Storing the download URLs without immediately running the download allows for unit testing and URL checking (more on this in Unit Tests).
The text file containing the download commands is named based on the dataset, temporal range, and data transfer method.
commands_txt <- paste0(
directory_to_download,
"hms_smoke_",
head(date_sequence, n = 1),
"_",
tail(date_sequence, n = 1),
"_curl_commands.txt"
)
Create and sink the text file.
sink(commands_txt)
5. Concatenate download commands
The Linux-based download commands are written according to the data transfer method, download URL, download file name, and additional arguments. Which additional arguments are included, and their order, depend on the data transfer method and URL type.
For more information on curl
and wget
, the
two data transfer methods utilized by the data download functions, see
the curl.1 man page and
GNU Wget
1.21.1-dirty Manual (latest version as of January 8, 2024).
The cat()
function will store each of the download
commands written in the for
loop to the previously sunk
commands text file (commands_txt
).
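A minimal, self-contained sketch of this step, using hypothetical example vectors in place of the urls and download_file_names vectors from steps 2 and 3 (the example URLs are made up; the curl -s -o ... --url ... pattern matches the download_hms source shown at the end of this vignette):

```r
# hypothetical example vectors standing in for `urls` and
# `download_file_names` from steps 2 and 3
example_urls <- c(
  "https://example.com/hms_smoke20231228.zip",
  "https://example.com/hms_smoke20231229.zip"
)
example_files <- c(
  "./data/hms_smoke_Shapefile_20231228.zip",
  "./data/hms_smoke_Shapefile_20231229.zip"
)
# with a sink() active, each cat() call appends one curl command to the
# commands text file instead of printing to the console
for (d in seq_along(example_urls)) {
  cat(paste0(
    "curl -s -o ", example_files[d],
    " --url ", example_urls[d], "\n"
  ))
}
#> curl -s -o ./data/hms_smoke_Shapefile_20231228.zip --url https://example.com/hms_smoke20231228.zip
#> curl -s -o ./data/hms_smoke_Shapefile_20231229.zip --url https://example.com/hms_smoke20231229.zip
```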
6. Finalize “…commands.txt”
After the download commands have been concatenated to the commands
text file, a second sink
command is run to finalize the
file and stop the appending of R output.
sink()
7. Run commands in “…commands.txt”
A “system command” must be created to run all of the download
commands stored in the commands text file. In Bash, the
.
(dot) command executes all of the commands within a given
script; in this case, all of the commands within the
commands text file.
system_command <- paste0(
". ",
commands_txt,
"\n"
)
system_command
#> [1] ". ./data/hms_smoke_20231228_20240102_curl_commands.txt\n"
Running the system_command
deploys an “auxiliary”
function, download_run
, a function created to reduce
repeated code across the source-specific download functions. The
function takes two parameters, system_command
, which
indicates the command to be run, and download
, a
user-defined logical parameter.
download_run <- function(
download = FALSE,
system_command = NULL) {
if (download == TRUE) {
cat(paste0("Downloading requested files...\n"))
system(command = system_command)
cat(paste0("Requested files have been downloaded.\n"))
} else {
cat(paste0("Skipping data download.\n"))
return(NULL)
}
}
The data download is initiated by running download_run
with the system command identified and download = TRUE
.
download_run(
download = TRUE,
system_command = system_command
)
Checking the download directory shows that all of the requested files have been downloaded.
list.files(path = directory_to_download)
#> [1] "hms_smoke_Shapefile_20231228.zip" "hms_smoke_Shapefile_20231229.zip"
#> [3] "hms_smoke_Shapefile_20231230.zip" "hms_smoke_Shapefile_20231231.zip"
#> [5] "hms_smoke_Shapefile_20240101.zip" "hms_smoke_Shapefile_20240102.zip"
8. Zip files (if applicable)
All of the source-specific data download functions follow this
general pattern, but those functions which download zip files require
additional steps to inflate and remove the downloaded zip files, if
desired. Each of these steps is performed by a helper function,
controlled by the user-defined unzip
and
remove_zip
parameters in download_data
.
download_unzip
inflates zip files if
unzip = TRUE
, and skips inflation if
unzip = FALSE
.
download_unzip <-
function(file_name,
directory_to_unzip,
unzip = TRUE) {
if (!unzip) {
cat(paste0("Downloaded files will not be unzipped.\n"))
return(NULL)
}
cat(paste0("Unzipping files...\n"))
unzip(file_name,
exdir = directory_to_unzip
)
cat(paste0(
"Files unzipped and saved in ",
directory_to_unzip,
".\n"
))
}
download_remove_zips
removes the downloaded zip files if
remove = TRUE
, and skips removal if
remove = FALSE
.
download_remove_zips <-
function(remove = FALSE,
download_name) {
if (remove) {
cat(paste0("Removing download files...\n"))
file.remove(download_name)
cat(paste0("Download files removed.\n"))
}
}
For this demonstration we will unzip (inflate) the downloaded zip files but we will not delete them.
for (f in seq_along(download_file_names)) {
download_unzip(
file_name = download_file_names[f],
directory_to_unzip = directory_to_download,
unzip = TRUE
)
}
download_remove_zips(
download_name = download_file_names,
remove = FALSE
)
#> Unzipping files...
#> Files unzipped and saved in ./data/.
#> Unzipping files...
#> Files unzipped and saved in ./data/.
#> Unzipping files...
#> Files unzipped and saved in ./data/.
#> Unzipping files...
#> Files unzipped and saved in ./data/.
#> Unzipping files...
#> Files unzipped and saved in ./data/.
#> Unzipping files...
#> Files unzipped and saved in ./data/.
Listing the files again shows that the contents of the zip files have been inflated and the zip files have been retained.
list.files(path = directory_to_download)
#> [1] "hms_smoke_Shapefile_20231228.zip" "hms_smoke_Shapefile_20231229.zip"
#> [3] "hms_smoke_Shapefile_20231230.zip" "hms_smoke_Shapefile_20231231.zip"
#> [5] "hms_smoke_Shapefile_20240101.zip" "hms_smoke_Shapefile_20240102.zip"
#> [7] "hms_smoke20231228.dbf" "hms_smoke20231228.prj"
#> [9] "hms_smoke20231228.shp" "hms_smoke20231228.shx"
#> [11] "hms_smoke20231229.dbf" "hms_smoke20231229.prj"
#> [13] "hms_smoke20231229.shp" "hms_smoke20231229.shx"
#> [15] "hms_smoke20231230.dbf" "hms_smoke20231230.prj"
#> [17] "hms_smoke20231230.shp" "hms_smoke20231230.shx"
#> [19] "hms_smoke20231231.dbf" "hms_smoke20231231.prj"
#> [21] "hms_smoke20231231.shp" "hms_smoke20231231.shx"
#> [23] "hms_smoke20240101.dbf" "hms_smoke20240101.prj"
#> [25] "hms_smoke20240101.shp" "hms_smoke20240101.shx"
#> [27] "hms_smoke20240102.dbf" "hms_smoke20240102.prj"
#> [29] "hms_smoke20240102.shp" "hms_smoke20240102.shx"
This completes the general structure shared by the data download functions.
Unit Tests
The previous outline successfully cleaned parameters, generated URLs, and downloaded data, but how can we be sure that it will continue to work with different temporal ranges and data types? To this end, unit tests have been implemented to ensure that each data download function runs properly and that URLs produced by 2. Generate download URLs are valid and accessible. Like the download functions, the unit tests rely on “helper” functions to reduce repeated code across the tests.
Helper functions
read_commands
imports the commands text file and
converts the data frame to a vector.
read_commands <- function(
commands_path = commands_path) {
commands <- utils::read.csv(commands_path, header = FALSE)
commands <- commands[seq_len(nrow(commands)), ]
return(commands)
}
extract_urls
extracts each download URL from the vector
of commands. The position
of the URL within the download
command is determined in 5.
Concatenate download commands.
# function to extract URLs from vector
extract_urls <- function(
commands = commands,
position = NULL) {
if (is.null(position)) {
cat(paste0("URL position in command is not defined.\n"))
return(NULL)
}
url_list <- NULL
for (c in seq_along(commands)) {
url <- stringr::str_split_i(commands[c], " ", position)
url_list <- c(url_list, url)
}
return(url_list)
}
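The same extraction can be sketched in base R with strsplit(), which is a convenient way to verify the position argument interactively (the example commands below are made up):

```r
commands <- c(
  "curl -s -o ./data/a.zip --url https://example.com/a.zip",
  "curl -s -o ./data/b.zip --url https://example.com/b.zip"
)
# splitting on spaces: curl(1) -s(2) -o(3) destfile(4) --url(5) url(6)
sapply(strsplit(commands, " "), `[[`, 6)
#> [1] "https://example.com/a.zip" "https://example.com/b.zip"
```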
check_url_status
is the most important of the download
test “helper” functions. This function utilizes httr::HEAD
and httr::GET
to check the HTTP response status of a given
URL. The desired HTTP response status is 200, which means the URL is
valid and accessible. check_url_status
returns a logical
value to indicate whether the URL returns HTTP status 200
(TRUE
) or not (FALSE
). For more information on
HTTP statuses, see HTTP
response status codes.
check_url_status <- function(
url,
method = "HEAD") {
http_status_ok <- 200
if (method == "HEAD") {
hd <- httr::HEAD(url)
} else if (method == "GET") {
hd <- httr::GET(url)
}
status <- hd$status_code
return(status == http_status_ok)
}
check_urls
applies check_url_status
to a
random sample of URLs extracted by extract_urls
. The sample
size will vary based on the dataset and spatio-temporal parameters being
tested. The function returns a logical vector containing the output from
check_url_status
.
check_urls <- function(
urls = urls,
size = NULL,
method = "HEAD") {
if (is.null(size)) {
cat(paste0("URL sample size is not defined.\n"))
return(NULL)
}
if (length(urls) < size) {
size <- length(urls)
}
url_sample <- sample(urls, size, replace = FALSE)
url_status <- sapply(url_sample,
check_url_status,
method = method
)
return(url_status)
}
testthat
To demonstrate a test in action, test the URLs generated by
download_data
for the NOAA HMS Smoke dataset.
For more information see testthat.
library(testthat)
testthat::test_that(
"Valid dates return HTTP response status = 200.",
{
# parameters
test_start <- "2023-12-28"
test_end <- "2024-01-02"
test_directory <- "./data/"
# download
download_data(
dataset_name = "hms",
date = c(test_start, test_end),
data_format = "Shapefile",
directory_to_save = test_directory,
acknowledgement = TRUE,
download = FALSE,
remove_command = FALSE,
unzip = FALSE,
remove_zip = FALSE
)
commands_path <- paste0(
test_directory,
"hms_smoke_",
gsub("-", "", test_start),
"_",
gsub("-", "", test_end),
"_curl_commands.txt"
)
# helpers
commands <- read_commands(commands_path = commands_path)
urls <- extract_urls(commands = commands, position = 6)
url_status <- check_urls(urls = urls, size = 6, method = "HEAD")
# test for true
expect_true(all(url_status))
}
)
#> Test passed 🎊
Although the testthat::test_that(...)
chunk contains 32
lines of code, the unit test is performed by
expect_true(all(url_status))
. In words, this line is
expecting (expect_true
) that all (all
) of the
sampled URLs return HTTP response status 200 (url_status
).
Since this expectation was met, the test passed!
For an alternate example, we can use a start and end date that are
known to not have data. As the URLs associated with these dates do not
exist, we expect the function will fail. This test utilizes
expect_error()
because the download_data
wrapper function returns an error message if the underlying
source-specific download function returns an error.
testthat::test_that(
"Invalid dates cause function to fail.",
{
# parameters
test_start <- "1800-01-01"
test_end <- "1800-01-02"
test_directory <- "../inst/extdata/"
# test for error
testthat::expect_error(
download_data(
dataset_name = "hms",
date = c(test_start, test_end),
data_format = "Shapefile",
directory_to_download = test_directory,
directory_to_save = test_directory,
acknowledgement = TRUE,
download = FALSE,
remove_command = FALSE,
unzip = FALSE,
remove_zip = FALSE
)
)
}
)
#> Test passed 🎊
If we directly use the download_hms
function, we likewise expect
and receive an error.
testthat::test_that(
"Invalid dates cause function to fail.",
{
# parameters
test_start <- "1800-01-01"
test_end <- "1800-01-02"
test_directory <- "../inst/extdata/"
# test for error
testthat::expect_error(
download_hms(
date = c(test_start, test_end),
data_format = "Shapefile",
directory_to_download = test_directory,
directory_to_save = test_directory,
acknowledgement = TRUE,
download = FALSE,
remove_command = FALSE,
unzip = FALSE,
remove_zip = FALSE
)
)
}
)
#> Test passed 🌈
As expected, the test passes because the NOAA HMS Smoke dataset does not contain data for January 1-2, 1800.
These unit tests are just two of many implemented on
download_data
and the accompanying source-specific download
functions, but they demonstrate how unit testing helps build stable
code.
Download Example
With the function structure outlined and the unit tests in place, we can now perform a data download. To begin, check the parameters required by the source-specific data download function.
names(formals(download_hms))
#> [1] "data_format" "date" "directory_to_save"
#> [4] "acknowledgement" "download" "remove_command"
#> [7] "unzip" "remove_zip" "hash"
Define the parameters.
dates <- c("2023-12-28", "2024-01-02")
data_format <- "Shapefile"
data_directory <- "./download_example/"
acknowledgement <- TRUE
download <- TRUE # run data download
remove_command <- TRUE # delete "...commands.txt" file
unzip <- TRUE # inflate (unzip) downloaded zip files
remove_zip <- FALSE # retain downloaded zip files
Download the data.
download_data(
dataset_name = "hms",
date = dates,
directory_to_save = data_directory,
acknowledgement = acknowledgement,
download = download,
remove_command = remove_command,
unzip = unzip,
remove_zip = remove_zip
)
#> Downloading requested files...
#> Requested files have been downloaded.
#> Unzipping files...
#> Files unzipped and saved in ./download_example/.
#> Unzipping files...
#> Files unzipped and saved in ./download_example/.
#> Unzipping files...
#> Files unzipped and saved in ./download_example/.
#> Unzipping files...
#> Files unzipped and saved in ./download_example/.
#> Unzipping files...
#> Files unzipped and saved in ./download_example/.
#> Unzipping files...
#> Files unzipped and saved in ./download_example/.
Checking the directory shows that all of our desired data has been downloaded and inflated, and the original zip files have been retained.
list.files(data_directory)
#> [1] "hms_smoke_Shapefile_20231228.zip" "hms_smoke_Shapefile_20231229.zip"
#> [3] "hms_smoke_Shapefile_20231230.zip" "hms_smoke_Shapefile_20231231.zip"
#> [5] "hms_smoke_Shapefile_20240101.zip" "hms_smoke_Shapefile_20240102.zip"
#> [7] "hms_smoke20231228.dbf" "hms_smoke20231228.prj"
#> [9] "hms_smoke20231228.shp" "hms_smoke20231228.shx"
#> [11] "hms_smoke20231229.dbf" "hms_smoke20231229.prj"
#> [13] "hms_smoke20231229.shp" "hms_smoke20231229.shx"
#> [15] "hms_smoke20231230.dbf" "hms_smoke20231230.prj"
#> [17] "hms_smoke20231230.shp" "hms_smoke20231230.shx"
#> [19] "hms_smoke20231231.dbf" "hms_smoke20231231.prj"
#> [21] "hms_smoke20231231.shp" "hms_smoke20231231.shx"
#> [23] "hms_smoke20240101.dbf" "hms_smoke20240101.prj"
#> [25] "hms_smoke20240101.shp" "hms_smoke20240101.shx"
#> [27] "hms_smoke20240102.dbf" "hms_smoke20240102.prj"
#> [29] "hms_smoke20240102.shp" "hms_smoke20240102.shx"
download_hms
function code
The following is the entire R code used to create
download_hms
.
download_hms
#> function (data_format = "Shapefile", date = c("2018-01-01", "2018-01-01"),
#> directory_to_save = NULL, acknowledgement = FALSE, download = FALSE,
#> remove_command = FALSE, unzip = TRUE, remove_zip = FALSE,
#> hash = FALSE)
#> {
#> download_permit(acknowledgement = acknowledgement)
#> check_for_null_parameters(mget(ls()))
#> if (length(date) == 1)
#> date <- c(date, date)
#> stopifnot(length(date) == 2)
#> date <- date[order(as.Date(date))]
#> directory_original <- download_sanitize_path(directory_to_save)
#> directories <- download_setup_dir(directory_original, zip = TRUE)
#> directory_to_download <- directories[1]
#> directory_to_save <- directories[2]
#> if (unzip == FALSE && remove_zip == TRUE) {
#> stop(paste0("Arguments unzip = FALSE and remove_zip = TRUE are not ",
#> "acceptable together. Please change one.\n"))
#> }
#> date_sequence <- generate_date_sequence(date[1], date[2],
#> sub_hyphen = TRUE)
#> base <- "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/"
#> commands_txt <- paste0(directory_original, "hms_smoke_",
#> utils::head(date_sequence, n = 1), "_", utils::tail(date_sequence,
#> n = 1), "_curl_commands.txt")
#> download_sink(commands_txt)
#> download_names <- NULL
#> for (f in seq_along(date_sequence)) {
#> year <- substr(date_sequence[f], 1, 4)
#> month <- substr(date_sequence[f], 5, 6)
#> if (tolower(data_format) == "shapefile") {
#> data_format <- "Shapefile"
#> suffix <- ".zip"
#> directory_to_cat <- directory_to_download
#> }
#> else if (tolower(data_format) == "kml") {
#> data_format <- "KML"
#> suffix <- ".kml"
#> directory_to_cat <- directory_to_save
#> }
#> url <- paste0(base, data_format, "/", year, "/", month,
#> "/hms_smoke", date_sequence[f], suffix)
#> if (f == 1) {
#> if (!(check_url_status(url))) {
#> sink()
#> file.remove(commands_txt)
#> stop(paste0("Invalid date returns HTTP code 404. ",
#> "Check `date` parameter.\n"))
#> }
#> }
#> destfile <- paste0(directory_to_cat, "hms_smoke_", data_format,
#> "_", date_sequence[f], suffix)
#> download_names <- c(download_names, destfile)
#> command <- paste0("curl -s -o ", destfile, " --url ",
#> url, "\n")
#> if (check_destfile(destfile)) {
#> cat(command)
#> }
#> }
#> sink()
#> download_run(download = download, commands_txt = commands_txt,
#> remove = remove_command)
#> if (data_format == "KML") {
#> unlink(directory_to_download, recursive = TRUE)
#> message(paste0("KML files cannot be unzipped.\n"))
#> return(TRUE)
#> }
#> for (d in seq_along(download_names)) {
#> download_unzip(file_name = download_names[d], directory_to_unzip = directory_to_save,
#> unzip = unzip)
#> }
#> download_remove_zips(remove = remove_zip, download_name = download_names)
#> return(download_hash(hash, directory_to_save))
#> }
#> <bytecode: 0x55a63a98c400>
#> <environment: namespace:amadeus>