8 HCUP and Amadeus Smoke Plume Use Case

Profile-CMP Profile-CDM Profile-CHW Profile-STU

Integrating HCUP databases with Amadeus Exposure data

Date Modified: April 29, 2025

Author: Darius M. Bost

Programming Language: R

8.1 Motivation

Understanding the relationship between external environmental factors and health outcomes is critical for guiding public health strategies and policy decisions. Integrating individual patient records from the Healthcare Cost and Utilization Project (HCUP) with data from environmental datasets allows researchers to examine how elements such as air quality, wildfire emissions, and extreme temperatures impact hospital visits and healthcare utilization patterns.

Ultimately, linking HCUP and environmental exposure data enhances public health monitoring and helps researchers better quantify environmental health risks.

8.2 Tutorial

8.5 Data Analysis using HCUP and Amadeus data sources

First we will load in our hcup data file we processed earlier and subset the file to a set of observations that make the data easier to work with (702 to 39 columns) and are still interesting for analysis. This includes zipcodes (ZIP), age at admission (AGE), admission month (AMONTH), race identifier (RACE), sex (FEMALE), and ICD 10 diagnosis codes (I10_).

or_sedd_2021 <- fread("OR_SEDD_2021_CORE.csv")
subset_data <- or_sedd_2021 %>%
  select(FEMALE, ZIP, PSTCO, AGE, RACE, AMONTH, starts_with("I10_"))
head(subset_data)

# [1] "FEMALE"               "ZIP"                  "PSTCO"
#  [4] "AGE"                  "RACE"                 "AMONTH"
#  [7] "I10_DX_Visit_Reason1" "I10_DX_Visit_Reason2" "I10_DX_Visit_Reason3"
# [10] "I10_DX1"              "I10_DX2"              "I10_DX3"
# [13] "I10_DX4"              "I10_DX5"              "I10_DX6"
# [16] "I10_DX7"              "I10_DX8"              "I10_DX9"
# [19] "I10_DX10"             "I10_DX11"             "I10_DX12"
# [22] "I10_DX13"             "I10_DX14"             "I10_DX15"
# [25] "I10_DX16"             "I10_DX17"             "I10_DX18"
# [28] "I10_DX19"             "I10_DX20"             "I10_DX21"
# [31] "I10_DX22"             "I10_DX23"             "I10_DX24"
# [34] "I10_DX25"             "I10_DX26"             "I10_DX27"
# [37] "I10_DX28"             "I10_NDX"              "I10_PROCTYPE"

Next we will select July as our month of interest to further reduce the size of the data and to focus on a time frame where we know fires took place in Oregon. We will also load in our environmental data files we made above from amadeus.

# subset data to July
july_subset_hcup_data <- subset_data[subset_data$AMONTH == 7, ]

# load in amadeus files we made previously
avg_smoke_density <- readRDS("smoke_density_avg_byZip.R")
total_smoke_density <- readRDS("smoke_density_total_byZip.R")

8.5.1 Merging Environmental Data with Hospital Data

We will now merge our environmental data into our hospital discharge (HCUP) data using an inner join on ZIP codes present in both datasets.

# Perform an inner join to merge `july_subset_hcup_data` with
# `avg_smoke_density` based on the ZIP code (`ZIP` in HCUP data and
# `ZCTA5CE10` in smoke density data)
merged_data <- inner_join(july_subset_hcup_data, avg_smoke_density,
                          by = c("ZIP" = "ZCTA5CE10"))

# Perform another inner join to add `total_smoke_density` to the existing
# `merged_data`
merged_data <- inner_join(merged_data, total_smoke_density,
                          by = c("ZIP" = "ZCTA5CE10"))

8.5.2 Identifying Asthma Cases

Next, we will identify individuals diagnosed with asthma. This involves searching for the ICD-10 code “J45” within the diagnosis columns of our dataset.

# Identify the columns containing diagnosis codes (prefix "I10_")
diag_columns <- grep("^I10_", colnames(merged_data), value = TRUE)

# Create a new column `has_asthma` that checks if any diagnosis contains "J45"
smoke_summary <- merged_data %>%
  mutate(has_asthma = apply(select(., all_of(diag_columns)), 1, function(x) {
    any(grepl("J45", x))
  }))

# Count total number of individuals in the dataset
total_individuals <- nrow(smoke_summary)

# Count the number of individuals with an asthma diagnosis
asthma_cases <- sum(smoke_summary$has_asthma, na.rm = TRUE)

# Calculate the proportion of individuals diagnosed with asthma
asthma_prevalence <- asthma_cases / total_individuals

8.5.3 Visualizing the Relationship Between Heavy Smoke Exposure and Asthma

We will now generate a boxplot to visualize the distribution of average heavy smoke days across individuals with and without an asthma diagnosis.

ggplot(smoke_summary, aes(x = factor(has_asthma), y = avg_heavy,
                          fill = factor(has_asthma))) +
  geom_boxplot() +
  labs(
    x = "Has Asthma (J45)",
    y = "Average Heavy Smoke Days",
    fill = "Has Asthma",
    title = "Average Heavy Smoke Days vs Asthma Diagnosis"
  ) +
  theme_minimal()

8.5.4 Bivariate Map Analysis: Asthma and Heavy Smoke Exposure

The code below producess a bivariate map for spatial analysis of asthma prevalence and heavy smoke exposure across ZIP codes in Oregon. Using hospital discharge data HCUP and NOAA’s HMS smoke plume data, ZIP-level asthma rates were calculated and paired with the average number of heavy smoke days over the same time period.

# Load additional libraries
library(biscale)
library(cowplot)

# Recall we have identified diagnosis columns
head(diag_columns)

# Recall our smoke summary dataframe
head(smoke_summary)

# Summarize asthma rate and heavy smoke exposure by ZIP code
zip_summary <- smoke_summary %>%
  group_by(ZIP) %>%
  summarize(
    asthma_rate = mean(has_asthma, na.rm = TRUE),
    avg_heavy_smoke = mean(sum_heavy, na.rm = TRUE)
  )

# Recall ZIP code shapefile (ZCTA) for Oregon
head(or)

# Join spatial and summary data
map_data <- left_join(or, zip_summary, by = c("ZCTA5CE10" = "ZIP"))

# Filter out missing values
map_data_clean <- map_data %>%
  filter(!is.na(asthma_rate) & !is.na(avg_heavy_smoke))

# Apply bivariate classification
map_data_clean <- bi_class(map_data_clean, x = asthma_rate,
                           y = avg_heavy_smoke, style = "quantile", dim = 3)

# Create the main map
map <- ggplot() +
  theme_void(base_size = 14) +
  geom_sf(data = map_data_clean, aes(fill = bi_class), color = "white",
          size = 0.1, show.legend = FALSE) +
  bi_scale_fill(pal = "GrPink", dim = 3) +
  labs(
    title = "Asthma Prevalence vs Heavy Smoke Exposure by ZIP Code",
    subtitle = "Bivariate map showing intersection of health and environmental
    burden",
    caption = "Source: HCUP-Amadeus & NOAA HMS Smoke Data"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    plot.caption = element_text(size = 10, face = "italic", hjust = 1),
    plot.margin = margin(10, 20, 10, 20)
  )

# Create the legend
legend <- bi_legend(
  pal = "GrPink",
  dim = 3,
  xlab = "Higher Smoke",
  ylab = "Higher Asthma",
  size = 10,
  flip_axes = FALSE,
  rotate_pal = FALSE
)

# Combine map and legend
final_plot <- ggdraw() +
  draw_plot(map, 0, 0, 1, 1) +
  draw_plot(legend, 0.77, 0.05, 0.2, 0.2)

# Display the final plot
final_plot

Each ZIP code is shaded based on the intersection of these two variables using a 3x3 quantile classification. The bivariate color scale in the legend shows increasing smoke exposure along the x-axis (red) and increasing asthma prevalence along the y-axis (blue):

Dark red areas: High smoke exposure, low asthma prevalence

Dark blue areas: High asthma prevalence, low smoke exposure

Dark purple areas: High asthma and high smoke — indicating areas with compounded health and environmental burdens

Light gray areas: Low on both dimensions

This bivariate map helps identify regions where environmental and health vulnerabilities intersect and can inform targeted public health responses.

8.5.5 Logistic Regression Analysis

Finally, we fit a logistic regression model to examine the relationship between asthma diagnoses and exposure to different levels of smoke density.

# Fit a logistic regression model with asthma diagnosis as the outcome variable
# and different smoke exposure levels as predictors
model <- glm(has_asthma ~ avg_light + avg_medium + avg_heavy,
             data = smoke_summary, family = binomial)

# Display model summary
summary(model)
# Call:
# glm(formula = has_asthma ~ avg_light + avg_medium + avg_heavy,
#     family = binomial, data = smoke_summary)
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept) -3.38823    0.09077 -37.329  < 2e-16 ***
# avg_light   -0.21258    0.30322  -0.701    0.483
# avg_medium   1.74996    0.32456   5.392 6.98e-08 ***
# avg_heavy    1.82572    0.16826  10.850  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 42004  on 111124  degrees of freedom
# Residual deviance: 41674  on 111121  degrees of freedom
# AIC: 41682
#
# Number of Fisher Scoring iterations: 6

The output provides estimates for each predictor, helping us assess the impact of light, medium, and heavy smoke exposure on asthma prevalence.