8 HCUP and Amadeus Smoke Plume Use Case

Profile-CMP Profile-CDM Profile-CHW Profile-STU

Integrating HCUP databases with Amadeus Exposure data

Date Modified: April 29, 2025

Author: Darius M. Bost

Programming Language: R

8.1 Motivation

Understanding the relationship between external environmental factors and health outcomes is critical for guiding public health strategies and policy decisions. Integrating individual patient records from the Healthcare Cost and Utilization Project (HCUP) with data from environmental datasets allows researchers to examine how elements such as air quality, wildfire emissions, and extreme temperatures impact hospital visits and healthcare utilization patterns.

Ultimately, linking HCUP and environmental exposure data enhances public health monitoring and helps researchers better quantify environmental health risks.

8.2 Tutorial

8.5 Data Analysis using HCUP and Amadeus data sources

First we will load in our hcup data file we processed earlier and subset the file to a set of observations that make the data easier to work with (702 to 39 columns) and are still interesting for analysis. This includes zipcodes (ZIP), age at admission (AGE), admission month (AMONTH), race identifier (RACE), sex (FEMALE), and ICD 10 diagnosis codes (I10_).

or_sedd_2021 <- fread("OR_SEDD_2021_CORE.csv")
subset_data <- or_sedd_2021 %>%
  select(FEMALE, ZIP, PSTCO, AGE, RACE, AMONTH, starts_with("I10_"))
head(subset_data)

# [1] "FEMALE"               "ZIP"                  "PSTCO"
#  [4] "AGE"                  "RACE"                 "AMONTH"
#  [7] "I10_DX_Visit_Reason1" "I10_DX_Visit_Reason2" "I10_DX_Visit_Reason3"
# [10] "I10_DX1"              "I10_DX2"              "I10_DX3"
# [13] "I10_DX4"              "I10_DX5"              "I10_DX6"
# [16] "I10_DX7"              "I10_DX8"              "I10_DX9"
# [19] "I10_DX10"             "I10_DX11"             "I10_DX12"
# [22] "I10_DX13"             "I10_DX14"             "I10_DX15"
# [25] "I10_DX16"             "I10_DX17"             "I10_DX18"
# [28] "I10_DX19"             "I10_DX20"             "I10_DX21"
# [31] "I10_DX22"             "I10_DX23"             "I10_DX24"
# [34] "I10_DX25"             "I10_DX26"             "I10_DX27"
# [37] "I10_DX28"             "I10_NDX"              "I10_PROCTYPE"

Next we will select July as our month of interest to further reduce the size of the data and to focus on a time frame where we know fires took place in Oregon. We will also load in our environmental data files we made above from amadeus.

# subset data to July
july_subset_hcup_data <- subset_data[subset_data$AMONTH == 7, ]

# load in amadeus files we made previously
avg_smoke_density <- readRDS("smoke_density_avg_byZip.R")
total_smoke_density <- readRDS("smoke_density_total_byZip.R")

8.5.1 Merging Environmental Data with Hospital Data

We will now merge our environmental data into our hospital discharge (HCUP) data using an inner join on ZIP codes present in both datasets.

# Perform an inner join to merge `july_subset_hcup_data` with
# `avg_smoke_density` based on the ZIP code (`ZIP` in HCUP data and
# `ZCTA5CE10` in smoke density data)
merged_data <- inner_join(july_subset_hcup_data, avg_smoke_density,
                          by = c("ZIP" = "ZCTA5CE10"))

# Perform another inner join to add `total_smoke_density` to the existing
# `merged_data`
merged_data <- inner_join(merged_data, total_smoke_density,
                          by = c("ZIP" = "ZCTA5CE10"))

8.5.2 Identifying Asthma Cases

Next, we will identify individuals diagnosed with asthma. This involves searching for the ICD-10 code “J45” within the diagnosis columns of our dataset.

# Identify the columns containing diagnosis codes (prefix "I10_")
diag_columns <- grep("^I10_", colnames(merged_data), value = TRUE)

# Create a new column `has_asthma` that checks if any diagnosis contains "J45"
smoke_summary <- merged_data %>%
  mutate(has_asthma = apply(select(., all_of(diag_columns)), 1, function(x) {
    any(grepl("J45", x))
  }))

# Count total number of individuals in the dataset
total_individuals <- nrow(smoke_summary)

# Count the number of individuals with an asthma diagnosis
asthma_cases <- sum(smoke_summary$has_asthma, na.rm = TRUE)

# Calculate the proportion of individuals diagnosed with asthma
asthma_prevalence <- asthma_cases / total_individuals

8.5.3 Visualizing the Relationship Between Heavy Smoke Exposure and Asthma

We will now generate a violin plot to visualize the distribution of average heavy smoke days across individuals with and without an asthma diagnosis.

ggplot(smoke_long, aes(x = factor(has_asthma), y = smoke_days,
                       fill = factor(has_asthma))) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, outlier.shape = NA, alpha = 0.5) +
  facet_wrap(~ smoke_type, scales = "free_y") +
  labs(
    x = "Asthma Diagnosis (J45)",
    y = "Average Smoke Days",
    fill = "Asthma Status",
    title = "Violin Plots: Smoke Exposure by Asthma Diagnosis",
    subtitle = "Faceted by smoke type (light, medium, heavy)"
  ) +
  theme_minimal()

This figure shows the distribution of average smoke exposure days by asthma status (J45 diagnosis), separated into three smoke types: light, medium, and heavy. Each subplot is a violin plot faceted by smoke type, comparing individuals with asthma (TRUE) and without asthma (FALSE).

Key Observations from Violin Plots

avg_light

  • Both asthma and non-asthma groups are centered around ~0.2 average light smoke days.
  • The distributions are broad and overlapping, suggesting no strong visual difference.

avg_medium

  • Both groups have a narrower range than light exposure, mostly between 0.1–0.3 days.
  • There is slightly more central density in the asthma group around 0.2, but overall the distributions still overlap substantially.

avg_heavy

  • This variable shows the greatest skew: most individuals have very low or zero average heavy smoke exposure.
  • However, there is a small tail in the asthma group suggesting higher exposure in some individuals.

8.5.4 Bivariate Map Analysis: Asthma and Heavy Smoke Exposure

The code below producess a bivariate map for spatial analysis of asthma prevalence and heavy smoke exposure across ZIP codes in Oregon. Using hospital discharge data HCUP and NOAA’s HMS smoke plume data, ZIP-level asthma rates were calculated and paired with the average number of heavy smoke days over the same time period.

# Load additional libraries
library(biscale)
library(cowplot)

# Recall we have identified diagnosis columns
head(diag_columns)

# Recall our smoke summary dataframe
head(smoke_summary)

# Summarize asthma rate and heavy smoke exposure by ZIP code
zip_summary <- smoke_summary %>%
  group_by(ZIP) %>%
  summarize(
    asthma_rate = mean(has_asthma, na.rm = TRUE),
    avg_heavy_smoke = mean(sum_heavy, na.rm = TRUE)
  )

# Recall ZIP code shapefile (ZCTA) for Oregon
head(or)

# Join spatial and summary data
map_data <- left_join(or, zip_summary, by = c("ZCTA5CE10" = "ZIP"))

# Filter out missing values
map_data_clean <- map_data %>%
  filter(!is.na(asthma_rate) & !is.na(avg_heavy_smoke))

# Apply bivariate classification
map_data_clean <- bi_class(map_data_clean, x = asthma_rate,
                           y = avg_heavy_smoke, style = "quantile", dim = 3)

# Create the main map
map <- ggplot() +
  theme_void(base_size = 14) +
  geom_sf(data = map_data_clean, aes(fill = bi_class), color = "white",
          size = 0.1, show.legend = FALSE) +
  bi_scale_fill(pal = "GrPink", dim = 3) +
  labs(
    title = "Asthma Prevalence vs Heavy Smoke Exposure by ZIP Code",
    subtitle = "Bivariate map showing intersection of health and environmental
    burden",
    caption = "Source: HCUP-Amadeus & NOAA HMS Smoke Data"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    plot.caption = element_text(size = 10, face = "italic", hjust = 1),
    plot.margin = margin(10, 20, 10, 20)
  )

# Create the legend
legend <- bi_legend(
  pal = "GrPink",
  dim = 3,
  xlab = "Higher Smoke",
  ylab = "Higher Asthma",
  size = 10,
  flip_axes = FALSE,
  rotate_pal = FALSE
)

# Combine map and legend
final_plot <- ggdraw() +
  draw_plot(map, 0, 0, 1, 1) +
  draw_plot(legend, 0.77, 0.05, 0.2, 0.2)

# Display the final plot
final_plot

Each ZIP code is shaded based on the intersection of these two variables using a 3x3 quantile classification. The bivariate color scale in the legend shows increasing smoke exposure along the x-axis (red) and increasing asthma prevalence along the y-axis (blue):

Dark red areas: High smoke exposure, low asthma prevalence

Dark blue areas: High asthma prevalence, low smoke exposure

Dark purple areas: High asthma and high smoke — indicating areas with compounded health and environmental burdens

Light gray areas: Low on both dimensions

This bivariate map helps identify regions where environmental and health vulnerabilities intersect and can inform targeted public health responses.

We observe clusters of purple areas, particularly in southern and eastern Oregon — ZIPs that face both high smoke exposure and high asthma prevalence. In contrast, parts of the western coastal area are red — high smoke but low asthma, while parts of northeastern Oregon are blue — high asthma but lower smoke.

This can be explained by: Smoke from fires (especially in southwestern Oregon and northern California) is carried eastward/northeastward by prevailing winds.

As smoke moves inland, it encounters the Cascade Range, where it: Is forced to rise and spill over peaks, gets concentrated in valleys and downwind ZIP codes. This orographic uplift and channeling can increase heavy smoke exposure in ZIP codes east of the mountains, aligning with the purple zones in your bivariate map—areas with both heavy smoke exposure and high asthma prevalence.

Public Health Relevance This type of bivariate mapping can help prioritize resources for ZIP codes that are most vulnerable. For example, purple regions may benefit from both air quality interventions and respiratory health programs, while blue regions may warrant investigation into other asthma risk factors.

8.5.5 Logistic Regression Analysis

Finally, we fit a logistic regression model to examine the relationship between asthma diagnoses and exposure to different levels of smoke density.

# Fit a logistic regression model with asthma diagnosis as the outcome variable
# and different smoke exposure levels as predictors
model <- glm(has_asthma ~ avg_light + avg_medium + avg_heavy,
             data = smoke_summary, family = binomial)

# Display model summary
summary(model)
# Call:
# glm(formula = has_asthma ~ avg_light + avg_medium + avg_heavy,
#     family = binomial, data = smoke_summary)
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept) -3.38823    0.09077 -37.329  < 2e-16 ***
# avg_light   -0.21258    0.30322  -0.701    0.483
# avg_medium   1.74996    0.32456   5.392 6.98e-08 ***
# avg_heavy    1.82572    0.16826  10.850  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 42004  on 111124  degrees of freedom
# Residual deviance: 41674  on 111121  degrees of freedom
# AIC: 41682
#
# Number of Fisher Scoring iterations: 6

The output provides estimates for each predictor, helping us assess the impact of light, medium, and heavy smoke exposure on asthma prevalence.