Modeling respiratory related mortality in California from air pollution and social deprivation data
Data files
Oct 30, 2025 version files (33.33 MB total)
- CA_census_pops1019.xlsx (2.80 MB)
- CA_shapefiles.zip (2.89 MB)
- Cal-ViDa_Death1423.xlsx (26.11 MB)
- final_EPA_data.csv (1.13 MB)
- INLA-7.2-Final.Rmd (266.96 KB)
- Population_Categories.xlsx (15.69 KB)
- README.md (5.92 KB)
- SoA.data.1019.xlsx (101.18 KB)
Abstract
In the application study of our submitted manuscript, we demonstrate our proposed methodology by modeling respiratory-related deaths at the county and monthly level across California from 2015 to 2019. The mortality data were downloaded from the California Department of Public Health's California Vital Data (Cal-ViDa) query tool. The model leverages spatial patterns in the social deprivation index (SDI) from the Society of Actuaries (SoA) and air pollutant measurements from the US Environmental Protection Agency (EPA) to estimate the spatiotemporal dependence structure of the response. All data used in this application study are publicly available and are provided in this repository.
Dataset DOI: 10.5061/dryad.j3tx95xtt
Description of the data and file structure
This folder contains all the datasets and geographic files needed to run the analysis code in INLA-7.2-Final.Rmd. These datasets were downloaded from various publicly available sources.
Files and variables
- CA_census_pops1019.xlsx: census population estimates for California counties for each year 2010-2019
  - REGION: US census region code
  - DIVISION: US census division code
  - STATE: US census state code
  - COUNTY: US census county code
  - STNAME: state name (all California)
  - CTYNAME: county name (all Californian counties)
  - POPESTIMATE2010: census population estimate for given county in 2010
  - POPESTIMATE2011: census population estimate for given county in 2011
  - POPESTIMATE2012: census population estimate for given county in 2012
  - POPESTIMATE2013: census population estimate for given county in 2013
  - POPESTIMATE2014: census population estimate for given county in 2014
  - POPESTIMATE2015: census population estimate for given county in 2015
  - POPESTIMATE2016: census population estimate for given county in 2016
  - POPESTIMATE2017: census population estimate for given county in 2017
  - POPESTIMATE2018: census population estimate for given county in 2018
  - POPESTIMATE2019: census population estimate for given county in 2019
- CA_shapefiles.zip: shape files needed to perform spatial analysis and plot maps of California. Shapefiles are a common format for vector-based geographic information system (GIS) data. They can be opened and used in any GIS software and in R or Python. A shapefile consists of multiple file types beyond the .shp (specifically, .cpg, .dbf, .prj, .sbn, and .sbx). The user only interacts directly with the .shp file but the other files need to be in the same directory.
- Cal-ViDa_Death1423.xlsx
  - Year_of_Death: year of death
  - Month_of_Death: month of death
  - County_of_Death: county of death
  - Age: age group (categorical)
  - Cause_of_Death: cause of death (for this analysis, filter to the "influenza and pneumonia" and "chronic lower respiratory diseases" categories)
  - Total_Deaths: death counts (small counts are censored)
- final_EPA_data.csv
  - County: county
  - Year-Month: month of observation in date form
  - Value: air pollutant measurement value (unit varies by pollutant)
    - Lead (micrograms/m3)
    - CO (parts per million)
    - SO2 (parts per billion)
    - NO2 (parts per billion)
    - O3 (parts per million)
    - PM10 (micrograms/m3)
    - PM2.5 (micrograms/m3)
  - AQI: air quality index (separate variable from air pollutants)
  - Pollutant: air pollutant variable code
    - 14129: Lead
    - 42101: CO
    - 42401: SO2
    - 42602: NO2
    - 44201: O3
    - 81102: PM10
    - 88101: PM2.5
- Population_Categories.xlsx: annual estimates of the resident population for California state for selected age groups and by sex for the years 2020-2022
- SoA.data.1019.xlsx
  - county_fips: unique identifier for each county
  - county_name: county name
  - State_USPS: state abbreviation (all CA)
  - Year: year
  - Source: where the data came from (American Community Survey)
  - Score: social deprivation index score (unitless) calculated from subindices
  - Decile: score decile among counties
  - Quintile: score quintile among counties
  - Total_Pop: population of county
  - EDUC_Lessthan9: % of the population aged 25 and over with less than 9 years of education
  - EDUC_college: % of the population aged 25 and over with at least 4 years of college education
  - White_Collar: % of the population aged 16 and over employed in a white collar occupation
  - Unemployment_Rate: unemployment rate (%) for the population 16 years and over
  - Adj_HH_income: median household income adjusted for local housing costs (dollars)
  - Income_Disparity: ratio of the average household income in the lowest quintile to the average household income in the highest quintile
  - Individuals_Below_Poverty: % of the population below the federal poverty threshold
  - Median_Home_Value: median home value for owner occupied units (dollars)
  - Median_Gross_Rent: median gross rent for rental units (dollars)
  - Housing_No_Telephone: % of housing without a telephone
  - Housing_Incomplete_Plumbing: % of housing without complete plumbing
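Because final_EPA_data.csv stores every pollutant in a single Value column, each row has to be read together with its Pollutant code to know what was measured and in which unit. The following is a minimal Python sketch of that lookup, using only the codes, names, and units listed above. The function name `label_pollutants`, the added columns `Pollutant_Name` and `Unit`, and the sample rows are hypothetical and serve only as illustration; this is not taken from the analysis code in INLA-7.2-Final.Rmd.

```python
import csv
import io

# Pollutant codes, names, and units as listed in this README.
POLLUTANTS = {
    "14129": ("Lead", "micrograms/m3"),
    "42101": ("CO", "parts per million"),
    "42401": ("SO2", "parts per billion"),
    "42602": ("NO2", "parts per billion"),
    "44201": ("O3", "parts per million"),
    "81102": ("PM10", "micrograms/m3"),
    "88101": ("PM2.5", "micrograms/m3"),
}


def label_pollutants(rows):
    """Attach a pollutant name and unit to each row based on its Pollutant code."""
    labeled = []
    for row in rows:
        code = str(row["Pollutant"]).strip()
        # Codes not listed in the README are flagged rather than guessed.
        name, unit = POLLUTANTS.get(code, ("unknown", "unknown"))
        labeled.append({**row, "Pollutant_Name": name, "Unit": unit})
    return labeled


# Hypothetical rows for illustration only; real rows come from final_EPA_data.csv.
sample = io.StringIO(
    "County,Year-Month,Value,AQI,Pollutant\n"
    "Alameda,2015-01-01,0.021,19,44201\n"
    "Alameda,2015-01-01,11.3,52,88101\n"
)
labeled = label_pollutants(csv.DictReader(sample))
for row in labeled:
    print(row["County"], row["Pollutant_Name"], row["Value"], row["Unit"])
```

To apply the same lookup to the real file, pass `csv.DictReader(open("final_EPA_data.csv", newline=""))` in place of the sample. The date format of the Year-Month column is not assumed here, so it is left as text.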
To obtain all of the cleaned, aggregated, and assembled data at once, use the INLA-7.2-Premodeling.RData file. See https://github.com/jeffwu25/KGR-SKATER/tree/main for the main analysis examples and supplementary analyses.
Code/software
All analyses were run in RStudio using R 4.5.1. The R Markdown file INLA-7.2-Final.Rmd contains an example of the full analysis workflow.
The following packages are used in that code file:
- dplyr
- tidyverse
- lubridate
- stringr
- zoo
- ggplot2
- urbnmapr
- devtools
- readxl
- spdep
- sp
- huge
- INLA
- HMMpa
- invgamma
- brinla
- reshape2
- patchwork
- jsonlite
- geosphere
- RAQSAPI
- con2aqi
- pscl
- corrplot
- superheat
- shapes
- scales
Access information
Data was derived from the following sources:
- Mortality data: https://cal-vida.cdph.ca.gov/VSQWeb
- EPA air pollutant data: https://aqs.epa.gov/aqsweb/documents/data_api.html
- Social deprivation index data: https://www.soa.org/resources/research-reports/2020/us-mort-rate-socioeconomic/
Human subjects data
The mortality data come from Cal-ViDa, which de-identifies the data before release. In addition, small-count cells are censored in the source and must be imputed before modeling. The data are aggregated to the county level for the application study.
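Before any imputation, the censored cells in Total_Deaths need to be separated from the observed counts. The Python sketch below shows one way to do that. It assumes a censored cell appears as a non-numeric entry; the marker `"<11"` in the sample, the helper name `parse_death_count`, and the sample records are all hypothetical, since the exact censoring marker is not documented here. It only flags the cells to be imputed and does not reproduce the imputation used in the application study.

```python
def parse_death_count(cell):
    """Return the count as an int, or None when the cell is censored or blank."""
    try:
        return int(float(str(cell).replace(",", "")))
    except (ValueError, OverflowError):
        # Non-numeric entries (assumed censoring markers) and blanks land here.
        return None


# Hypothetical records using the column names documented for Cal-ViDa_Death1423.xlsx.
records = [
    {"County_of_Death": "Alameda", "Total_Deaths": "14"},
    {"County_of_Death": "Alpine", "Total_Deaths": "<11"},  # assumed censoring marker
    {"County_of_Death": "Amador", "Total_Deaths": ""},
]

observed = []  # rows with a usable count
missing = []   # rows whose count is censored or blank and still needs imputing
for rec in records:
    count = parse_death_count(rec["Total_Deaths"])
    target = missing if count is None else observed
    target.append({**rec, "Total_Deaths": count})

print(len(observed), "observed;", len(missing), "to be imputed")
```

Keeping the flagged rows in a separate list, rather than dropping them, preserves the county and month cells that an imputation step would later need to fill.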
