Annotations on COVID-19 state data definitions as of March 7, 2021
Data files
Feb 24, 2022 version files (763.05 KB total)
- data-sources.csv (19.26 KB)
- definitions.csv (7.52 KB)
- gis-links.md (40.74 KB)
- hospitalization.csv (29.26 KB)
- README.md (13.58 KB)
- sets.csv (2.65 KB)
- state-metadata.csv (650.04 KB)
Abstract
The COVID Tracking Project was a volunteer organization launched from The Atlantic and dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the United States. Our dataset was in use by national and local news organizations across the United States and by research projects and agencies worldwide.
In the US, health data infrastructure has always been siloed. Fifty-six states and territories maintain pipelines to collect infectious disease data, each built differently and subject to different limitations. The unique constraints of these uncoordinated state data systems, combined with an absence of federal guidance on how states should report public data, created a big problem when it came to assembling a national picture of COVID-19’s spread in the US: Each state has made different choices about which metrics to report and how to define them—or has even had its hand forced by technical limitations.
Those decisions have affected both The COVID Tracking Project’s data, assembled from states’ public data releases, and the CDC’s data, which comes mostly from submissions by state and territorial health departments. And they have had real consequences for the numbers: A state’s data definitions might be the difference between the state appearing to have 5% versus 20% test positivity, between labeling a COVID-19 case as active versus recovered, or between counting or not counting large numbers of COVID-19 cases and deaths at all.
Because state definitions affect the data we collect, COVID Tracking Project researchers have needed to maintain structured, detailed records on how states define all the testing and outcomes data points we capture in our API (and a few we don’t). Internally, we call this constantly evolving body of knowledge “annotations.” Today, we are for the first time publishing our complete collection of annotations.
Methods
This dataset was compiled by volunteers with The COVID Tracking Project. As states changed their definitions of testing, outcomes, and hospitalization figures, we updated a centralized database of annotations, organized by state and by metric.
Usage notes
This is a one-time snapshot of our research into state and territorial definitions as of March 7, 2021, rather than a constantly updating source of information. As you use or build on our work, remember that state COVID-19 information changes quickly, and some of our information will have already fallen out of date. And unusually for The COVID Tracking Project, we’ve chosen to release some information in the annotations that hasn’t been double-checked by experienced contributors before its release, so it may contain classification mistakes. Given the timing of our shutdown, we decided that providing a comprehensive look at our annotation structures was more important than releasing only a subset of information that we were sure was 100% accurate.
We hope that this full view of our annotations will be of use to researchers and data users aiming to understand state-level COVID-19 data and the methodologies we have used to collect and analyze it.
Hospitalization
You can read about what our current hospitalizations annotations mean and the work they’ve supported at CTP.
“Source Notes” are a set of instructions for finding the data on state pages and dashboards. States may have changed their reporting since we last looked at them, so older annotations may not be reliable. Every annotation has a “Last Checked” date that shows when we last looked at the metric in question.
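As a rough illustration of using the “Last Checked” date to spot annotations that may be out of date, the sketch below filters rows older than a chosen cutoff. The sample rows, the "State" column name, and the ISO date format are assumptions made for the example; check the headers and date format in the released CSV files before adapting it.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical rows for illustration only; not taken from the real dataset.
# The "State" header and the YYYY-MM-DD date format are assumptions.
SAMPLE = """State,Last Checked,Source Notes
State A,2021-03-01,Dashboard then Hospital tab
State B,2020-11-15,Daily PDF report
"""


def stale_annotations(csv_text, cutoff):
    """Return the rows whose "Last Checked" date is earlier than `cutoff`."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in rows
        if datetime.strptime(row["Last Checked"], "%Y-%m-%d").date() < cutoff
    ]


stale = stale_annotations(SAMPLE, date(2021, 1, 1))
print([row["State"] for row in stale])
```

Rows returned by `stale_annotations` are candidates for re-checking against the state's current pages, not proof that a definition has changed.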
You can find out how a state reports their current COVID hospitalizations in the “State Subgroup Labels” column of the hospitalization annotations. This field lists what the state calls the groups it includes in its current hospitalizations metrics. If the annotation is listed as “unclear,” this indicates a data definition was either unavailable or was missing information. “Not reported” indicates that a state does not report current COVID hospitalizations.
The “Evidence” column includes the exact language of material on the state website that leads us to our conclusions. Definitions frequently change, so keep in mind that a metric may not mean the same thing now as it did previously. These annotations reflect only the most recent version of the metric or definition as of the last time we checked.
To determine whether a state lumps their hospitalization metrics, look at the “Cases Reporting” column in Airtable. The cell will include the word “lumped” if the state-reported metric is lumped together. If the cell lists “unclear,” this means the data definition for the state either wasn’t provided or was unclear.
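The check described above can be sketched in a few lines. This is only an illustration: the function name, the "other" fallback category, and the sample cell values are hypothetical, and the real “Cases Reporting” cells may be worded differently.

```python
import re


def classify_cases_reporting(cell):
    """Classify a "Cases Reporting" cell as "lumped", "unclear", or "other".

    Matching is on whole words, so a value such as "unlumped" would not
    count as "lumped". The "other" category is an assumption for this sketch.
    """
    text = cell.strip().lower()
    if re.search(r"\blumped\b", text):
        return "lumped"
    if re.search(r"\bunclear\b", text):
        return "unclear"
    return "other"


# Hypothetical cell values for illustration only.
SAMPLE_CELLS = {
    "State A": "Lumped",
    "State B": "Unclear",
    "State C": "Confirmed only",
}

summary = {state: classify_cases_reporting(cell) for state, cell in SAMPLE_CELLS.items()}
print(summary)
```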
Another major source of variation in hospitalizations definitions is whether states track adult patients, pediatric patients, or both. The vast majority of states are unclear about the populations for which they are tracking current hospitalizations. The “Population” column of this annotation set lists the population included in a state’s currently hospitalized COVID metric.
Annotations
We create annotations for the different metrics that we track using the labels from annotation sets. Metrics can have multiple annotations, for two reasons:
- They can have annotations from multiple different annotation sets. For example, we may have an annotation on both whether a case metric includes non-residents of a given state and what CSTE definition it is following.
- They can have annotations based on conflicting sources of evidence. We track information from state health department webpages, outreach, and external media reporting. Sometimes, we may get different information from outreach to a state health department than what they post on their website, or even two conflicting pieces of information from the same kind of source. At a minimum, we always annotate a metric with labels for any annotation sets that apply to it, reflecting what the state says about the subject in its documentation on its website, for a simple reason: we want to know how clear the state is being in its public presentation of the data.
For an annotation to be labeled with the “website” source type, we expect the information the state provides to be accessible from its data pages and presented as an evergreen resource with the clear intent of defining a metric. Examples include data definition documents, dashboard footnotes, or definitions appearing daily in press releases. If the state is not providing that kind of information, we label the metric as “unclear” according to the state health department website before searching for other sources of information, whether that’s external reporting, our own outreach to state health department officials, or buried resources on the state website (categorized as “sleuthing”).
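One way to picture the rules above is as a small data model in which each annotation ties a metric to an annotation set, a label, and a source type. The sketch below is not the project's actual schema: the class, the function names, the "outreach" source-type string, and the example records are all hypothetical. It shows the website label defaulting to “unclear” when the state publishes no definition, and how conflicting sources can be surfaced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    metric: str
    annotation_set: str
    label: str
    source_type: str  # e.g. "website" or "sleuthing"; other strings are assumed


def website_labels(annotations, metric, applicable_sets):
    """Map each applicable annotation set to its website label, else "unclear"."""
    labels = {name: "unclear" for name in applicable_sets}
    for a in annotations:
        if (a.metric == metric and a.source_type == "website"
                and a.annotation_set in labels):
            labels[a.annotation_set] = a.label
    return labels


def conflicting_sets(annotations, metric):
    """Return the annotation sets where sources disagree on a metric's label."""
    seen = {}
    for a in annotations:
        if a.metric == metric:
            seen.setdefault(a.annotation_set, set()).add(a.label)
    return sorted(name for name, labels in seen.items() if len(labels) > 1)


# Hypothetical records for illustration only.
EXAMPLE = [
    Annotation("cases", "residency", "residents only", "website"),
    Annotation("cases", "residency", "includes non-residents", "outreach"),
    Annotation("cases", "case definition", "confirmed and probable", "sleuthing"),
]

print(website_labels(EXAMPLE, "cases", ["residency", "case definition"]))
print(conflicting_sets(EXAMPLE, "cases"))
```

Keeping every annotation as its own record, rather than overwriting the website label with later findings, preserves the distinction between what a state says publicly and what other sources report.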