
Diffusion segregation and the disproportionate incidence of COVID-19 in African American communities

Cite this dataset

Nicosia, Vincenzo; Bassolas, Aleix; Sousa, Sandro (2020). Diffusion segregation and the disproportionate incidence of COVID-19 in African American communities [Dataset]. Dryad.


Each network corresponds to a metropolitan area, where nodes represent census tracts. For each tract we report the number of people belonging to each of the seven high-level ethnic groups defined in the US Census. Physical adjacency networks are undirected and unweighted: an edge between two tracts A and B indicates that A and B share a border. Commuting flow graphs are undirected and weighted: the weight of the edge between A and B is the average of the total number of work commuting trips from A to B and from B to A.


- Covid Data

Data on the incidence of COVID-19 cases and deaths among African Americans were obtained from [1] and used in their original form. The detailed temporal data on COVID-19 cases used in the multivariate analysis were obtained from [2].

- Adjacency networks

The ethnicity data associated with the node properties were obtained from the NHGIS website [3], and the dataset was split into individual files for each Combined Statistical Area (CSA) according to the column named "CSAA". The adjacencies were derived from the shapefiles delineating areal units at the census tract level, also obtained from [3]. Given the list of census tracts within a CSA, we linked each tract to the neighbours with which it shares a border and then extracted the largest connected component, so that the resulting edge-list and node properties include only the elements of that induced subgraph.
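The component-extraction step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the repository script; the function names and the toy tract codes are hypothetical, and the border-sharing test on the shapefile geometries is assumed to have already produced the list of neighbouring pairs.

```python
from collections import deque


def largest_component(edges):
    """Return the node set of the largest connected component (undirected)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def induced_edges(edges, nodes):
    """Keep only the edges whose two endpoints both belong to `nodes`."""
    return [(a, b) for a, b in edges if a in nodes and b in nodes]


# Toy example with hypothetical tract codes: two disconnected groups.
edges = [("G01", "G02"), ("G02", "G03"), ("G10", "G11")]
giant = largest_component(edges)
kept = induced_edges(edges, giant)
```

Tracts without any neighbour never appear in the pair list, so they are dropped automatically in this sketch.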

- Commuting networks

The commuting graphs are weighted networks built from data on where people live and work [4]. In particular, the weight of a link going from a node i to a node j is given by the sum of the individuals living in i and working in j and those living in j and working in i. In this case, the population of an ethnicity is the sum of the individuals of that ethnicity living in a cell plus the weight of the incoming link times the residents of the same ethnicity in the origin.
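As an illustration of the link weights, the sketch below combines directed home-to-work counts into one weight per pair of tracts. It assumes that the weight is the sum of the flows in both directions and that flows within a single tract are ignored; both are assumptions, and the function name and toy counts are hypothetical.

```python
def symmetric_weights(flows):
    """Sum directed commuter counts into one weight per unordered pair.

    flows maps (home, work) -> number of commuters.
    """
    weights = {}
    for (home, work), count in flows.items():
        if home == work:
            # Assumption: people living and working in the same tract
            # do not contribute to any link.
            continue
        key = (home, work) if home < work else (work, home)
        weights[key] = weights.get(key, 0) + count
    return weights


# Toy counts, not taken from the dataset.
flows = {("a", "b"): 10, ("b", "a"): 4, ("a", "c"): 3, ("a", "a"): 7}
weights = symmetric_weights(flows)
```

Dividing each weight by 2 gives the average of the two directed flows instead of their sum.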

- Code

Readers interested in replicating the methods used to create the data can obtain the Python scripts from the following repository:

Note that the repository also includes the code to simulate the CCT and MFPT random walks on the adjacency and commute graphs so that the whole simulation can be replicated.

- References:

[1] Black Population in US; 2016. [Online; accessed 2020-05-30].

[2] COVID Racial Data Tracker; 2016. [Online; accessed 2020-05-30].

[3] Manson S, Schroeder J, Riper DV, Ruggles S. IPUMS National Historical Geographic Information System: Version 14.0 [Database]. Minneapolis, MN: IPUMS. 2019.

[4] Longitudinal Employer-Household Dynamics; 2016. [Online; accessed 2020-05-30].

[5] Bureau UC. American Community Survey 2014-2018 5-Year Data Release. US Government; 2019. Available from:

Usage notes

This data set is associated with the paper "Diffusion segregation and the disproportionate incidence of COVID-19 in African American communities" by A. Bassolas, S. Sousa, V. Nicosia, Journal of The Royal Society Interface (in press).

The data set consists of:

Networks of physical adjacency and commuting flows among census tracts in 171 major US cities; ethnicity distributions at the level of census tracts.

- Census data summary

* Year:             2010
* Geographic level: Census Tract (by State--County)
* Dataset:          2010 Census: SF 1a - P & H Tables [Blocks & Larger Areas]
* NHGIS code:       2010_SF1a
* NHGIS ID:         ds172
* Breakdown(s):     Geographic Subarea:

  Universe:    Total population
  Source code: P3
  NHGIS code:  H7X

- Geographic data summary

Shapefile: 2010 TIGER/Line + at Census Tract level
Extent: US country level
NHGIS modified the TIGER/Line definitions only by erasing coastal water areas.

- Commuting data summary

The number of people working in one census unit and living in another was obtained from the Longitudinal Employer-Household Dynamics data [4].


- Obtaining the edge list and node properties for CCT data

In the repository mentioned in the Code section the reader will find the script "", which returns the edge-list and the node properties (ethnicity distribution) of the largest connected component. The script takes the following inputs:
> Shapefile of the census tracts at country extent;
> The lookup code (GISJOIN) string used to match the tracts;
> CSV file containing the tracts within the CSA;
> Optional filename under which to save the node properties of the resulting connected component.

The individual CSV file for each CSA can be obtained by filtering the data on the `CSAA` field, either in the preferred editor or with a library such as pandas in Python by grouping rows on that column. The CSV files resulting from this process are available in the folder `census_ethnicity_csa`.
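The split by CSA can be sketched with the standard-library `csv` module (grouping with pandas would work equally well). The sample rows and the `H7X001` column name are illustrative assumptions rather than the actual NHGIS extract, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_csa(handle, column="CSAA"):
    """Group the rows of a CSV file by the value found in `column`."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row[column]].append(row)
    return dict(groups)


def write_group(rows, handle):
    """Write one group of rows back out as CSV, header included."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Toy input standing in for the census table (values are made up).
sample = io.StringIO(
    "GISJOIN,CSAA,H7X001\n"
    "G0100010020100,122,1912\n"
    "G0100010020200,122,2170\n"
    "G3600050000100,408,4500\n"
)
groups = group_by_csa(sample)
out = io.StringIO()
write_group(groups["122"], out)
```

With real data, `sample` would be an open file handle and each group would be written to its own file.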

The resulting file with the node properties is named by convention `nodes_ethnics_agg_censustract_csa_XXX` where XXX refers to the CSA numeric code. The column containing the total population was removed so that the output file contains:

        'Node ID',
        'White alone',
        'Black or African American',
        'American Indian and Alaska Native',
        'Native Hawaiian and Other Pacific Islander',
        'Some Other Race',
        'Two or More Races'

Note that the GISJOIN field is not the same as GEOID10; the GISJOIN field must be used when merging data from NHGIS. The definition of each CSA for the adjacency data is available in the file `csacodes_adjacency.csv`, which contains the CSA numeric code, description and state. For the CSA definitions of the commute data, please consult the file `csacodes.txt` in the MFPT folder.

Edge-list properties:
* Node IDs are defined by the row index of the population table.
* Edges are reported once.
* The resulting graph contains the giant component only.
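Because each edge is reported once, a reader loading an edge-list has to add both directions to obtain the neighbour sets of an undirected graph. The sketch below assumes two integer node IDs per line, separated by whitespace or commas, with no header row; the actual file layout should be checked before use, and the function name is hypothetical.

```python
import io


def read_edge_list(handle):
    """Build an undirected adjacency map from lines of 'a b' node IDs."""
    adj = {}
    for line in handle:
        # Accept either whitespace or commas as separators (assumption).
        parts = line.replace(",", " ").split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        a, b = int(parts[0]), int(parts[1])
        # Each edge appears once in the file, so store both directions.
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


# Toy edge-list standing in for an edges_ids_censustract_csa_XXX file.
adj = read_edge_list(io.StringIO("0 1\n1 2\n\n2,3\n"))
```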

- Running the CCT random walk
The random walk on the adjacency graph can be simulated by running the Python script ``, available in the repository given in the Code section. The following inputs must be provided:

> edge:    Edge-list file;
> prop:    Node properties file with the frequency of each ethnicity;
> num:     Number of walk repetitions from the same node;
> epsilon: Threshold value for the Jensen-Shannon divergence (JSD);
> idx:     Range of nodes or single node ID to run the walk from, e.g.: 1, 0-10.
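For readers unfamiliar with the quantity that `epsilon` thresholds, the sketch below computes the Jensen-Shannon divergence between two discrete distributions. The base-2 logarithm is an assumption, and which pair of distributions the script compares is not specified here; the function name is hypothetical.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) of two discrete distributions."""

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i = 0 contribute zero.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    # The mixture m is positive wherever p or q is, so kl never divides by 0.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


same = jensen_shannon([0.5, 0.5], [0.5, 0.5])      # identical distributions
disjoint = jensen_shannon([1.0, 0.0], [0.0, 1.0])  # non-overlapping support
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint supports), which gives a natural scale for choosing `epsilon`.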

All the files provided in this repository are already in the input format required by the script. An output file for each node is saved in the directory where the code is executed, in the following format:

"Ethnicity" "Time to reach epsilon"

where each line corresponds to one repetition of the random walk from node i.
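A per-node output file could be summarised along the following lines. The sketch assumes two whitespace-separated columns per line (an ethnicity label followed by a numeric time) and no header row; both are assumptions about the file layout, and the function name is hypothetical.

```python
import io
from collections import defaultdict


def mean_times(handle):
    """Average the second column grouped by the value in the first column."""
    totals = defaultdict(lambda: [0.0, 0])  # label -> [sum of times, count]
    for line in handle:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        label, time = parts[0], float(parts[1])
        totals[label][0] += time
        totals[label][1] += 1
    return {label: s / n for label, (s, n) in totals.items()}


# Toy output standing in for the file written for a single node.
means = mean_times(io.StringIO("0 12\n0 18\n1 40\n"))
```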

For the commute network, use the "" script together with the corresponding files in the "commute" folder; the output follows the same format.

- List of files in the CCT repository

* edges_ids_censustract_csa_XXX: Edge-list with the assigned numeric ID
* nodes_ethnics_agg_censustract_csa_XXX: Node properties associated with the census tract

Supplementary files:
* edges_geocode_censustract_csa_XXX: Edge-list with the GISJOIN codes
* nodes_id_geocode_XXX: Look up codes for node IDs to GISJOIN
* ethnics_agg_censustract_csa_XXX: Original CSV file with ethnicity data

Note that the commute data in the CCT folder are the same as those used in
MFPT; they were reformatted only for convenience, to run the `CCT`
random walk.

* colorethmix_XXX: Node properties
* network_XXX: Edge-list with the corresponding edge weight

- List of files in the CMFPT repository

* Network files for each city
* Ethnicity files for each city
* Cell assignment files for each city
* Codes to obtain the MFPT between ethnicities for the adjacency and commuting graphs

Network files:

The network files for the adjacency graph can be found at adjcsa/network_*.csv, where * corresponds to the city CSA code.
The network files for the commuting graph can be found at comcsa/network_*.csv, where * corresponds to the city CSA code.

Each file describes a directed graph whose header is:


Ethnicity files:

The ethnicity files for the adjacency graph can be found at adjcsa/coloreth_*.csv, where * corresponds to the city CSA code.
The ethnicity files for the commuting graph can be found at comcsa/colorethmix_*.csv, where * corresponds to the city CSA code.

Their header is:


The correspondence between the class code and each ethnicity is:

0 --> White alone
1 --> Black or African American alone
2 --> American Indian and Alaska Native alone
3 --> Asian alone
4 --> Native Hawaiian and Other Pacific Islander alone
5 --> Some Other Race alone
6 --> Two or More Races
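When post-processing the files in Python, the correspondence above can be kept as a small lookup table. The names `ETHNICITY_BY_CLASS` and `class_label` are hypothetical.

```python
# Class code -> ethnicity label, copied from the correspondence above.
ETHNICITY_BY_CLASS = {
    0: "White alone",
    1: "Black or African American alone",
    2: "American Indian and Alaska Native alone",
    3: "Asian alone",
    4: "Native Hawaiian and Other Pacific Islander alone",
    5: "Some Other Race alone",
    6: "Two or More Races",
}


def class_label(code):
    """Translate a class code (int or numeric string) into its label."""
    return ETHNICITY_BY_CLASS[int(code)]
```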

Cell equivalence files:

The equivalence files for the adjacency graph can be found at adjcsa/classequi_*.csv, where * corresponds to the city CSA code.
The equivalence files for the commuting graph can be found at comcsa/classequi_*.csv, where * corresponds to the city CSA code.

Their header is:


CMFPT Codes:

There are a total of 4 codes: two for the results on the real city and two for the null model.

The codes for the real cities are named adj.c for the adjacency network and com.c for the commuting graph. Both are used by compiling them and passing the CSA code of the desired city as an argument:

gcc adj.c -lm
./a.out csa_code

gcc com.c -lm
./a.out csa_code

The codes for the null model are named adj_null.c for the adjacency network and com_null.c for the commuting graph. Both are used by compiling them and passing the CSA code of the desired city as an argument:

gcc adj_null.c -lm
./a.out csa_code

gcc com_null.c -lm
./a.out csa_code
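Since all four programs compile to `a.out` by default, processing several cities is easier when each binary has its own name. The sketch below builds the commands with gcc's `-o` option and runs them through `subprocess`. The helper names are hypothetical, and the sketch assumes that gcc is installed and that the programs find their input files relative to the current directory.

```python
import subprocess


def build_commands(source, csa_code):
    """Return the compile command and the run command for one CSA code."""
    binary = source.rsplit(".", 1)[0]  # e.g. "adj.c" -> "adj"
    compile_cmd = ["gcc", source, "-lm", "-o", binary]
    run_cmd = ["./" + binary, str(csa_code)]
    return [compile_cmd, run_cmd]


def run_all(source, csa_codes):
    """Compile `source` once, then run the binary for every CSA code."""
    if not csa_codes:
        return
    subprocess.run(build_commands(source, csa_codes[0])[0], check=True)
    for code in csa_codes:
        subprocess.run(build_commands(source, code)[1], check=True)
```

For example, `run_all("adj.c", [csa_code])` compiles `adj.c` to `adj` and then runs `./adj csa_code`, matching the manual commands above apart from the binary name.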


Engineering and Physical Sciences Research Council, Award: EP/S027920/1