Data and code from: The impact of light-rail stations on income sorting in US urban areas

Name: Data and code from: The impact of light-rail stations on income sorting in US urban areas
Creator: Erik Nelson

Research facility: Bowdoin College

Published Oct 22, 2025 on Dryad. https://doi.org/10.5061/dryad.q573n5tww

Data files

Oct 22, 2025 version files 10.31 MB

Abstract

The impact of public transit (PT) on income sorting in U.S. cities has long been debated. Theory suggests that richer households may cluster near PT stations to minimize commute time – or avoid them in favor of more convenient automobile commuting. The equilibrium depends on factors such as PT speed relative to cars and the income gap between rich and poor households. Empirical evidence supports both possibilities, but prior multi-city studies suffer from identification flaws. Using data from 21 U.S. light-rail (LR) systems built or expanded since 1991, this study estimates the effect of new LR stations on nearby neighborhood incomes. My event-study design improves upon earlier work by constructing controls that match pre-treatment conditions and trends in treated station areas and by correcting for the bias that staggered treatment timing can introduce to event study estimates. Across the pooled sample, there is little evidence that new LR stations make surrounding neighborhoods poorer. In several cases, new stations increased nearby incomes. The effects of new stations on income sorting across individual urban areas are heterogeneous: in denser, low-car cities, LR stations tend to raise neighborhood incomes, while in more car-dependent cities, their influence is negligible.

This README.txt file was generated on 2025-10-04 by Erik Nelson.

GENERAL INFORMATION

Title of Dataset: The impact of light-rail stations on income sorting in US urban areas.
Author Information
Name: Erik Nelson
Institution: Bowdoin College
Address: 9700 College Station
Brunswick, ME 04011-8497.
Email: enelson2@bowdoin.edu

The Stata .do files in this repository generate the results that are plotted or tabulated in the paper "The impact of light-rail stations on income sorting in US urban areas." Each .do file loads the datasets it needs. All datasets are in .xlsx format. Each Excel file contains data for the urban area that is part of the file's name. The data in each Excel file is in panel form: each observation represents a treated or control area i in urban area u in year t. We observe each area i's average nominal per capita income and median HH income in year t = 1990, 2000, 2010, 2017, 2019, 2021, and 2022. These are the Census years; technically, income data was observed in 1989, 1999, 2006-2010, 2013-2017, 2015-2019, 2017-2021, and 2018-2022, respectively. We also observe the coordinates of i's centroid (which do not change over time) and, if the observation is a station area, the year the station opened.
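The correspondence between each 'year' label and the period in which income was actually observed can be written down as a small lookup. The Python sketch below is only an illustration and is not part of the deposited Stata or R code; the names INCOME_PERIOD, income_period, and is_multi_year are hypothetical.

```python
# Hypothetical helper, not part of the deposited code.
# Maps each `year` label in the .xlsx files to the (first, last) year of the
# income reference period listed in this README.
INCOME_PERIOD = {
    1990: (1989, 1989),
    2000: (1999, 1999),
    2010: (2006, 2010),
    2017: (2013, 2017),
    2019: (2015, 2019),
    2021: (2017, 2021),
    2022: (2018, 2022),
}


def income_period(year_label):
    """Return (first_year, last_year) of the income window for a panel year."""
    try:
        return INCOME_PERIOD[year_label]
    except KeyError:
        raise ValueError(f"{year_label} is not a panel year in this dataset")


def is_multi_year(year_label):
    """True when the income for this label was collected over several years."""
    first, last = income_period(year_label)
    return last > first
```

For example, income_period(2010) returns (2006, 2010), which is a reminder that the 2010 observation pools income reported over a five-year window.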

The .do files estimate each i's distance to its central business district (CBD), convert nominal dollar values to real dollar values (2022 USD), drop irrelevant observations and variables for the analysis at hand, combine individual urban area u datasets into a pooled dataset, and finally estimate the Callaway and Sant'Anna (CS)-adjusted event studies over pooled data and individual urban area data.
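The first two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Stata code in the .do files: it assumes that 'y' is latitude and 'x' is longitude, that distance to the CBD is a great-circle (haversine) distance, and that the caller supplies a price index. The function names, the Earth-radius constant, and the price_index argument are all hypothetical; see the .do files for the method and deflator actually used.

```python
import math

# Mean Earth radius in miles; an assumption of this sketch.
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def to_real_2022(nominal, year, price_index):
    """Rescale a nominal dollar value to 2022 dollars.

    price_index is a caller-supplied dict {year: index level}; no index
    values are provided here because they are not part of this README.
    """
    return nominal * price_index[2022] / price_index[year]
```

A call such as haversine_miles(row_y, row_x, cbd_lat, cbd_lon) would give an area's distance to a CBD point whose coordinates the user must supply.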

Several figures were created using R code. These R scripts graph results that were generated with the Stata .do files.

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain
Links to publications that cite or use the data:
Erik Nelson. (2025). The impact of light-rail stations on income sorting in US urban areas.
Recommended citation for this dataset:
Erik Nelson. (2025). The impact of light-rail stations on income sorting in US urban areas. Dryad Digital Repository. https://doi.org/XXX

DATA & CODE OVERVIEW
The best way to understand how the data is used in this project is to list the .do and .r scripts used to create each figure or table in the paper. Each .do and .r script loads and uses the necessary .xlsx files.

Figure 4 uses,
A) SummaryStatistics.do.
B) The various urban area Excel worksheets.

Figure 5, SM Table 2, and SM Table 3 use,
A) PooledandIndResultsHistoricalControls.do.
B) PooledandIndResultsHistLessOutliersControls.do.
C) The various urban area Excel worksheets.

Figure 6, SM Table 4, and SM Table 5 use,
A) PooledandIndResultsNotYetOpenControls.do.
B) PooledandIndResultsAlreadyOpenControls.do.
C) The various urban area Excel worksheets.

Figure 7, SM Table 6, and SM Table 7 use,
A) PooledandIndResultsBufferControls.do.
B) PooledandIndResultsPlacebo.do.
C) PooledabdIndResultsUntargetedControls.do.
D) The various urban area Excel worksheets.

Figure 8 uses,
A) Figure8.r.
B) CorrelationsUpdate.xlsx.

Table 3 uses,
A) SummaryStatistics.do.
B) The various urban area Excel worksheets.

Table 4 uses,
A) PooledandIndResultsHistoricalControls.do.
B) PooledandIndResultsHistLessOutliersControls.do.
C) PooledandIndResultsNotYetOpenControls.do.
D) PooledandIndResultsAlreadyOpenControls.do.
E) PooledandIndResultsBufferControls.do.
F) PooledandIndResultsPlacebo.do.
G) PooledabdIndResultsUntargetedControls.do.
H) The various urban area Excel worksheets.

SM Figure 1 uses,
A) SummaryStatistics.do.
B) The various urban area Excel worksheets.

SM Figure 2 uses,
A) LAGraph.r.

SM Figure 3 uses,
A) DenverGraph.r.

SM Figure 4 uses,
A) DALGraph.r.

SM Figure 5 uses,
A) PortlandGraph.r.

SM Figures 6 - 11 use,
A) SummaryStatistics.do.
B) The various urban area Excel worksheets.

SM Table 8 uses,
A) PooledandIndResultsHistoricalControlsTwoTenths.do.
B) PooledandIndResultsHistLessOutliersControlsTwoTenths.do.
C) PooledandIndResultsNotYetOpenControlsTwoTenths.do.
D) PooledandIndResultsAlreadyOpenControlsTwoTenths.do.
E) PooledandIndResultsBufferControlsTwoTenths.do.
F) The various urban area Excel worksheets.

SM Table 9 uses,
A) PooledandIndResultsHistoricalControlslnY.do.
B) PooledandIndResultsHistLessOutliersControlslnY.do.
C) PooledandIndResultsNotYetOpenControlslnY.do.
D) PooledandIndResultsAlreadyOpenControlslnY.do.
E) PooledandIndResultsBufferControlslnY.do.
F) PooledandIndResultsPlacebolnY.do.
G) The various urban area Excel worksheets.

SM Table 10 uses,
A) PooledandIndResultsHistoricalControlsClosetoCBD.do.
B) PooledandIndResultsHistLessOutliersControlsClosetoCBD.do.
C) PooledandIndResultsNotYetOpenControlsClosetoCBD.do.
D) PooledandIndResultsAlreadyOpenControlsClosetoCBD.do.
E) PooledandIndResultsBufferControlsClosetoCBD.do.
F) PooledandIndResultsPlaceboClosetoCBD.do.
G) The various urban area Excel worksheets.

SM Table 11 uses,
A) PooledandIndResultsHistoricalControlsNoAnti.do.
B) PooledandIndResultsHistLessOutliersControlsNoAnti.do.
C) PooledandIndResultsNotYetOpenControlsNoAnti.do.
D) PooledandIndResultsAlreadyOpenControlsNoAnti.do.
E) PooledandIndResultsBufferControlsNoAnti.do.
F) PooledandIndResultsPlaceboNoAnti.do.
G) The various urban area Excel worksheets.

#########################################################################
DATA-SPECIFIC INFORMATION FOR: .xlsx files with an urban area name in the file name but not the words "Placebo" or "Grid." All of these .xlsx files are in the zip file called "UrbanExcelFiles.zip"
Source: Various (see paper for details).

Number of variables: Varies
Number of cases/rows: Varies
Variable List: All .xlsx files contain a core set of variables; some .xlsx files also contain variables unique to the given urban area. First, I list the variables that are found in each .xlsx file with an urban area name in the file name.
Stationname: Name of station or control area.
StationID: Unique station or control area ID.
openingyear: the year a station or buffer associated with a station opened; equals 0 if the observation is a historic control observation.
year: year of measured income.
y: y-coordinate of observation's centroid in decimal degrees format.
x: x-coordinate of observation's centroid in decimal degrees format.
halfmile: equals 1 if the observation began as a half-mile radius circle around a station centroid or a point along the historic streetcar line; equals 0 otherwise. I say 'began' because the half-mile radius circle could be truncated.
fourtenths: equals 1 if the observation began as a four-tenths mile radius circle; equals 0 otherwise. I say 'began' because the four-tenths mile radius circle could be truncated.
threetenths: equals 1 if the observation began as a three-tenths mile radius circle; equals 0 otherwise. I say 'began' because the three-tenths mile radius circle could be truncated.
twotenths: equals 1 if the observation began as a two-tenths mile radius circle; equals 0 otherwise. I say 'began' because the two-tenths mile radius circle could be truncated.
onetenths: equals 1 if the observation began as a one-tenth mile radius circle; equals 0 otherwise. I say 'began' because the one-tenth mile radius circle could be truncated.
expansion: equals 1 if the observation is a station slated to open after 2022; equals 0 otherwise.
historic: equals 1 if the observation is a historic streetcar line control; equals 0 otherwise.
control: equals 1 if the observation is a buffer control; equals 0 otherwise.
alreadyopen: equals 1 if the observation is a station that opened between 1980 and 1989; equals 0 otherwise.
pcapinc: the area's average per capita income in 'year.'
medhhinc: the area's average median HH income in 'year.'
city: the urban area's name.

In some .xlsx files there is also information on the line a station belongs to or stations that are part of a rapid bus line.

Missing data codes: Empty cell.
Specialized formats or other abbreviations used: None
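
The indicator variables above can be decoded into a plain-language label for each row. The Python sketch below is an illustration only and is not part of the deposited code. It assumes each row is available as a dict keyed by the variable names above, that empty cells are read as None, and that a row with no indicator set and a positive openingyear is a treated station area (that last rule is an assumption, not something this README states). The names RADIUS_FLAGS, initial_radius_miles, and area_type are hypothetical.

```python
# Radius indicators documented above, mapped to the radius in miles.
RADIUS_FLAGS = {
    "halfmile": 0.5,
    "fourtenths": 0.4,
    "threetenths": 0.3,
    "twotenths": 0.2,
    "onetenths": 0.1,
}


def initial_radius_miles(row):
    """Radius (miles) of the circle the area began as, or None if no flag is set."""
    for flag, radius in RADIUS_FLAGS.items():
        if row.get(flag) == 1:
            return radius
    return None


def area_type(row):
    """Label a row using the documented 0/1 indicator variables."""
    if row.get("historic") == 1:
        return "historic streetcar line control"
    if row.get("control") == 1:
        return "buffer control"
    if row.get("expansion") == 1:
        return "station slated to open after 2022"
    if row.get("alreadyopen") == 1:
        return "already-open station"
    # Assumption of this sketch: no indicator set plus a positive opening
    # year means a treated station area.
    if (row.get("openingyear") or 0) > 0:
        return "treated station area"
    return "unclassified"
```

Rows labeled "unclassified" would need to be checked by hand against the paper, since the rule for treated areas here is only a guess at the coding.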

#########################################################################
DATA-SPECIFIC INFORMATION FOR: .xlsx files with an urban area name and "Placebo" in the file name. All of these .xlsx files are in the zip file called "UrbanPlaceboExcelFiles.zip"

Source: Various (see paper for details).

Number of variables: Varies
Number of cases/rows: Varies
Variable List: All .xlsx files contain a core set of variables; some .xlsx files also contain variables unique to the given urban area. First, I list the variables that are found in each .xlsx file with an urban area name in the file name.
Stationname: Name of station or control area
StationID: Unique station or control area ID
openingyear: the year a station or buffer associated with a station opened; equals 0 if the observation is a historic control observation.
year: year of measured income
y: y-coordinate of observation's centroid in decimal degrees format.
x: x-coordinate of observation's centroid in decimal degrees format.
halfmile: equals 1 if the observation began as a half-mile radius circle around a station centroid or a point along the historic streetcar line; equals 0 otherwise. I say 'began' because the half-mile radius circle could be truncated.
fourtenths: equals 1 if the observation began as a four-tenths mile radius circle; equals 0 otherwise. I say 'began' because the four-tenths mile radius circle could be truncated.
threetenths: equals 1 if the observation began as a three-tenths mile radius circle; equals 0 otherwise. I say 'began' because the three-tenths mile radius circle could be truncated.
twotenths: equals 1 if the observation began as a two-tenths mile radius circle; equals 0 otherwise. I say 'began' because the two-tenths mile radius circle could be truncated.
onetenths: equals 1 if the observation began as a one-tenth mile radius circle; equals 0 otherwise. I say 'began' because the one-tenth mile radius circle could be truncated.
expansion: equals 1 if the observation is a station slated to open after 2022; equals 0 otherwise.
historic: equals 1 if the observation is a historic streetcar line control; equals 0 otherwise.
control: equals 1 if the observation is a buffer control; equals 0 otherwise.
alreadyopen: equals 1 if the observation is a station that opened before 1989; equals 0 otherwise.
controltwo: equals 1 if the observation is a buffer's buffer control; equals 0 otherwise.
pcapinc: the area's average per capita income in 'year'
medhhinc: the area's average median HH income in 'year'
city: the urban area's name.

In some .xlsx files there is also information on the line a station belongs to or stations that are part of a rapid bus line.

Missing data codes: Empty cell.
Specialized formats or other abbreviations used: None

#########################################################################
DATA-SPECIFIC INFORMATION FOR: .xlsx files with an urban area name and "Grid" in the file name. All of these .xlsx files are in the zip file called "UrbanGridExcelFiles.zip"

Source: Various (see paper for details).

Number of variables: Varies
Number of cases/rows: Varies
Variable List: All .xlsx files contain a core set of variables; some .xlsx files also contain variables unique to the given urban area. First, I list the variables that are found in each .xlsx file with an urban area name in the file name.
Stationname: Name of station or control area
StationID: Unique station or control area ID
openingyear: the year a station or buffer associated with a station opened; equals 0 if the observation is a historic control observation.
year: year of measured income
y: y-coordinate of observation's centroid in decimal degrees format.
x: x-coordinate of observation's centroid in decimal degrees format.
halfmile: equals 1 if the observation began as a half-mile radius circle around a treated station centroid; equals 0 otherwise. I say 'began' because the half-mile radius circle could be truncated.
control: equals 1 if the observation is a grid cell control; equals 0 otherwise.
pcapinc: the area's average per capita income in 'year'
medhhinc: the area's average median HH income in 'year'
city: the urban area's name.
Missing data codes: Empty cell.
Specialized formats or other abbreviations used: None

#########################################################################
DATA-SPECIFIC INFORMATION FOR: CorrelationsUpdate.xlsx.
Source: Various (see paper for details).

Number of variables: Varies
Number of cases/rows: Varies
Variable List:
UrbanArea: name of urban area.
ATTHistHHInc: the urban area's CS-adjusted event study average ATT when historic controls and median HH income are used to estimate the event study model.
ATTHistPCInc: the urban area's CS-adjusted event study average ATT when historic controls and per capita income are used to estimate the event study model.
ATTBufHHInc: the urban area's CS-adjusted event study average ATT when buffer controls and median HH income are used to estimate the event study model.
ATTBufPCInc: the urban area's CS-adjusted event study average ATT when buffer controls and per capita income are used to estimate the event study model.
ATTUTHHInc: the urban area's CS-adjusted event study average ATT when untargeted grid cell controls and median HH income are used to estimate the event study model.
ATTUTPCInc: the urban area's CS-adjusted event study average ATT when untargeted grid cell controls and per capita income are used to estimate the event study model.
PopDen2020: urban area u's 2020 population density.
HomeValue: urban area u's 2019 median home value.
VehiclespHH: urban area u's 2019 vehicles per HH.
ATTHistHHIncSS: equals 1 if ATTHistHHInc is statistically significant at the p = 0.05 level; equals 0 otherwise.
ATTHistPCIncSS: equals 1 if ATTHistPCInc is statistically significant at the p = 0.05 level; equals 0 otherwise.
ATTBufHHIncSS: equals 1 if ATTBufHHInc is statistically significant at the p = 0.05 level; equals 0 otherwise.
ATTBufPCIncSS: equals 1 if ATTBufPCInc is statistically significant at the p = 0.05 level; equals 0 otherwise.
ATTUTHHIncSS: equals 1 if ATTUTHHInc is statistically significant at the p = 0.05 level; equals 0 otherwise.
ATTUTPCIncSS: equals 1 if ATTUTPCInc is statistically significant at the p = 0.05 level; equals 0 otherwise.
ATTHistHHIncPreTrend: equals 1 if the urban area's CS-adjusted event study estimate with historic controls and median HH income generates a pre-trend such that I cannot reject the null hypothesis H0: All pre-treatment coefficients = 0 at the p = 0.05 level; equals 0 otherwise.
ATTHistPCIncPreTrend: equals 1 if the urban area's CS-adjusted event study estimate with historic controls and per capita income generates a pre-trend such that I cannot reject the null hypothesis H0: All pre-treatment coefficients = 0 at the p = 0.05 level; equals 0 otherwise.
ATTBufHHIncPreTrend: equals 1 if the urban area's CS-adjusted event study estimate with buffer controls and median HH income generates a pre-trend such that I cannot reject the null hypothesis H0: All pre-treatment coefficients = 0 at the p = 0.05 level; equals 0 otherwise.
ATTBufPCIncPreTrend: equals 1 if the urban area's CS-adjusted event study estimate with buffer controls and per capita income generates a pre-trend such that I cannot reject the null hypothesis H0: All pre-treatment coefficients = 0 at the p = 0.05 level; equals 0 otherwise.
ATTUTHHIncPreTrend: equals 1 if the urban area's CS-adjusted event study estimate with untargeted grid cell controls and median HH income generates a pre-trend such that I cannot reject the null hypothesis H0: All pre-treatment coefficients = 0 at the p = 0.05 level; equals 0 otherwise.
ATTUTPCIncPreTrend: equals 1 if the urban area's CS-adjusted event study estimate with untargeted grid cell controls and per capita income generates a pre-trend such that I cannot reject the null hypothesis H0: All pre-treatment coefficients = 0 at the p = 0.05 level; equals 0 otherwise.
Missing data codes: Empty cell.
Specialized formats or other abbreviations used: None
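
The variable names in CorrelationsUpdate.xlsx follow a pattern: each ATT column has a matching column ending in "SS" and another ending in "PreTrend". The Python sketch below shows how one estimate and its two flags could be read together. It is only an illustration and is not part of the deposited code; it assumes a row is available as a dict keyed by the variable names above, and the names ATT_SPECS and describe_estimate are hypothetical.

```python
# The six ATT columns documented above.
ATT_SPECS = [
    "ATTHistHHInc", "ATTHistPCInc",
    "ATTBufHHInc", "ATTBufPCInc",
    "ATTUTHHInc", "ATTUTPCInc",
]


def describe_estimate(row, spec):
    """Summarize one ATT estimate and its two 0/1 flags for an urban area."""
    if spec not in ATT_SPECS:
        raise ValueError(f"unknown ATT column: {spec}")
    return {
        "urban_area": row["UrbanArea"],
        "att": row[spec],
        # SS = 1: the ATT is statistically significant at the p = 0.05 level.
        "significant": row[spec + "SS"] == 1,
        # PreTrend = 1: H0 (all pre-treatment coefficients = 0) is NOT
        # rejected at the p = 0.05 level.
        "pretrend_null_not_rejected": row[spec + "PreTrend"] == 1,
    }
```

Note the direction of the PreTrend flag: a value of 1 means the test did not detect a pre-trend, so it is the favorable outcome for the event study design.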