Urbanev: An open benchmark dataset for urban electric vehicle charging demand prediction

Li, Han¹; Qu, Haohao²; Tan, Xiaojun¹; You, Linlin¹; Zhu, Rui³; Fan, Wenqi²

Published Mar 17, 2025; Updated Apr 25, 2025 on Dryad. https://doi.org/10.5061/dryad.np5hqc04z

Data files

Mar 17, 2025 version files 101.83 MB

README.md
14.28 KB
UrbanEVDataset.zip
101.81 MB

Apr 25, 2025 version files 320.23 MB

README.md
14.66 KB
UrbanEVDataset.zip
320.22 MB

Abstract

The recent surge in electric vehicles (EVs), driven by a collective push to enhance global environmental sustainability, has underscored the significance of exploring EV charging prediction. To catalyze further research in this domain, we introduce UrbanEV, an open dataset showcasing EV charging space availability and electricity consumption in a pioneering city for vehicle electrification, namely Shenzhen, China. UrbanEV offers a rich repository of charging data (i.e., charging occupancy, duration, volume, and price) captured at hourly intervals across an extensive six-month span for over 20,000 individual charging piles. Beyond these core attributes, the dataset also encompasses diverse influencing factors like weather conditions and spatial proximity. These factors are thoroughly analyzed qualitatively and quantitatively to reveal their correlations and causal impacts on charging behaviors. Furthermore, comprehensive experiments have been conducted to showcase the predictive capabilities of various models, including statistical, deep learning, and transformer-based approaches, using the UrbanEV dataset. This dataset is poised to propel advancements in EV charging prediction and management, positioning itself as a benchmark resource within this burgeoning field.

Data description

The UrbanEV dataset was developed to meet the urgent need for understanding and forecasting electric vehicle (EV) charging demand in urban environments. As global EV adoption accelerates, efficient charging infrastructure management is crucial for ensuring grid stability and enhancing user experience. Collected from public EV charging stations in Shenzhen, China — a leading city in vehicle electrification — the dataset covers a six-month period (September 1, 2022, to February 28, 2023), capturing seasonal variations in charging patterns. To ensure data quality, the raw records underwent meticulous preprocessing, including the extraction of key information (availability status, rated power, and fees), anomaly removal, and missing value imputation via forward and backward filling. Outliers identified by the IQR method were replaced with adjacent valid values. The data was aggregated both temporally (hourly) and spatially (by traffic zones), with variance tests and zero-value filtering applied to exclude low-activity regions. The final dataset includes charging data (occupancy, duration, and volume), weather conditions, spatial features (adjacency matrices and distances), and static attributes (Points of Interest, area size, and road length).

To evaluate the dataset’s utility in EV charging demand prediction, a benchmarking study was conducted using three traditional forecasting methods, five deep learning models, and two Transformer-based predictors. Model performance was assessed via RMSE, MAPE, RAE, and MAE across three tasks: distribution prediction, node prediction, and factorial experiments. The first task examined spatial dependencies and global demand patterns, while the second focused on localized temporal characteristics. The factorial experiments assessed the influence of auxiliary factors (electricity prices, service fees, temperature, pressure, and humidity) on charging occupancy.
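
The four error metrics can be sketched in plain Python as follows. This is a minimal illustration, not the benchmark's own evaluation code: the function names, the choice to skip zero-valued targets in MAPE, and the definition of RAE as total absolute error divided by the total absolute deviation of the targets from their mean are all assumptions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction.

    Assumption: targets equal to zero are skipped to avoid division by zero.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

def rae(y_true, y_pred):
    """Relative absolute error (assumed definition, see lead-in)."""
    mean_true = sum(y_true) / len(y_true)
    abs_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    abs_deviation = sum(abs(t - mean_true) for t in y_true)
    return abs_error / abs_deviation

# Made-up occupancy values, for illustration only.
observed = [2, 4, 6]
predicted = [1, 4, 8]
print(rmse(observed, predicted), mape(observed, predicted))
print(mae(observed, predicted))  # 1.0
print(rae(observed, predicted))  # 0.75
```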

Files and Variables

File: UrbanEVDataset.zip

Description: The UrbanEVDataset archive contains raw electric vehicle (EV) charging data at individual charging stations (recorded at 5-minute intervals) as well as processed EV charging data aggregated at the administrative zone level (provided in both 5-minute and hourly intervals). Additionally, it includes static spatial data for charging stations and administrative zones, along with corresponding meteorological datasets. This comprehensive dataset enables researchers to replicate experimental outcomes and perform extensive analyses on EV charging behaviors.

Folder - 20220901-20230228_station-raw: Contains raw EV charging data collected over a six-month period (from September 1, 2022, to February 28, 2023) at the charging station level. Some stations exhibit missing data points, which can be resolved using appropriate imputation techniques.

Subfolder - charge_5min: Includes 1,682 CSV files (from 1001.csv to 2682.csv), each corresponding to data from a specific charging station. Each file contains identical fields:
- time: Timestamp, ranging from 2022-09-01 00:00 to 2023-02-28 23:00, format: YYYY-MM-DD HH:MM (Unit: None).
- busy: Number of occupied charging piles (Unit: None).
- idle: Number of available (idle) charging piles (Unit: None).
- s_price: Service fee per kilowatt-hour of electricity (Unit: CNY/kWh).
- e_price: Electricity price per kilowatt-hour (Unit: CNY/kWh).
- fast_busy: Number of occupied fast charging piles (Unit: None).
- fast_idle: Number of available fast charging piles (Unit: None).
- slow_busy: Number of occupied slow charging piles (Unit: None).
- slow_idle: Number of available slow charging piles (Unit: None).
- duration: Total charging duration aggregated across all charging piles (Unit: hours).
- volume: Total energy dispensed, computed by multiplying charging duration with respective charging pile power ratings (Unit: kWh).
File - pile_rated_power.csv: Contains information on 22,650 individual charging piles. Fields include:
- pileNo: Unique identifier for each charging pile. (Unit: None)
- power: Rated power of the charging pile (Unit: kW).
- pileType: Type of charging pile, categorized as Direct Current (DC) or Alternating Current (AC). (Unit: None)
- station_id: Unique identifier of the charging station. (Unit: None)
File - station_distance.csv: Distance matrix among the 1,682 charging stations (Unit: meters).
File - station_information.csv: Static information for 1,682 charging stations. Fields include:
- station_id: Unique identifier for each charging station. (Unit: None)
- longitude: Longitude coordinate of the charging station in WGS-84 coordinate system. (Unit: None)
- latitude: Latitude coordinate of the charging station in WGS-84 coordinate system. (Unit: None)
- slow_count: Count of slow charging piles at the station. (Unit: None)
- fast_count: Count of fast charging piles at the station. (Unit: None)
- charge_count: Total number of charging piles at the station. (Unit: None)
- TAZID: Identifier of the administrative zone where the station is located. (Unit: None)
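
As a rough sketch of how a station file might be read and screened, the snippet below parses rows with the field names listed above and flags rows with a negative fee or with `busy + idle` not matching the station's `charge_count`. The sample rows are invented, the helper names are hypothetical, and the sketch assumes no blank cells; real station files contain missing points that would need handling before the `int` and `float` conversions.

```python
import csv
import io
from datetime import datetime

# Invented sample in the documented column layout (values are not real data).
SAMPLE = """time,busy,idle,s_price,e_price,fast_busy,fast_idle,slow_busy,slow_idle,duration,volume
2022-09-01 00:00,3,7,0.45,0.68,2,4,1,3,0.25,15.0
2022-09-01 00:05,4,5,0.45,0.68,3,3,1,2,0.33,20.0
2022-09-01 00:10,2,8,-1.0,0.68,1,5,1,3,0.17,10.0
"""

def load_station_rows(handle):
    """Parse the fields needed for screening from one station CSV."""
    rows = []
    for rec in csv.DictReader(handle):
        rows.append({
            "time": datetime.strptime(rec["time"], "%Y-%m-%d %H:%M"),
            "busy": int(rec["busy"]),
            "idle": int(rec["idle"]),
            "s_price": float(rec["s_price"]),
            "e_price": float(rec["e_price"]),
        })
    return rows

def flag_anomalies(rows, charge_count):
    """Return indices of rows with a negative fee or a pile-count mismatch."""
    flagged = []
    for i, row in enumerate(rows):
        negative_fee = row["s_price"] < 0 or row["e_price"] < 0
        count_mismatch = row["busy"] + row["idle"] != charge_count
        if negative_fee or count_mismatch:
            flagged.append(i)
    return flagged

rows = load_station_rows(io.StringIO(SAMPLE))
# charge_count would come from station_information.csv; 10 is a made-up value.
print(flag_anomalies(rows, charge_count=10))  # [1, 2]
```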

Folder - 20220901-20230228_zone-cleaned-aggregated: Provides cleaned and aggregated EV charging data at the administrative zone level. Erroneous data have been corrected, and missing values have been imputed. Data are available at both 5-minute and hourly intervals and include adjacency matrices, distance matrices, and static spatial attributes.

Subfolders - charge_5min and charge_1hour: Each subfolder contains datasets for 275 administrative zones at the corresponding temporal resolution. All data files within these directories have an identical structure: the first column (time) records timestamps ranging from 2022-09-01 00:00 to 2023-02-28 23:00 (format: YYYY-MM-DD HH:MM, Unit: None). Columns 2 to 276 contain charging-related data for each zone. The first row of these columns indicates the zone ID (TAZID), and subsequent rows provide the charging information specific to each file (e.g., occupancy for occupancy.csv, charging duration for duration.csv). Files include:
- duration.csv: Aggregated EV charging duration per time interval (Unit: hours).
- e_price.csv: Electricity prices (Unit: CNY/kWh).
- occupancy.csv: Number of occupied charging piles per time interval (Unit: None). Earlier versions of the hourly file reported an occupancy rate instead; see Version changes.
- s_price.csv: Service fees per kilowatt-hour (Unit: CNY/kWh).
- volume.csv: Aggregated EV charging volume per time interval (Unit: kWh).
- volume-11kW.csv: Supplementary charging volume data calculated using a standardized 11 kW rating for Tesla Model Y vehicles. (Unit: kWh).
File - adj.csv: Adjacency matrix of 275 administrative zones.
File - distance.csv: Distance matrix among 275 administrative zones (Unit: meters).
File - zone-information.csv: Spatial and charging pile information at the administrative zone level. Fields include:
- TAZID: Unique identifier for administrative zones. (Unit: None)
- longitude: Longitude coordinate of the zone centroid in WGS-84 coordinate system. (Unit: None)
- latitude: Latitude coordinate of the zone centroid in WGS-84 coordinate system. (Unit: None)
- charge_count: Total number of charging piles within the zone. (Unit: None)
- area: Area of the administrative zone (Unit: square meters).
- perimeter: Perimeter of the administrative zone (Unit: meters).
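
The wide layout described above (one time column followed by one column per TAZID) can be read as sketched below. The example also derives an occupancy rate by dividing occupied-pile counts by `charge_count` from zone-information.csv; this assumes occupancy.csv holds counts, as stated under Version changes. The zone IDs and all values are invented, and `read_wide` is a hypothetical helper name.

```python
import csv
import io

# Invented two-zone excerpts in the documented layouts.
OCCUPANCY = """time,102,104
2022-09-01 00:00,5,0
2022-09-01 01:00,8,3
"""
ZONE_INFO = """TAZID,longitude,latitude,charge_count,area,perimeter
102,114.05,22.54,20,1500000,5200
104,114.10,22.60,12,900000,4100
"""

def read_wide(handle):
    """Return (timestamps, {zone_id: [values]}) from a wide zone-level CSV."""
    reader = csv.reader(handle)
    header = next(reader)
    zone_ids = header[1:]  # first column is the timestamp
    series = {zone: [] for zone in zone_ids}
    times = []
    for row in reader:
        times.append(row[0])
        for zone, value in zip(zone_ids, row[1:]):
            series[zone].append(float(value))
    return times, series

times, occupied = read_wide(io.StringIO(OCCUPANCY))
piles = {rec["TAZID"]: int(rec["charge_count"])
         for rec in csv.DictReader(io.StringIO(ZONE_INFO))}

# Occupancy rate = occupied piles / total piles in the zone.
rate = {zone: [v / piles[zone] for v in values] for zone, values in occupied.items()}
print(rate["102"])  # [0.25, 0.4]
```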

Code/software

Description: This archive contains code for distribution time-series prediction using both traditional and deep learning models based on the UrbanEV dataset. Its modular functions help researchers reproduce the data validation and analysis results.

Software Requirements

To view the data, spreadsheet software such as Microsoft Excel can be used to open the CSV files. All data processing, analysis, and experimental validation are performed using Python and its related open-source libraries. The required dependencies and setup instructions are detailed below.

Environment Setup and Workflow

Environment Setup: Run the init_env.bat or init_env.sh script to create a virtual environment and install the required dependencies.
Data Placement: After setting up the environment, download and unzip the UrbanEVDataset.zip file. Place the folder 20220901-20230228_zone-cleaned-aggregated/charge_1hour containing the relevant charging data CSV files into the data folder. Additionally, include the following files from 20220901-20230228_zone-cleaned-aggregated in the data folder:
- adj.csv (Adjacency matrix for regions)
- distance.csv (Distance matrix for regions)
- zone-information.csv (Static information for regions; rename this file to inf.csv to avoid potential errors)
Experiment Execution: Run the exp.bat or exp.sh script to start time-series prediction experiments.
Output Management: Model checkpoints are saved in the checkpoints folder, and experiment results are stored in the results folder within the code directory.
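
The data placement step could be scripted roughly as follows. This is a sketch under assumptions: the unzipped root folder name `UrbanEVDataset`, the `code/data` target path, and the choice to copy the hourly CSV files directly into the data folder (rather than keeping the `charge_1hour` folder intact) are guesses, and `stage_data` is a hypothetical helper. The demo runs on a temporary directory with empty placeholder files, so no real data is touched.

```python
import shutil
import tempfile
from pathlib import Path

ZONE_DIR = "20220901-20230228_zone-cleaned-aggregated"

def stage_data(dataset_root, data_dir):
    """Copy the hourly zone files and spatial matrices into data_dir."""
    zone_dir = Path(dataset_root) / ZONE_DIR
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    # Hourly charging CSVs (assumed here to sit directly inside the data folder).
    for csv_path in sorted((zone_dir / "charge_1hour").glob("*.csv")):
        shutil.copy2(csv_path, data_dir / csv_path.name)
    shutil.copy2(zone_dir / "adj.csv", data_dir / "adj.csv")
    shutil.copy2(zone_dir / "distance.csv", data_dir / "distance.csv")
    # The workflow asks for zone-information.csv to be renamed to inf.csv.
    shutil.copy2(zone_dir / "zone-information.csv", data_dir / "inf.csv")
    return sorted(p.name for p in data_dir.iterdir())

# Demo on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "UrbanEVDataset"
    (root / ZONE_DIR / "charge_1hour").mkdir(parents=True)
    for name in ("occupancy.csv", "volume.csv"):
        (root / ZONE_DIR / "charge_1hour" / name).touch()
    for name in ("adj.csv", "distance.csv", "zone-information.csv"):
        (root / ZONE_DIR / name).touch()
    staged = stage_data(root, Path(tmp) / "code" / "data")

print(staged)
```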

Source Files

code: Contains code for distribution time-series prediction utilizing traditional and deep learning models based on the UrbanEV dataset. Key files include:
- baselines.py: Implements three traditional forecasting methods (Last Observation, Auto-regressive (AR), and ARIMA) along with five deep learning models (Fully Connected Neural Network (FCNN), Long Short-Term Memory (LSTM), Graph Convolutional Network (GCN), GCN-LSTM, and Attention-Based Spatial-Temporal Graph Convolutional Network (ASTGCN)).
- exp.bat/exp.sh: Scripts for initiating distribution time-series prediction tasks.
- init_env.bat/init_env.sh: Scripts to set up a virtual environment for running time-series predictions using the UrbanEV dataset.
- main.py: The main execution script.
- parse.py: Provides a command-line interface for configuring training parameters for spatiotemporal EV charging demand prediction models.
- preprocess.py: Converts data in the ./data/dataset folder into a format suitable for Transformer-based time-series models.
- train.py: Model training script.
- utils.py: Utility functions designed for UrbanEV dataset prediction tasks, including time-series cross-validation and dataset preparation.
code_transformer: Contains code for time-series prediction using Transformer-based models on the UrbanEV dataset. Key folders and files include:
- data_provider: Includes data_factory.py and data_loader.py, which prepare UrbanEV data in a format compatible with models like TimeXer.
- exp: Contains exp_basic.py and exp_long_term_forecasting.py, which define the time-series prediction tasks for Transformer models such as TimeXer.
- layers: Includes Conv_Blocks.py, Embed.py, and SelfAttention_Family.py, which define core layers used in TimeXer and TimesNet models.
- models: Includes TimesNet.py and TimesXer.py, which implement the complete structure and functionality of TimesNet and TimeXer models.
- utils: Contains masking.py, metrics.py, print_args, timefeatures.py, and tools.py, providing utility modules essential for the training and testing processes.
- exp.bat/exp.sh: Scripts for initiating Transformer-based model predictions.
- run.py: The main program for Transformer-based time-series predictions.

Supplemental information

Description: This supplemental dataset provides geographic boundaries, weather data, and points-of-interest (POI) for Shenzhen city, facilitating analysis of urban electric vehicle (EV) charging behavior in relation to infrastructure, climate, and local businesses.

File: UrbanEVSupplemental.zip

Folder - shenzhen_districts: Contains geographic data representing administrative districts of Shenzhen in ArcGIS format, using the WGS 1984 Albers coordinate system. Files include:
- shenzhen.shp: Primary shapefile containing geometric boundary information of administrative districts in Shenzhen.
- shenzhen.shx: Shapefile index, facilitating spatial indexing and rapid data retrieval.
- shenzhen.dbf: Database file providing attribute information for each administrative district, including district names, areas, and perimeters.
File - 20220901-20230228_weather_central.csv: Weather data sourced from the Futian Central Meteorological Station (central Shenzhen), normalized via Min-Max scaling to facilitate correlation analyses with EV charging patterns. Fields include:
- T: Air temperature (Unit: °C).
- P0: Atmospheric pressure at station altitude (Unit: mmHg).
- P: Sea-level atmospheric pressure (Unit: mmHg).
- U: Relative humidity (Unit: %).
- RAIN: Rain classification (0-no rain, 1-light rain, 2-moderate rain, 3-heavy rain). (Unit: None)
- Td: Dewpoint temperature (Unit: °C).
File - 20220901-20230228_weather_airport.csv: Supplementary weather data from Bao'an Airport Meteorological Station (Shenzhen), with fields identical to the central meteorological station data. Fields include:
- T: Air temperature (Unit: °C).
- P0: Atmospheric pressure at station altitude (Unit: mmHg).
- P: Sea-level atmospheric pressure (Unit: mmHg).
- U: Relative humidity (Unit: %).
- RAIN: Rain classification (0-no rain, 1-light rain, 2-moderate rain, 3-heavy rain). (Unit: None)
- Td: Dewpoint temperature (Unit: °C).
File - 20221201-shenzhen-poi.csv: Points of Interest (POI) data for Shenzhen, collected as of December 1, 2022. Fields include:
- longitude: Longitude coordinate of the point of interest in WGS-84 coordinate system. (Unit: None)
- latitude: Latitude coordinate of the point of interest in WGS-84 coordinate system. (Unit: None)
- primary_types: Category of POI, specifically including "Food and beverage" establishments. (Unit: None)

Version changes

23-Apr-2025:

Replaced previously processed 1-hour resolution data with raw 5-minute resolution data for each charging station in the 20220901-20230228_station-raw/charge_5min directory.
Updated occupancy.csv in 20220901-20230228_zone-cleaned-aggregated/charge_1hour to report the number of occupied charging piles, instead of occupancy rate (i.e., proportion of occupied piles), to ensure consistency with occupancy.csv files in other directories.

Access Information

Other publicly accessible locations of the data

GitHub Repository

To build a comprehensive and reliable benchmark dataset, we followed a rigorous process from data collection to dataset evaluation. The workflow comprises, in sequence, data acquisition, data processing, statistical analysis, and prediction assessment. Each stage is described in detail below.

Study area and data acquisition

Shenzhen, a pioneering city in global vehicle electrification, was selected for this study to offer insights into electric vehicle (EV) development that can serve as a reference for other urban centers. The study covers the whole of Shenzhen, where data on public EV charging stations distributed across the city were gathered from a publicly available EV charging service. Through this platform, users could access real-time information on each charging pile, including its availability (e.g., busy or idle), charging price, and geographic coordinates.

Accordingly, a program recorded the status of each charging pile (e.g., busy or idle, charging price, electricity volume, and coordinates) at five-minute intervals from September 1, 2022, to February 28, 2023. This data collection process was fully digital and did not require manual readings. Second, to examine the correlation between EV charging patterns and environmental elements, weather data for Shenzhen were acquired from two meteorological observatories, one at the airport and one in the central region. These meteorological data are publicly available on the Shenzhen Government Data Open Platform. Third, point of interest (POI) data were extracted through the Application Programming Interface Platform of AMap.com, covering three primary types: food and beverage services, business and residential, and lifestyle services. Lastly, the spatial and static data were organized based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. The collected data contain detailed spatiotemporal information that can be analyzed to provide insights into urban EV charging patterns and their correlations with meteorological conditions.

Processing raw information into well-structured data

To streamline the utilization of the UrbanEV dataset, we harmonize heterogeneous data from various sources into well-structured data with aligned temporal and spatial resolutions. This process can be segmented into two parts: the reorganization of EV charging data and the preparation of other influential factors.

EV charging data

The raw charging data, obtained from publicly available EV charging services, pertains to charging stations and predominantly comprises string-type records at a 5-minute interval. To transform this raw data into a structured time series tailored for prediction tasks, we implement the following three key measures:

Initial Extraction. From the string-type records, we extract vital information for each charging pile, such as availability (designated as "busy" or "idle"), rated power, and the corresponding charging and service fees applicable during the observed time periods. First, a charging pile is categorized as "active charging" if its states at two consecutive timestamps are both "busy". Consequently, the occupancy within a charging station can be defined as the count of in-use charging piles, while the charging duration is calculated as the product of the count of in-use piles and the time between the two timestamps (in our case, 5 minutes). Moreover, the charging volume in a station can correspondingly be estimated by multiplying the duration by the piles' rated power. Finally, the average electricity price and service price are calculated for each station in alignment with the same temporal resolution as the three charging variables.
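
The extraction rule can be illustrated with a short sketch. A pile counts as actively charging when it is "busy" at two consecutive timestamps; occupancy is the number of such piles, duration is that count times the 5-minute step, and volume is the step length times each active pile's rated power. The function name, the dictionary-based inputs, and the pile IDs and power ratings are hypothetical.

```python
INTERVAL_HOURS = 5 / 60  # one 5-minute step expressed in hours

def station_step(prev_states, curr_states, rated_power_kw):
    """Occupancy, duration (hours) and volume (kWh) for one 5-minute step.

    prev_states and curr_states map a pile ID to "busy" or "idle".
    """
    active = [pile for pile, state in curr_states.items()
              if state == "busy" and prev_states.get(pile) == "busy"]
    occupancy = len(active)
    duration = occupancy * INTERVAL_HOURS
    volume = sum(rated_power_kw[pile] * INTERVAL_HOURS for pile in active)
    return occupancy, duration, volume

# Invented station with three piles (rated power in kW).
power = {"A": 60.0, "B": 7.0, "C": 120.0}
before = {"A": "busy", "B": "busy", "C": "idle"}
after = {"A": "busy", "B": "idle", "C": "busy"}

# Only pile A is busy at both timestamps, so it alone counts as active.
print(station_step(before, after, power))
```

In this example the station has one active pile, a duration of one twelfth of an hour, and an estimated volume of about 5 kWh (60 kW for 5 minutes).
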
Error Detection and Imputation. Data quality is critical when charging data feed decision-making, analytics, and machine-learning applications, because inaccuracies and inconsistencies (often called dirty data) can compromise the reliability and validity of any subsequent analysis or modeling. We identified several kinds of errors in the charging data, notably negative charging fees and inconsistencies between the counts of occupied, idle, and total charging piles. Records containing these anomalies were removed and treated as missing data. Missing values were then handled by a two-step imputation process: forward filling first replaced each missing value with the value from the preceding timestamp, and backward filling then filled any gaps at the start of each time series. The dataset also contained outliers that could significantly affect prediction performance. These were detected with the interquartile range (IQR) method for charging volume (v), charging duration (d), and the rate of active charging piles at the charging station (o). To retain more of the original data and minimize the impact of outlier correction on the overall data distribution, we set the IQR coefficient to 4 instead of the default 1.5. Finally, each outlier was replaced by the mean of its adjacent valid values. This preprocessing pipeline transformed the raw data into a structured and analyzable dataset.
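
A minimal sketch of the imputation and outlier steps follows. It is not the authors' code: the quartile estimator (`statistics.quantiles` with its default method), the use of `None` to mark missing values, and the reading of "adjacent valid values" as the nearest non-outlier on each side are assumptions, as are the function names.

```python
import statistics

def fill_missing(values):
    """Forward fill, then backward fill leading gaps (None marks missing)."""
    filled = list(values)
    for i in range(1, len(filled)):
        if filled[i] is None:
            filled[i] = filled[i - 1]
    for i in range(len(filled) - 2, -1, -1):
        if filled[i] is None:
            filled[i] = filled[i + 1]
    return filled

def replace_outliers(values, k=4.0):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] by the mean of their
    nearest valid neighbours (k=4 follows the coefficient stated in the text)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    valid = [low <= v <= high for v in values]
    cleaned = list(values)
    for i, ok in enumerate(valid):
        if ok:
            continue
        left = next((values[j] for j in range(i - 1, -1, -1) if valid[j]), None)
        right = next((values[j] for j in range(i + 1, len(values)) if valid[j]), None)
        neighbours = [v for v in (left, right) if v is not None]
        if neighbours:
            cleaned[i] = sum(neighbours) / len(neighbours)
    return cleaned

# Invented series with two gaps and one obvious spike.
raw = [None, 10, 12, None, 11, 500, 13, 12, 11, 10]
filled = fill_missing(raw)
cleaned = replace_outliers(filled)
print(filled)
print(cleaned)
```

Here the leading gap is back-filled with 10, the interior gap is forward-filled with 12, and the spike of 500 is replaced by 12.0, the mean of its neighbours 11 and 13.
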
Aggregation and Filtration. Building on the extracted and cleansed station-level charging data, we further organize the data into a region-level dataset with an hourly interval, which provides a new perspective for analyzing EV charging behavior. This involves two major processes: aggregation and filtration. First, we aggregate all the charging data both temporally and spatially:
a. Temporally, we standardize all time-series data to a common resolution of one hour, which serves as the common denominator among the various source resolutions. This establishes a unified temporal resolution for all time-series data, including pricing schemes, weather records, and charging data. The five-minute charging volume v and duration d are summed within each one-hour interval, whereas the occupancy o, electricity price p_e, and service price p_s are assigned their instantaneous values at specific hours for each charging pile. This distinction reflects the nature of the variables: v and d are cumulative, while o, p_e, and p_s are instantaneous. Compared with using the mean or median within each interval, selecting instantaneous values of o, p_e, and p_s preserves the original data patterns more effectively and minimizes the influence of human interpretation.
b. Spatially, stations are aggregated based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. After aggregation, the dataset comprises 331 regions (also called traffic zones) with 4,344 timestamps.
Second, variance tests and zero-value filtering were applied to remove traffic zones whose charging data are mostly zero or barely change: regions with an occupancy variance below 0.001 or a proportion of zero values exceeding 30% were excluded. As a result, 275 traffic zones are retained, encompassing a total of 1,362 charging stations and 17,532 charging piles, for subsequent use.
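
The hourly aggregation and the zone filter can be sketched as below. Two points are assumptions rather than statements from the text: the instantaneous value for each hour is taken as the first 5-minute sample of that hour, and the variance test uses the population variance. The function names are hypothetical.

```python
import statistics

STEPS_PER_HOUR = 12  # twelve 5-minute samples per hour

def to_hourly(samples, cumulative):
    """Aggregate 5-minute samples to hourly values.

    Cumulative variables (volume, duration) are summed; instantaneous ones
    (occupancy, prices) take the first sample of the hour (an assumption).
    A trailing incomplete hour is dropped.
    """
    hourly = []
    for start in range(0, len(samples) - STEPS_PER_HOUR + 1, STEPS_PER_HOUR):
        window = samples[start:start + STEPS_PER_HOUR]
        hourly.append(sum(window) if cumulative else window[0])
    return hourly

def keep_zone(occupancy, min_variance=0.001, max_zero_share=0.30):
    """Keep a zone unless its occupancy barely varies or is mostly zero."""
    zero_share = sum(1 for v in occupancy if v == 0) / len(occupancy)
    return (statistics.pvariance(occupancy) >= min_variance
            and zero_share <= max_zero_share)

# Invented two-hour series.
volume_5min = [1.0] * 24
occupancy_5min = list(range(24))
print(to_hourly(volume_5min, cumulative=True))      # [12.0, 12.0]
print(to_hourly(occupancy_5min, cumulative=False))  # [0, 12]
print(keep_zone([0, 0, 0, 0, 1, 2, 3, 4, 5, 6]))    # False: 40% zeros
```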

Other influential factors

Apart from the EV charging data, we also constructed a set of variables that might influence charging behaviors. These variables fall into three classes: temporal factors, spatial attributes, and static features. First, the temporal factors include three weather conditions: air temperature (T_a), relative humidity (h), and atmospheric pressure (P). The raw weather data were collected from two meteorological observatories located in the airport and central regions of Shenzhen and were organized into numeric data with the same hourly interval as the structured charging data. Notably, the weather data are shared by all the charging stations and traffic zones. Second, spatial information, such as the adjacency matrix and distances, is computed using ArcGIS tools. Specifically, the adjacency is determined by checking whether two traffic zones have adjacent edges based on the distance calculated from their geometric centers. Lastly, UrbanEV also provides static features including Point of Interest (POI), area, and road length in each traffic zone. Only the features relevant to charging activities within the 275 zones of the structured charging data were retained.
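
For readers without ArcGIS, a centroid-to-centroid distance matrix can be approximated from the WGS-84 longitude and latitude columns with the haversine formula, as sketched below. This is a swapped-in technique for illustration, not the procedure used to build distance.csv, so results may differ from the published matrix. The zone IDs and coordinates are invented.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two WGS-84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def distance_matrix(centroids):
    """Pairwise distances for {zone_id: (longitude, latitude)}."""
    ids = list(centroids)
    return {a: {b: haversine_m(*centroids[a], *centroids[b]) for b in ids}
            for a in ids}

# Invented centroids for two zones.
zones = {"102": (114.05, 22.54), "104": (114.10, 22.60)}
matrix = distance_matrix(zones)
print(round(matrix["102"]["104"]))
```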