Active region magnetograms for solar flare prediction: Full resolution dataset
Data files
Oct 15, 2023 version files 1.11 GB
- C1.0_24hr_Labels.txt
- Lat60_Lon60_Nans0_C1.0_24hr_features.csv
- List_of_AR_in_Test_Data_by_AR.csv
- List_of_AR_in_Train_Data_by_AR.csv
- List_of_AR_in_Validation_data_by_AR.csv
- README.md
- Test_Data_by_AR.csv
- Train_Data_by_AR.csv
- Validation_data_by_AR.csv
Abstract
In this dataset, we provide a comprehensive collection of magnetograms from the National Aeronautics and Space Administration's (NASA's) Solar Dynamics Observatory (SDO). The dataset incorporates data from three sources and provides SDO Helioseismic and Magnetic Imager (HMI) magnetograms of solar active regions as well as labels of corresponding flaring activity. This dataset will be useful for image analysis or solar physics research related to magnetic structure, its evolution over time, and its relation to solar flares. The dataset will be of interest to those researchers investigating automated solar flare prediction methods, including supervised and unsupervised machine learning (classical and deep), binary and multi-class classification, and regression. This dataset is a minimally processed, user configurable dataset of consistently sized images of solar active regions that can serve as a benchmark dataset for solar flare prediction research. This dataset consists of full resolution images (see usage notes below).
README: Active Region Magnetograms for Solar Flare Prediction: Full Resolution Dataset
In this dataset we provide a comprehensive collection of magnetograms from the National Aeronautics and Space Administration's (NASA's) Solar Dynamics Observatory (SDO). The dataset incorporates data from three sources and provides SDO Helioseismic and Magnetic Imager (HMI) magnetograms of solar active regions as well as labels of corresponding flaring activity. This dataset will be useful for image analysis or solar physics research related to magnetic structure, its evolution over time, and its relation to solar flares. The dataset will be of interest to those researchers investigating automated solar flare prediction methods, including supervised and unsupervised machine learning (classical and deep), binary and multi-class classification, and regression.
This preconfigured dataset consists of full resolution images generated from magnetograms of National Oceanic and Atmospheric Administration (NOAA) active regions (ARs) from 01 May 2010 through 31 December 2018 that are within +/- 60 degrees latitude and longitude and contain no not-a-number (NaN) pixels. Each image is labeled according to whether the region produced a flare exceeding Geostationary Operational Environmental Satellite (GOES) size C1.0 within the 24 hours following image acquisition. Binary classification labels are `1` if the region produced such a flare and `0` otherwise. Regression labels are the GOES flare size (e.g., `'C4.7'`) for flaring examples and `0` for non-flaring examples. The images are provided as 600x600 pixel `.fits` images. In total, this full resolution dataset contains 950,047 magnetogram images from 1570 ARs.
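The labeling rule described above can be sketched as follows. This is a minimal illustration, not the curation code from the repository: the names `goes_key` and `label_image` are hypothetical, the use of the largest flare when several fall inside the window is an assumption, and so is the strict comparison (a flare of exactly C1.0 is treated as non-flaring here, following the wording "exceeded").

```python
from datetime import datetime, timedelta


def goes_key(size):
    """Sort key for a GOES flare size string such as 'C4.7'.

    Classes are ordered A < B < C < M < X, then by the number within a class.
    """
    return ("ABCMX".index(size[0]), float(size[1:]))


def label_image(image_time, flares, window_hours=24, threshold="C1.0"):
    """Return (classification_label, regression_label) for one image.

    `flares` is a list of (peak_time, size) tuples for the image's AR.
    Assumption: if several flares fall in the window, the largest is reported.
    Assumption: the threshold comparison is strict (size > threshold).
    """
    window_end = image_time + timedelta(hours=window_hours)
    in_window = [
        size
        for peak_time, size in flares
        if image_time < peak_time <= window_end
        and goes_key(size) > goes_key(threshold)
    ]
    if not in_window:
        return 0, "0"
    return 1, max(in_window, key=goes_key)
```

Changing `window_hours` or `threshold` corresponds to the flare window and flare size criteria that a custom dataset configuration would vary.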
This dataset is described in detail in the paper at https://arxiv.org/abs/2305.09492 and is related to the code described in the github repository at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348. The image data and corresponding labels are provided in this dataset, as well as additional data useful for two machine learning problems of flare prediction: 1) a classical machine learning problem using features of magnetic complexity and a support vector machine (SVM) classifier and 2) a deep learning problem using transfer learning on the VGG network. These data are organized to be useful for other classical and deep learning problems, including both classification and regression.
Researchers interested in the reduced resolution dataset (224x224 pixel `.png` images) may be interested in the Dryad repository https://doi.org/10.5061/dryad.jq2bvq898.
Researchers interested in configuring a custom dataset according to other criteria of latitude, longitude, NaNs, flare size, and flare window may be interested in the extra images dataset at https://doi.org/10.5061/dryad.qjq2bvqmj and the code described under "General Code" in the github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348. Note: Researchers wishing to work with the entire dataset (all 1,357,004 images) to configure a custom dataset must combine the files from this full resolution preconfigured dataset and the extra images dataset https://doi.org/10.5061/dryad.qjq2bvqmj by moving/copying the subdirectories to a common base directory, e.g., `active_regions/`.
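One way to combine the two sets of AR subdirectories into a common base directory is sketched below. The function name `merge_ar_directories` and the directory names in the usage note are hypothetical. The sketch assumes each source root directly contains the `NNNN/` subdirectories and that an AR may appear in both datasets, in which case its files are merged into one directory.

```python
import shutil
from pathlib import Path


def merge_ar_directories(source_roots, base_dir):
    """Copy every AR subdirectory under each source root into base_dir.

    Subdirectories that share an AR number are merged file by file.
    """
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    for root in source_roots:
        for ar_dir in sorted(Path(root).iterdir()):
            if ar_dir.is_dir():
                # dirs_exist_ok (Python 3.8+) allows copying into an AR
                # directory that an earlier source root already created.
                shutil.copytree(ar_dir, base / ar_dir.name, dirs_exist_ok=True)
    return base
```

For example, `merge_ar_directories(["full_resolution_root", "extra_images_root"], "active_regions")` would populate `active_regions/` from two extracted datasets. Copying doubles the disk usage; moving the subdirectories instead avoids that at the cost of altering the extracted archives.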
Description of the data and file structure
The image dataset:
The following files comprise the image dataset itself.
- Image files: Images are provided in a directory structure consisting of 1570 directories named `NNNN/`, where `NNNN` is the four digit NOAA AR number. These files will extract to a directory structure `Lat60_Lon60_Nans0_png_224/NNNN`, where each of the subdirectories `NNNN/` contains a variable number of `.fits` files of the AR `NNNN` that satisfied the criteria of latitude, longitude, and acceptable number of NaNs. Each `.fits` filename is prefixed with the NOAA AR number for ease of correspondence. There are a total of 950,047 images in the dataset. These images are available on zenodo at the following links:
  - ARs 1064 through 1306: https://doi.org/10.5281/zenodo.7846553
  - ARs 1307 through 1505: https://doi.org/10.5281/zenodo.7852935
  - ARs 1506 through 1707: https://doi.org/10.5281/zenodo.7863946
  - ARs 1708 through 1918: https://doi.org/10.5281/zenodo.7869137
  - ARs 1919 through 2103: https://doi.org/10.5281/zenodo.7871173
  - ARs 2104 through 2283: https://doi.org/10.5281/zenodo.7876156
  - ARs 2284 through 2488: https://doi.org/10.5281/zenodo.7883556
  - ARs 2489 through 2731: https://doi.org/10.5281/zenodo.7886785
- `C1.0_24hr_Labels.txt`: a file containing the labels for each of the images in the dataset. The labels are formatted to provide both the regression and classification labels in a form that can be parsed for other applications. Each line in the file is of the form `filename,label` where `filename` is the base filename in the image set and `label` is the label. The label is formatted as a string `KX.X` for flaring regions, where `K` is the GOES flare class (`C`, `M`, or `X`) and `X.X` is the size, e.g., `4.7`. Non-flaring regions are assigned a label of `'0'`.
- `List_of_AR_in_Train_Data_by_AR.csv`, `List_of_AR_in_Validation_Data_by_AR.csv`, `List_of_AR_in_Test_Data_by_AR.csv`: files containing lists of NOAA ARs assigned to the training, validation, and test sets, respectively. These dataset splits are those used in the paper available at https://arxiv.org/abs/2305.09492. Each line in the files is of the format `NNNN`, the four digit NOAA AR number. Note: these lists are identical between this full resolution dataset and the reduced resolution dataset (available at https://doi.org/10.5061/dryad.jq2bvq898).
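The `filename,label` lines of the labels file can be parsed into both label types along the following lines. This is a sketch under assumptions: the function names are hypothetical, the file is assumed to have no header row, and the numeric regression target uses the standard GOES class scaling (C = 1e-6, M = 1e-5, X = 1e-4 W/m^2), which is one possible choice rather than the encoding used in the paper.

```python
import csv

# Peak soft X-ray flux (W/m^2) at the base of each GOES class in the labels.
GOES_BASE_FLUX = {"C": 1e-6, "M": 1e-5, "X": 1e-4}


def parse_label(label):
    """Return (classification, flux) for a label such as 'C4.7' or '0'."""
    label = label.strip().strip("'")  # tolerate quoted labels such as '0'
    if label == "0":
        return 0, 0.0
    return 1, GOES_BASE_FLUX[label[0]] * float(label[1:])


def read_labels(lines):
    """Map filename -> (classification, flux) from 'filename,label' lines."""
    labels = {}
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        filename, label = row
        labels[filename.strip()] = parse_label(label)
    return labels
```

Because `read_labels` accepts any iterable of lines, an open file handle can be passed directly, e.g. `with open("C1.0_24hr_Labels.txt", newline="") as f: labels = read_labels(f)`.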
Magnetic complexity features for the dataset:
The following file comprises data related to magnetic complexity features of the image dataset. These features were extracted from the full resolution images in the dataset and used in a classical machine learning problem in the paper available at https://arxiv.org/abs/2305.09492. The feature extraction and SVM classification code is available on github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348.
- `Lat60_Lon60_Nans0_C1.0_24hr_features.csv`: file with the 29 magnetic complexity features extracted from each of the full resolution images in the preconfigured dataset. Each line of the file contains 32 comma separated values. The first 29 values are the 29 magnetic complexity features as described in the paper at https://arxiv.org/abs/2305.09492. The last three values are the classification label (`1` or `0`), the regression label (flare size as a string `KX.X`, or `0`), and the base filename. The regression label is formatted as a string `KX.X` for flaring regions, where `K` is the GOES flare class (`C`, `M`, or `X`) and `X.X` is the size, e.g., `4.7`.
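Given the 32-column layout described above, the features file can be split into a feature matrix and label columns roughly as follows. The function name `read_features` is hypothetical, and the handling of a possible header row (skipping any row whose first 29 values are not numeric) is an assumption, since the presence of a header is not stated here.

```python
import csv

N_FEATURES = 29  # magnetic complexity features per image


def read_features(lines):
    """Split feature rows into (features, class_labels, flare_sizes, filenames)."""
    features, class_labels, flare_sizes, filenames = [], [], [], []
    for row in csv.reader(lines):
        if len(row) != N_FEATURES + 3:
            continue  # skip blank or malformed lines
        try:
            values = [float(v) for v in row[:N_FEATURES]]
        except ValueError:
            continue  # assumption: a header row, if present, is not numeric
        features.append(values)
        class_labels.append(int(float(row[N_FEATURES])))   # 1 or 0
        flare_sizes.append(row[N_FEATURES + 1].strip())    # 'KX.X' or '0'
        filenames.append(row[N_FEATURES + 2].strip())      # base filename
    return features, class_labels, flare_sizes, filenames
```

The returned lists can be converted to arrays and passed to a classifier such as an SVM. Since the file has on the order of a million rows, reading it line by line from an open file handle keeps memory use predictable.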
Dataframes for use in tensorflow:
The following files comprise data helpful for using the image dataset in tensorflow code. The code for transfer learning using this dataset with the VGG network is available on github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348.
- `Train_Data_by_AR.csv`, `Validation_Data_by_AR.csv`, `Test_Data_by_AR.csv`: files with labels for each of the images in the preconfigured dataset, formatted to provide classification labels in the format expected by a dataframe loader in tensorflow for the training, validation, and test sets, respectively. Each line is of the form `NNNN/filename,label` where `NNNN` is the AR directory, `filename` is the base filename, and `label` is the classification label (`1` for flaring and `0` for nonflaring). Researchers interested in developing dataframes for a different dataset split may be interested in the code provided on github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348. Note: the use of these full resolution `.fits` files also requires a custom dataloader, since the native tensorflow dataloaders cannot read `.fits` files. A custom data loader is provided as part of the transfer learning code on github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348.
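As a starting point for a custom loader, the `NNNN/filename,label` lines can be turned into (path, label) pairs as sketched below. This is not the data loader from the repository: the names `read_split` and `class_counts` are hypothetical, and skipping rows whose label is not an integer is an assumption made to tolerate a possible header row.

```python
import csv
from pathlib import Path


def read_split(lines, image_root):
    """Return a list of (path, label) pairs from 'NNNN/filename,label' lines."""
    samples = []
    for row in csv.reader(lines):
        if len(row) != 2 or not row[1].strip().isdigit():
            continue  # skip blank lines and any header row
        relative_path, label = row
        samples.append((Path(image_root) / relative_path.strip(), int(label)))
    return samples


def class_counts(samples):
    """Count nonflaring (0) and flaring (1) examples in a split."""
    counts = {0: 0, 1: 0}
    for _, label in samples:
        counts[label] += 1
    return counts
```

The pairs from `read_split` can then feed a generator that reads each `.fits` file and yields image batches, and `class_counts` gives a quick check of the class balance in each split before training.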
Sharing/access information
This dataset incorporates data from three main sources.
- First, in order to focus the image collection on ARs, we used the NOAA Space Weather Prediction Center (SWPC) Solar Region Summaries (SRS) ftp://ftp.swpc.noaa.gov/pub/warehouse/ and parsed those text data to extract the date an AR appeared on disk and the number of days it was visible on disk. Additionally, the SRS provide latitude and longitude of ARs which we use to postprocess the dataset.
- Second, we downloaded magnetogram images from SDO/HMI using the Joint Science Operations Center (JSOC) interface http://jsoc.stanford.edu/ajax/lookdata.html at a cadence of 720 seconds, centered at the NOAA AR centroid (tracked according to the Carrington rate), and with a spatial extent of 600x600 pixels.
- Third, we used the SWPC Event Reports (ER) ftp://ftp.swpc.noaa.gov/pub/warehouse/ to extract the AR number, peak flare time, and flare size in order to provide labels for those researchers investigating a supervised classification or regression problem.
Code/Software
The code used for the curation of this dataset and for flare prediction is provided on github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348. The github repository provides further details on how to run the code.
Methods
This dataset incorporates data from three main sources. First, in order to focus the image collection on ARs, we used the NOAA Space Weather Prediction Center (SWPC) Solar Region Summaries (SRS) (ftp://ftp.swpc.noaa.gov/pub/warehouse/) and parsed those text data to extract the date an AR appeared on disk and the number of days it was visible on disk. Additionally, the SRS provide latitude and longitude of ARs which we use to postprocess the dataset. Second, we downloaded magnetogram images from SDO/HMI using the Joint Science Operations Center (JSOC) interface (http://jsoc.stanford.edu/ajax/lookdata.html) at a cadence of 720 seconds, centered at the NOAA AR centroid (tracked according to the Carrington rate), and with a spatial extent of 600x600 pixels. Third, we used the SWPC Event Reports (ER) (ftp://ftp.swpc.noaa.gov/pub/warehouse/) to extract the AR number, peak flare time, and flare size in order to provide labels for those researchers investigating a supervised classification or regression problem.
Usage notes
Image data are provided in supplementary files available on zenodo (see links under Related works) as .fits files which can be opened with the python package astropy (https://www.astropy.org/). All other files included here are text files that can be opened with any standard text manipulation software. We do note, however, that many text files are very large (~1M lines), and may take a while to load.
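A minimal sketch of reading one of the `.fits` images with astropy is shown below. The function name `load_magnetogram` is hypothetical. Which header-data unit (HDU) holds the image is not stated here, so the sketch scans the file and returns the first two-dimensional array it finds; that search strategy is an assumption, not a documented property of these files.

```python
def load_magnetogram(path):
    """Return the first 2-D image found in a FITS file as a float32 array."""
    # Imported inside the function so that defining it does not
    # require numpy or astropy to be installed.
    import numpy as np
    from astropy.io import fits

    with fits.open(path) as hdul:
        for hdu in hdul:
            data = hdu.data
            if data is not None and data.ndim == 2:
                # np.array copies the pixels, so the result stays valid
                # after the file is closed.
                return np.array(data, dtype=np.float32)
    raise ValueError(f"no 2-D image data found in {path}")
```

For the images in this dataset the returned array would be expected to have shape (600, 600). Any scaling or clipping of the magnetic field values before use in a network is left to the caller.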
This is one of three datasets related to the same study:
- Reduced resolution dataset: Reduced resolution images (950,047 images, each of which is 224x224 pixels and 8-bit depth resolution), https://doi.org/10.5061/dryad.jq2bvq898.
- Full resolution dataset (this dataset): Full resolution images (950,047 images, each of which is 600x600 pixels and 17-bit depth resolution), https://doi.org/10.5061/dryad.dv41ns23n.
- Extra images dataset: Images that were excluded from the main analyses in the first and second datasets (421,957 images that were excluded for latitude, longitude, and/or NaN pixels), https://doi.org/10.5061/dryad.qjq2bvqmj. Researchers wishing to work with the entire dataset (all 1,357,004 images) must combine the files from the full resolution preconfigured dataset (https://doi.org/10.5061/dryad.dv41ns23n) and the extra images dataset (https://doi.org/10.5061/dryad.qjq2bvqmj) by moving/copying the subdirectories to a common base directory, e.g., active_regions/.