Active region magnetograms for solar flare prediction: Reduced resolution dataset
Data files
Versions: Apr 06, 2023; Apr 19, 2023; May 18, 2023; Oct 15, 2023. Each version totals 584.49 MB. The Apr 06, Apr 19, and May 18 versions list the same files with the same sizes, except README.md (8.85 KB, 8.97 KB, and 10.38 KB, respectively):
- C1.0_24hr_224_png_Labels.txt (57.57 MB)
- Lat60_Lon60_Nans0_C1.0_24hr_png_224_features.csv (465.15 MB)
- List_of_AR_in_Test_Data_by_AR.csv (785 B)
- List_of_AR_in_Train_Data_by_AR.csv (6.28 KB)
- List_of_AR_in_Validation_data_by_AR.csv (785 B)
- README.md
- Test_Data_by_AR_png_224.csv (6.16 MB)
- Train_Data_by_AR_png_224.csv (49.36 MB)
- Validation_Data_by_AR_png_224.csv (6.24 MB)
Abstract
In this dataset, we provide a comprehensive collection of magnetograms from the National Aeronautics and Space Administration's (NASA's) Solar Dynamics Observatory (SDO). The dataset incorporates data from three sources and provides SDO Helioseismic and Magnetic Imager (HMI) magnetograms of solar active regions as well as labels of corresponding flaring activity. This dataset will be useful for image analysis or solar physics research related to magnetic structure, its evolution over time, and its relation to solar flares. The dataset will be of interest to researchers investigating automated solar flare prediction methods, including supervised and unsupervised machine learning (classical and deep), binary and multi-class classification, and regression. This dataset is a minimally processed, user-configurable dataset of consistently sized images of solar active regions that can serve as a benchmark dataset for solar flare prediction research. This dataset consists of reduced resolution images (see usage notes below).
README: Active Region Magnetograms for Solar Flare Prediction: Reduced Resolution Dataset
In this dataset we provide a comprehensive collection of magnetograms from the National Aeronautics and Space Administration's (NASA's) Solar Dynamics Observatory (SDO). The dataset incorporates data from three sources and provides SDO Helioseismic and Magnetic Imager (HMI) magnetograms of solar active regions as well as labels of corresponding flaring activity. This dataset will be useful for image analysis or solar physics research related to magnetic structure, its evolution over time, and its relation to solar flares. The dataset will be of interest to those researchers investigating automated solar flare prediction methods, including supervised and unsupervised machine learning (classical and deep), binary and multi-class classification, and regression.
This preconfigured dataset consists of reduced resolution images generated from magnetograms of National Oceanic and Atmospheric Administration (NOAA) active regions (ARs) from 01 May 2010 through 31 December 2018 that are within +/- 60 degrees latitude and longitude and contain no not-a-number (NaN) pixels. Labels are provided for these images according to whether the region flared at a Geostationary Operational Environmental Satellite (GOES) flare size of >C1.0 within the 24 hours following image acquisition. Binary classification labels indicate whether the flare exceeded a GOES flare size of C1.0 (`1`) or not (`0`). Regression labels are provided as the GOES flare size (e.g., `'C4.7'`) for flaring examples and `0` for non-flaring examples. The images are provided as 224x224 pixel `.png` images. In total, this reduced resolution dataset contains 950,047 magnetogram images from 1570 ARs.
This dataset is described in detail in the paper at https://arxiv.org/abs/2305.09492 and is related to the code described in the github repository at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348. The image data and corresponding labels are provided in this dataset, as well as additional data useful for two machine learning problems of flare prediction: 1) a classical machine learning problem using features of magnetic complexity and a support vector machine (SVM) classifier and 2) a deep learning problem using transfer learning on the VGG network. These data are organized to be useful for other classical and deep learning problems, including both classification and regression.
Researchers interested in the full resolution dataset (600x600 pixel `.fits` images) can find it in the Dryad repository at https://doi.org/10.5061/dryad.dv41ns23n.
Researchers interested in configuring a custom dataset according to other criteria of latitude, longitude, NaNs, flare size, and flare window may be interested in the full resolution preconfigured dataset at https://doi.org/10.5061/dryad.dv41ns23n, the extra images dataset at https://doi.org/10.5061/dryad.qjq2bvqmj, and the code described under "General Code" in the github repository at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348. Note: researchers wishing to work with the entire dataset (all 1,357,004 images) to configure a custom dataset must combine the files from the full resolution preconfigured dataset (https://doi.org/10.5061/dryad.dv41ns23n) and the extra images dataset (https://doi.org/10.5061/dryad.qjq2bvqmj) by moving/copying the subdirectories to a common base directory, e.g., `active_regions/`.
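The combination step described in the note above could be sketched as follows. This is a minimal illustration and not the authors' code: the function name `merge_ar_directories` and the source directory names in the usage line are hypothetical, and it assumes each extracted dataset has one subdirectory per AR directly under its root.

```python
import shutil
from pathlib import Path


def merge_ar_directories(source_roots, target_root):
    """Copy the files of every AR subdirectory under each source root
    into a same-named subdirectory of target_root.

    Assumption: each source root contains one subdirectory per AR.
    Files that already exist in the target are left untouched.
    Returns the number of files copied.
    """
    target_root = Path(target_root)
    copied = 0
    for source_root in map(Path, source_roots):
        for ar_dir in sorted(p for p in source_root.iterdir() if p.is_dir()):
            destination = target_root / ar_dir.name
            destination.mkdir(parents=True, exist_ok=True)
            for image in sorted(p for p in ar_dir.iterdir() if p.is_file()):
                if not (destination / image.name).exists():
                    shutil.copy2(image, destination / image.name)
                    copied += 1
    return copied
```

A call might look like `merge_ar_directories(["full_resolution/", "extra_images/"], "active_regions/")`, where the two source paths are placeholders for wherever the archives were extracted. Copying rather than moving keeps the original downloads intact at the cost of extra disk space.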
Description of the data and file structure
The image dataset:
The following files comprise the image dataset itself.
- `Lat60_Lon60_Nans0_png_224.tar.gz`: a file containing a directory structure `Lat60_Lon60_Nans0_png_224/` under which are 1570 directories named `NNNN/`, where `NNNN` is the four digit NOAA AR number. Each of the directories `NNNN/` contains a variable number of `.png` files of the AR `NNNN` that satisfied the criteria of latitude, longitude, and acceptable number of NaNs. Each `.png` filename is prefixed with the NOAA AR number for ease of correspondence. There are a total of 950,047 images in the dataset. This file is hosted on zenodo at https://doi.org/10.5281/zenodo.7775776.
- `C1.0_24hr_224_png_Labels.txt`: a file containing the labels for each of the images in the dataset. The labels are formatted to provide both the regression and classification labels in a form that can be parsed for other applications. Each line in the file is of the form `filename,label`, where `filename` is the base filename in the image set and `label` is the label. The label is formatted as a string `KX.X` for flaring regions, where `K` is the GOES flare class (`C`, `M`, or `X`) and `X.X` is the size, e.g., `4.7`. Non-flaring regions are assigned a label of `'0'`.
- `List_of_AR_in_Train_Data_by_AR.csv`, `List_of_AR_in_Validation_data_by_AR.csv`, `List_of_AR_in_Test_Data_by_AR.csv`: files containing lists of NOAA ARs assigned to the training, validation, and test sets, respectively. These dataset splits are those used in the paper available at https://arxiv.org/abs/2305.09492. Each line in the files is of the format `NNNN`, the four digit NOAA AR number. Note: these lists are identical between this reduced resolution dataset and the full resolution dataset (available at https://doi.org/10.5061/dryad.dv41ns23n).
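The `filename,label` format described above can be parsed into both a regression target and a binary label at a chosen flare size, as in the minimal sketch below. It rests on a few assumptions: the function names are hypothetical, surrounding quotes on the label (if present in the file) are stripped, and the threshold comparison is inclusive. The conversion of `KX.X` to peak flux uses the standard GOES convention (C = 1e-6, M = 1e-5, X = 1e-4 W/m^2 per unit of `X.X`).

```python
import re

# Peak X-ray flux in W/m^2 corresponding to X.X = 1.0 in each GOES class.
GOES_CLASS_FLUX = {"C": 1e-6, "M": 1e-5, "X": 1e-4}
LABEL_PATTERN = re.compile(r"^([CMX])(\d+(?:\.\d+)?)$")


def flare_flux(label):
    """Convert a label such as 'C4.7' to peak flux; '0' maps to 0.0."""
    label = label.strip().strip("'\"")  # assumption: quotes may or may not be present
    if label == "0":
        return 0.0
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognized flare label: {label!r}")
    return GOES_CLASS_FLUX[match.group(1)] * float(match.group(2))


def parse_label_line(line, threshold="C1.0"):
    """Split one 'filename,label' line into (filename, flux, binary label).

    The binary label is 1 when the flux is at or above the threshold
    (inclusive comparison is an assumption of this sketch).
    """
    filename, label = line.strip().rsplit(",", 1)
    flux = flare_flux(label)
    return filename, flux, int(flux >= flare_flux(threshold))
```

Passing a different `threshold`, for example `"M1.0"`, yields binary labels for a stricter flare size without changing the label file. Note that the labels in this file were generated for a C1.0 threshold and a 24 hour window, so thresholds below C1.0 or other windows cannot be recovered from it.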
Magnetic complexity features for the dataset:
The following file comprises data related to magnetic complexity features of the image dataset. These features were extracted from the reduced resolution images in the dataset and used in a classical machine learning problem in the paper available at https://arxiv.org/abs/2305.09492. The feature extraction and SVM classification code is available on github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348.
- `Lat60_Lon60_Nans0_C1.0_24hr_png_224_features.csv`: a file with the 29 magnetic complexity features extracted from each of the reduced resolution images in the preconfigured dataset. Each line of the file contains 32 comma separated values. The first 29 values are the 29 magnetic complexity features as described in the paper at https://arxiv.org/abs/2305.09492. The last three values are the classification label (`1` or `0`), the regression label (flare size as a string `KX.X`, or `0`), and the base filename. The regression label is formatted as a string `KX.X` for flaring regions, where `K` is the GOES flare class (`C`, `M`, or `X`) and `X.X` is the size, e.g., `4.7`.
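The 32-column layout described above could be read as in the following sketch. It is an illustration under stated assumptions, not the repository's loader: the function name `read_feature_rows` is hypothetical, the file is assumed to have no header row, and every feature value is assumed to parse as a float.

```python
import csv

N_FEATURES = 29  # number of magnetic complexity features per row


def read_feature_rows(path):
    """Yield (features, class_label, regression_label, filename) per row.

    Assumptions: no header row, 32 comma separated values per line, and
    feature values that Python's float() can parse.
    """
    with open(path, newline="") as handle:
        for line_number, row in enumerate(csv.reader(handle), start=1):
            if len(row) != N_FEATURES + 3:
                raise ValueError(
                    f"line {line_number}: expected {N_FEATURES + 3} values, got {len(row)}"
                )
            features = [float(value) for value in row[:N_FEATURES]]
            class_label = int(float(row[N_FEATURES]))
            regression_label = row[N_FEATURES + 1].strip()
            filename = row[N_FEATURES + 2].strip()
            yield features, class_label, regression_label, filename
```

The function is a generator because the file is large (465.15 MB), so rows can be filtered or subsampled without holding the whole file in memory.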
Dataframes for use in tensorflow:
The following files comprise data helpful for using the image dataset in tensorflow code. The code for transfer learning using this dataset with the VGG network is available on github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348.
- `Train_Data_by_AR_png_224.csv`, `Validation_Data_by_AR_png_224.csv`, `Test_Data_by_AR_png_224.csv`: files with labels for each of the images in the preconfigured dataset, formatted to provide classification labels in the format expected by a dataframe loader in tensorflow, for the training, validation, and test sets, respectively. Each line is of the form `NNNN/filename,label`, where `NNNN` is the AR directory, `filename` is the base filename, and `label` is the classification label (`1` for flaring and `0` for nonflaring). Researchers interested in developing dataframes for a different dataset split may be interested in the code provided on github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348.
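For a different split, lines in the `NNNN/filename,label` form could be rebuilt from the label file and a list of AR numbers, roughly as sketched below. This is not the code from the repository: the function name `build_split_rows` is hypothetical, and the sketch assumes that the first four characters of each base filename are the AR number and that any label other than `0` denotes a flaring example.

```python
def build_split_rows(label_lines, split_ars):
    """Return 'NNNN/filename,label' strings for images whose AR is in split_ars.

    label_lines: iterable of 'filename,label' strings (label 'KX.X' or '0').
    split_ars:   iterable of four digit AR numbers, one per AR in the split.
    Assumption: the first four characters of filename are the AR number.
    """
    wanted = {str(ar).strip() for ar in split_ars}
    rows = []
    for line in label_lines:
        line = line.strip()
        if not line:
            continue
        filename, label = line.rsplit(",", 1)
        ar_number = filename[:4]
        if not ar_number.isdigit():
            raise ValueError(f"cannot read AR number from {filename!r}")
        if ar_number in wanted:
            binary = 0 if label.strip().strip("'\"") == "0" else 1
            rows.append(f"{ar_number}/{filename},{binary}")
    return rows
```

Splitting by AR number rather than by image keeps all images of one active region in the same set, which matches the "by AR" organization of the provided files.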
Sharing/access information
This dataset incorporates data from three main sources.
- First, in order to focus the image collection on ARs, we used the NOAA Space Weather Prediction Center (SWPC) Solar Region Summaries (SRS) ftp://ftp.swpc.noaa.gov/pub/warehouse/ and parsed those text data to extract the date an AR appeared on disk and the number of days it was visible on disk. Additionally, the SRS provide latitude and longitude of ARs which we use to postprocess the dataset.
- Second, we downloaded magnetogram images from SDO/HMI using the Joint Science Operations Center (JSOC) interface http://jsoc.stanford.edu/ajax/lookdata.html at a cadence of 720 seconds, centered at the NOAA AR centroid (tracked according to the Carrington rate), and with a spatial extent of 600x600 pixels. For this reduced resolution dataset, those 600x600 pixel images are resampled to 224x224 uint8 images (details provided in the paper at https://arxiv.org/abs/2305.09492).
- Third, we used the SWPC Event Reports (ER) ftp://ftp.swpc.noaa.gov/pub/warehouse/ to extract the AR number, peak flare time, and flare size in order to provide labels for those researchers investigating a supervised classification or regression problem.
Code/Software
The code used for the curation of this dataset and for flare prediction is provided on github at https://github.com/DuckDuckPig/AR-flares/, zenodo DOI https://zenodo.org/badge/latestdoi/284776348. The github repository provides further details on how to run the code.
Methods
This dataset incorporates data from three main sources. First, in order to focus the image collection on ARs, we used the NOAA Space Weather Prediction Center (SWPC) Solar Region Summaries (SRS) (ftp://ftp.swpc.noaa.gov/pub/warehouse/) and parsed those text data to extract the date an AR appeared on disk and the number of days it was visible on disk. Additionally, the SRS provide latitude and longitude of ARs, which we use to postprocess the dataset. Second, we downloaded magnetogram images from SDO/HMI using the Joint Science Operations Center (JSOC) interface (http://jsoc.stanford.edu/ajax/lookdata.html) at a cadence of 720 seconds, centered at the NOAA AR centroid (tracked according to the Carrington rate), and with a spatial extent of 600x600 pixels. Third, we used the SWPC Event Reports (ER) (ftp://ftp.swpc.noaa.gov/pub/warehouse/) to extract the AR number, peak flare time, and flare size in order to provide labels for those researchers investigating a supervised classification or regression problem.
Usage notes
Image data are provided in supplementary files available on zenodo (see link under Related works) as .png files which can be opened with any common image manipulation software. All other files included here are text files that can be opened with any standard text manipulation software. We do note, however, that many text files are very large (~1M lines), and may take a while to load.
This is one of three datasets related to the same study:
- Reduced resolution dataset (this dataset): Reduced resolution images (950,047 images, each of which is 224x224 pixels and 8-bit depth resolution), https://doi.org/10.5061/dryad.jq2bvq898.
- Full resolution dataset: Full resolution images (950,047 images, each of which is 600x600 pixels and 17-bit depth resolution), https://doi.org/10.5061/dryad.dv41ns23n.
- Extra images dataset: Images that were excluded from the main analyses in the first and second datasets (421,957 images that were excluded for latitude, longitude, and/or NaN pixels), https://doi.org/10.5061/dryad.qjq2bvqmj. Researchers wishing to work with the entire dataset (all 1,357,004 images) must combine the files from the full resolution preconfigured dataset (https://doi.org/10.5061/dryad.dv41ns23n) and the extra images dataset (https://doi.org/10.5061/dryad.qjq2bvqmj) by moving/copying the subdirectories to a common base directory, e.g., active_regions/.