Classifications of auroral phenomena in THEMIS All-Sky images obtained via self-supervised learning

Johnson, Jeremiah; Ozturk, Dogacan; Connor, Hyunju; Hampton, Donald; Blandin, Matthew; Keesee, Amy

Published Nov 28, 2024 on Dryad. https://doi.org/10.5061/dryad.sbcc2frft

Data files

Nov 28, 2024 version files (23.41 GB total):

README.md (2.68 KB)
themis-asi-predictions.zip (23.41 GB)

Abstract

We report a novel machine learning algorithm for automatically detecting and classifying aurora in all-sky images (ASI) that is largely trained without requiring ground-truth labels. By including a small number of labeled images, we are able to automatically label all of the approximately 700 million images in the Time History of Events and Macroscale Interactions during Substorms (THEMIS) ASI dataset from 2008 to 2022. We use a two-stage approach. In the first stage, we adapt the Simple framework for Contrastive Learning of Representations (SimCLR) algorithm to learn latent representations of THEMIS all-sky images. In the second stage, we fine-tune a classifier network on the latent representations our model learns of the manually labeled Oslo Auroral THEMIS (OATH) dataset. We demonstrate that this two-stage approach achieves excellent classification results on data for which there is no current ML classification benchmark. The outcome of this work will facilitate efficient information retrieval for researchers interested in specific categories of aurora and will enable large-scale statistical studies and machine learning analyses of THEMIS all-sky images that have not previously been possible. To demonstrate possible ways to utilize this database, we perform a statistical analysis of the occurrence rates of auroral labels with respect to solar wind parameters, the interplanetary magnetic field vector, and geomagnetic indices. We further investigate the occurrence rates of auroral phenomena in the annotated data set and their geoeffectiveness by utilizing the co-located THEMIS ground magnetometer data set.

Description of the data and file structure

This dataset contains classifications of all THEMIS All-Sky Images (ASI) captured 2008-2022 into one of six categories: arc, diffuse, discrete, cloudy, moon, clear. For each image, a probability for each of the six categories is provided. The classifications were obtained using a self-supervised machine learning model accessible at the link below.

Files and variables

File: themis-asi-predictions.zip

Description: This compressed directory contains the classification data. The data is organized into subdirectories by date in the format YYYY-MM-DD. Each subdirectory contains all of the classification data for the corresponding date in compressed CSV files. Each compressed CSV file contains one hour’s worth of data for one THEMIS ASI site.

Each compressed CSV file follows the naming convention SITE-DATETIME-probs.csv.gz. The DATETIME format used for the filenames is YYYY-MM-DDThh, where hh represents the hour during which the data was collected and is in UTC. SITE is a lowercase four-letter abbreviation for the THEMIS all-sky camera site where the data was collected.

A full list of THEMIS ASI sites and their abbreviations can be accessed at https://themis.ssl.berkeley.edu/gmag/asi_list.php?selyear=4000&selmonth=13&smap=on&sinfo=on&seltxt=0.

For example, in the subdirectory named 2013-03-15, the file gako-20130315T08-probs.csv.gz contains predictions for one hour’s worth of data obtained from the Gakona site on March 15, 2013, beginning at 08:00 UTC and ending at 09:00 UTC.
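The naming convention above can be unpacked programmatically. The following is a minimal Python sketch, not part of the dataset's own tooling: the names parse_prediction_filename and HourlyFile are hypothetical, and the pattern is written on the assumption that the date portion may appear either with hyphens (YYYY-MM-DDThh, as in the stated convention) or without them (YYYYMMDDThh, as in the example filename).

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# Hypothetical helper, not shipped with the dataset.
# Assumption: the date part may be written with or without hyphens,
# so both "2013-03-15T08" and "20130315T08" are accepted.
_FILENAME_RE = re.compile(
    r"^(?P<site>[a-z]{4})-"
    r"(?P<year>\d{4})-?(?P<month>\d{2})-?(?P<day>\d{2})T(?P<hour>\d{2})"
    r"-probs\.csv\.gz$"
)


class HourlyFile(NamedTuple):
    site: str        # lowercase four-letter site abbreviation, e.g. "gako"
    start: datetime  # start of the one-hour window, in UTC


def parse_prediction_filename(name: str) -> HourlyFile:
    """Split SITE-DATETIME-probs.csv.gz into its site and UTC start hour."""
    match = _FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized prediction filename: {name!r}")
    start = datetime(
        int(match["year"]),
        int(match["month"]),
        int(match["day"]),
        int(match["hour"]),
        tzinfo=timezone.utc,
    )
    return HourlyFile(match["site"], start)


info = parse_prediction_filename("gako-20130315T08-probs.csv.gz")
# Name of the date subdirectory (YYYY-MM-DD) expected to hold this file.
subdirectory = info.start.strftime("%Y-%m-%d")
```

For the example file, info.site is "gako" and subdirectory is "2013-03-15", matching the directory layout described above.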

Each compressed CSV file contains 7 columns named arc, diffuse, discrete, cloudy, moon, clear, and time. Each row contains the probabilities for each of the six classifications along with a timestamp in UTC that can be used to map the classifications to the corresponding all-sky image. Since THEMIS ASI are collected at a 3-second cadence, most files contain 1200 rows (3600 seconds per hour divided by 3 seconds per image), though some (those overlapping daylight hours) are shorter.
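As an illustration of how the per-image probabilities might be used, the sketch below reads one hourly file with the Python standard library and assigns each image its most probable category. The function names read_top_labels and label_counts are hypothetical, and the timestamp is kept as a plain string because its exact format is not specified here.

```python
import csv
import gzip
from collections import Counter

# The six category columns described above; "time" is the seventh column.
CATEGORIES = ("arc", "diffuse", "discrete", "cloudy", "moon", "clear")


def read_top_labels(path):
    """Yield (time, label, probability) for each row of one hourly file.

    Hypothetical helper: the label is simply the category with the
    highest probability in that row. The timestamp is returned as the
    raw string, since its exact format is an unknown here.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            probs = {name: float(row[name]) for name in CATEGORIES}
            label = max(probs, key=probs.get)
            yield row["time"], label, probs[label]


def label_counts(path):
    """Count how many images in one hourly file fall into each category."""
    return Counter(label for _, label, _ in read_top_labels(path))
```

Calling label_counts on a file such as gako-20130315T08-probs.csv.gz would return a Counter mapping category names to the number of images in that hour whose highest probability falls in that category. Taking the maximum is only one possible choice; a probability threshold could be applied instead when ambiguous images should be excluded.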

Code/software

The classification data are in gzip-compressed CSV files. Once decompressed, they can be viewed using any text editor.

Access information

Other publicly accessible locations of the data:

None

The data were derived from the self-supervised machine learning model that can be accessed here:

https://doi.org/10.5281/zenodo.11397580