Synthetic temporal dataset for temporal trend analysis and retrieval
Data files
May 07, 2024 version files 2.11 GB
Abstract
This repository contains a synthetic, temporal data set that was generated by the authors by sampling values from the Gaussian distribution. The dataset contains eight nontemporal dimensions, a temporal dimension, and a numerical measure attribute. The data set was generated according to the scheme and procedure detailed in this source paper: Kaufmann, M., Fischer, P.M., May, N., Tonder, A., Kossmann, D. (2014). TPC-BiH: A Benchmark for Bitemporal Databases. In: Performance Characterization and Benchmarking. TPCTC 2013. Lecture Notes in Computer Science, vol 8391. Springer, Cham. The data set can be used for analyzing and locating temporal trends of interest, where a temporal trend is generated by selecting the desired values of the nontemporal dimensions, and then selecting the corresponding values of the temporal dimension and the numerical measure attribute. Locating temporal trends of interest, e.g., unusual trends, is a common task in many applications and domains. It can also be of interest to understand which nontemporal dimensions are associated with the temporal trends of interest. To this end, the data set can be used for analyzing and locating temporal trends in the data cube induced by the data set.
README: Synthetic temporal dataset for temporal trend analysis and retrieval
https://doi.org/10.5061/dryad.q573n5trf
The data set can be used for analyzing and locating temporal trends of interest, where a temporal trend is generated by selecting the desired values of the nontemporal dimensions, and then selecting the corresponding values of the temporal dimension and the numerical measure attribute. Locating temporal trends of interest, e.g., unusual trends, is a common task in many applications and domains. It can also be of interest to understand which nontemporal dimensions are associated with the temporal trends of interest. To this end, the data set can be used for analyzing and locating temporal trends in the data cube induced by the data set, e.g., retrieving outlier temporal trends using an outlier detector.
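As a rough illustration of how a temporal trend can be obtained from one cell of the data cube, the sketch below filters a table on chosen nontemporal dimension values and then collects the measure over time. The column names follow this README; the tiny table, the helper name `extract_trend`, and the choice to average tuples that share a date are assumptions made for the example, not part of the published pipeline.

```python
import pandas as pd

# Tiny stand-in table using the column names described in this README.
# The values are made up for illustration only.
df = pd.DataFrame({
    "p_mfgr": ["M1", "M1", "M2", "M1"],
    "p_brand": ["B1", "B1", "B1", "B1"],
    "start_date": pd.to_datetime(
        ["2020-01-02", "2020-01-01", "2020-01-01", "2020-01-02"]
    ),
    "p_retailprice": [10.0, 20.0, 30.0, 14.0],
})


def extract_trend(data, selection, agg="mean"):
    """Return the measure over time for the rows matching `selection`.

    `selection` maps nontemporal dimension names to the desired values.
    Aggregating tuples that share a date with `agg` is an assumption.
    """
    mask = pd.Series(True, index=data.index)
    for column, value in selection.items():
        mask &= data[column] == value
    return (
        data.loc[mask]
        .groupby("start_date")["p_retailprice"]
        .agg(agg)
        .sort_index()
    )


# One trend: all tuples with p_mfgr == "M1" and p_brand == "B1".
trend = extract_trend(df, {"p_mfgr": "M1", "p_brand": "B1"})
```

Selecting fewer dimensions gives a coarser cell of the data cube, and selecting more gives a finer one.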
We generated the synthetic temporal data set [1], which contains up to 8 nontemporal dimensions, one temporal dimension, and a numerical measure attribute. The data set was generated using the DataFiller program [2] sampling values from the Gaussian distribution, to ensure that unusual trends exist in the data set.
- [1] Kaufmann, M., Fischer, P.M., May, N., Tonder, A., Kossmann, D. (2014). TPC-BiH: A Benchmark for Bitemporal Databases. In: Performance Characterization and Benchmarking. TPCTC 2013. Lecture Notes in Computer Science, vol 8391. Springer, Cham.
- [2] Coelho, F. (2014). DataFiller - generate random data from database schema. https://www.cri.ensmp.fr/people/coelho/datafiller.html
Description of the data and file structure
The RawSythesizedData folder has the raw synthesized data from DataFiller in .csv format. Each file has 4-8 columns and 6 million tuples. In our experiments with varying numbers of tuples, we sampled each 6-million-tuple file down to the sizes used in the experiments with the pandas sample function.
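The sampling step can be sketched with `DataFrame.sample` from pandas. The stand-in table, the target sizes, the helper name `downsample`, and the fixed seed below are assumptions for illustration. The README does not state which seed or sampling options were actually used.

```python
import pandas as pd


def downsample(data, n_rows, seed=0):
    """Draw `n_rows` tuples uniformly at random, without replacement."""
    return data.sample(n=n_rows, random_state=seed).reset_index(drop=True)


# Stand-in for one raw file; the real files hold 6 million tuples each.
raw = pd.DataFrame({"p_retailprice": range(1000)})

# Hypothetical target sizes, each half of the previous one.
sizes = [500, 250, 125]
samples = {n: downsample(raw, n) for n in sizes}
```

Fixing `random_state` makes the draw repeatable, which helps when the same sample has to be regenerated later.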
The PreprocessedData_xxx folders have the final data on which we ran TrendSurfer I, TrendSurfer II, and the exhaustive approach. They contain the data samples obtained after preprocessing, such as imputing missing values and normalization.
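The README names imputation and normalization but not the specific methods. The sketch below therefore assumes mean imputation and min-max scaling of the measure attribute, purely to show what such a step could look like. The helper name `preprocess` is hypothetical, and the actual preprocessing may differ.

```python
import pandas as pd


def preprocess(data, measure="p_retailprice"):
    """Fill missing measure values, then rescale the measure to [0, 1].

    Mean imputation and min-max scaling are assumptions; the README
    does not specify which methods were used.
    """
    out = data.copy()
    out[measure] = out[measure].fillna(out[measure].mean())
    lo, hi = out[measure].min(), out[measure].max()
    span = hi - lo
    # Guard against a constant column, which would divide by zero.
    out[measure] = (out[measure] - lo) / span if span > 0 else 0.0
    return out


sample = pd.DataFrame({"p_retailprice": [10.0, None, 30.0, 20.0]})
clean = preprocess(sample)
```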
- The "RawSythesizedData" folder has the raw synthesized data from DataFiller in .csv format. Each file has 6-8 columns and 6 million tuples. In our experiments with varying numbers of tuples, we sampled each 6-million-tuple file down to the sizes used in the experiments with the pandas sample function.
- The file tpc-bih-part-ncols-VARCHAR_6000000.csv, where n is the number of nontemporal dimensions, is the data set with n+2 total columns: n nontemporal dimensions, a temporal dimension, and a measure attribute of interest.
- The nontemporal dimensions (columns) p_partkey, p_name, p_mfgr, p_brand, p_type, p_container, p_col5, p_col6, p_col7, p_col8 are VARCHAR type data columns by which the data can be grouped. The p_retailprice column is the measure attribute of interest and contains numerical values. The start_date column is the temporal attribute and contains date values.
- The "datafiller" folder contains the possible values for each of the nontemporal dimensions.
- The "PreprocessedData_ChangeDimension" folder has the final data on which we conducted our experiments when varying the number of nontemporal dimensions in the data set. These are the data samples after preprocessing, such as imputing missing values and normalization.
- The folder "knn_k70/4-8cols_VARCHAR_gauss_23428" contains the data used with the k-Nearest Neighbors (kNN) outlier detector. These files contain 4-8 nontemporal dimensions and 23,428 tuples (rows).
- The file names within this folder include the names of the 4-8 nontemporal dimensions associated with each data set.
- Each file contains the p_retailprice measure attribute and the start_date temporal attribute.
- The folder "pca_k3_cblof_k10/4-8cols_VARCHAR_gauss_187500" contains the data used with the Principal Component Analysis (PCA) and Clustering-Based Detector (CBLOF) outlier detectors. These files contain 4-8 nontemporal dimensions and 187,500 tuples (rows).
- The file names within this folder include the names of the 4-8 nontemporal dimensions associated with each data set.
- Each file contains the p_retailprice measure attribute and the start_date temporal attribute.
- The "PreprocessedData_ChangeTuple" folder has the final data on which we conducted our experiments when varying the number of tuples in the data set. These are the data samples after preprocessing, such as imputing missing values and normalization.
- The folder "knn_k8" contains the data used with the k-Nearest Neighbors (kNN) outlier detector. These files contain 4 nontemporal dimensions and a variable number of tuples (rows).
- The folder names within this one include the number of tuples that are contained in each file (11719, 23438, 46875, 93750, or 187500).
- Each file contains the p_mfgr, p_brand, p_type, and p_container nontemporal dimensions, the p_retailprice measure attribute and the start_date temporal attribute.
- The folder "pca_k3_cblof_k5" contains the data used with the Principal Component Analysis (PCA) and Clustering-Based Detector (CBLOF) outlier detectors. These files contain 4 nontemporal dimensions and a variable number of tuples (rows).
- The folder names within this one include the number of tuples that are contained in each file (187500, 375000, 750000, 1500000, 3000000, or 6000000).
- Each file contains the p_mfgr, p_brand, p_type, and p_container nontemporal dimensions, the p_retailprice measure attribute and the start_date temporal attribute.
Other sources
In our experiments, we used the following outlier detectors:
- Principal Component Analysis (PCA): Shyu, M.-L., Chen, S.-C., Sarinnapakorn, K., Chang, L. (2003). A novel anomaly detection scheme based on principal component classifier. Technical report, University of Miami, Department of Electrical and Computer Engineering.
- k-Nearest Neighbors (kNN): Ramaswamy, S., Rastogi, R., Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record, vol 29 no. 2, pp. 427–438.
- Clustering-Based Detector (CBLOF): He, Z., Xu, X., Deng, S. (2003). Discovering cluster-based local outliers. Pattern Recognition Letters, vol 24 no. 9-10, pp. 1641–1650.
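To make the kNN detector concrete, the sketch below scores each item by the distance to its k-th nearest neighbor, which is the outlier score defined by Ramaswamy et al. It is a brute-force version for small inputs, not the efficient algorithm from that paper and not the code used in our experiments. Representing each temporal trend as a fixed-length tuple of measure values is an assumption for the example, as are the function name and the toy data.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by its distance to its k-th nearest neighbor.

    Larger scores indicate more isolated points. Brute force, O(n^2).
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in ascending order.
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores


# Toy trends: three similar series and one that deviates strongly.
trends = [
    (1.0, 1.1, 1.0),
    (1.0, 1.0, 1.1),
    (1.1, 1.0, 1.0),
    (5.0, 0.2, 4.0),
]
scores = knn_outlier_scores(trends, k=2)
most_unusual = max(range(len(trends)), key=scores.__getitem__)
```

Ranking trends by this score and keeping the top few is one simple way to retrieve candidate outlier trends.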