5-minute high-frequency data for SSE 50, CSI300, CSI500 and CSI 1000 indices
Data files
Aug 28, 2025 version files (197.95 MB)
Data1.zip (96.91 MB)
Data2_code.zip (101.04 MB)
README.md (7.05 KB)
Abstract
The realized recurrent conditional heteroscedasticity (RealRECH) model improves volatility prediction by integrating long short-term memory (LSTM), a recurrent neural network unit, into the realized generalized autoregressive conditional heteroskedasticity (RealGARCH) model. However, at present, there is no literature on the ability of the RealRECH model to fit and predict volatility in the Chinese market. In this paper, a study is conducted to test the in-sample explainability and out-of-sample prediction ability of the RealRECH model for the SSE50, CSI300, CSI500, and CSI1000 indices in the Chinese market and to determine whether it performs better than the RealGARCH model. The results of the in-sample analysis show that the RealRECH model not only provides better in-sample interpretability for all four indices but also captures the complex dynamics of time series volatility that the RealGARCH model cannot capture, such as long-term dependence and nonlinearity. The results of out-of-sample volatility prediction show that the RealRECH model better predicts the volatility of the CSI500 and CSI1000 indices but yields worse predictions for the SSE50 and CSI300 indices. Thus, the RealRECH model can be used for CSI500 and CSI1000 prediction.
The objective of this study is to conduct in-sample analysis and out-of-sample prediction of the volatility of four Chinese stock indices using the RealGARCH model and the LSTM-RealGARCH (RealRECH) model, and to compare how well the two models fit and predict that volatility. We therefore divided the data into an in-sample dataset and an out-of-sample dataset (1500 and 500 trading days, respectively).
The compressed package Data1 contains two folders, CSI300 and CSI500. Each contains two subfolders, SMC_for_RealGARCH and SMC_for_LSTM_RealGARCH.
The compressed package Data2&code (Data2_code.zip) contains three folders (CSI1000, SSE50 and RealRECH_norm), a compressed package RealRECH_norm, and a file realized_china. The folders CSI1000 and SSE50 each contain two subfolders, SMC_for_RealGARCH and SMC_for_LSTM_RealGARCH. The file realized_china contains the raw data used in this study: 5-minute high-frequency data for four Chinese stock indices over 2000 trading days.
The folders SSE50, CSI1000, CSI300 and CSI500 contain the output datasets of the volatility predictions for the four Chinese stock indices, based on 5-minute high-frequency data for 500 trading days. These are the results of the two models, obtained with the SMC algorithm with data annealing, and can be compared with the actual data. In each index folder, SMC_for_RealGARCH holds the predictions from the RealGARCH model and SMC_for_LSTM_RealGARCH holds the predictions from the LSTM-RealGARCH (RealRECH) model.
The compressed package RealRECH_norm and the folder RealRECH_norm contain the same contents.
In the folder RealRECH_norm, the folder in_sample contains the results of the in-sample analysis of 1500 trading days of 5-minute high-frequency data. It has three subfolders: utils_bag_in_sample, which contains the code files lors, conditional_variance_realgarch and conditional_variance_lstm_realgarch; Conditional Variance Fig, which contains the four conditional variance figures of the Chinese indices; and residual_QQ, which contains the four residual QQ plots of the same indices. The folder in_sample also contains four code files (calculate_lors, plot_conditional_variance_omega, plot_residual_QQ and residuals_analysis) and three result files (residual_analysis.xlsx, summary_lors.txt and summary_lors.xlsx).
The code files lors and calculate_lors implement Lo's modified rescaled-range (R/S) test for long-range dependence (Lo, 1991). The result files summary_lors.txt and summary_lors.xlsx contain the same test results in two formats.
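For readers unfamiliar with the test, the statistic from Lo (1991) can be sketched in a few lines. This is an illustrative Python version written for this description, not a port of the MATLAB files lors or calculate_lors; the function name `lo_modified_rs` and the choice of the truncation lag `q` are assumptions.

```python
import math


def lo_modified_rs(x, q):
    """Return Lo's modified R/S statistic V_n(q) for the series x.

    q is the truncation lag of the autocovariance correction;
    q = 0 gives the classical rescaled-range statistic.
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]

    # Range of the partial sums of deviations from the sample mean.
    partial, running = [], 0.0
    for d in dev:
        running += d
        partial.append(running)
    rs_range = max(partial) - min(partial)

    # Long-run variance: sample variance plus Bartlett-weighted autocovariances.
    var = sum(d * d for d in dev) / n
    for j in range(1, q + 1):
        gamma_j = sum(dev[i] * dev[i - j] for i in range(j, n)) / n
        var += 2.0 * (1.0 - j / (q + 1.0)) * gamma_j

    # Normalise the range by the long-run standard deviation and by sqrt(n).
    return rs_range / math.sqrt(var * n)
```

Under the null hypothesis of short-range dependence, the commonly quoted 95% acceptance interval for this statistic is approximately [0.809, 1.862]; values outside it are taken as evidence of long-range dependence.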
The code files conditional_variance_realgarch and conditional_variance_lstm_realgarch are used for the in-sample analysis with the RealGARCH model and the LSTM-RealGARCH (RealRECH) model, respectively. The code file residuals_analysis computes descriptive statistics of the residuals, such as skewness and kurtosis, and the result file residual_analysis.xlsx stores its output.
The code files plot_conditional_variance_omega and plot_residual_QQ produce the conditional variance figures and the residual QQ plots of the four Chinese indices, respectively. The folders Conditional Variance Fig and residual_QQ contain the resulting figures.
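As a point of reference for the residual statistics, the sketch below computes sample skewness and kurtosis from central moments. It is a Python illustration only; the function name `skewness_kurtosis` is hypothetical, and the use of uncorrected (biased) moment estimators and of non-excess kurtosis (3 for a normal distribution) is an assumption, since the description does not say which convention residuals_analysis follows.

```python
def skewness_kurtosis(residuals):
    """Return (skewness, kurtosis) based on uncorrected central moments."""
    n = len(residuals)
    mean = sum(residuals) / n

    # Second, third and fourth central moments (divided by n, no bias correction).
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n

    skew = m3 / m2 ** 1.5   # 0 for a symmetric sample
    kurt = m4 / m2 ** 2     # about 3 for normally distributed residuals
    return skew, kurt
```

Standardized residuals from a well-specified model with normal errors would be expected to show skewness near 0 and kurtosis near 3 under this convention.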
In the folder RealRECH_norm, the folder out_sample contains two folders (one step for lstm and one step for realrech), three code files (forecasted_volatility_predictive_scores, plot_forecasted_volatility and residuals_analysis), and two result files (residual_analysis and performance). The code file residuals_analysis and the result file residual_analysis play the same roles as the files of the same names in the folder in_sample.
The code file plot_forecasted_volatility generates plots of the one-step-ahead volatility forecasts of the four Chinese indices from the two models. All of these plots are in the folder one step for lstm; the folder one step for realrech is empty.
The code file forecasted_volatility_predictive_scores evaluates the out-of-sample volatility predictions and writes its results to the file performance, which reports six indicators: MSE1, MSE2, MAE1, MAE2, QLIKE and R2LOG. We use these indicators to measure the predictive performance of the two models; the lower the indicator, the better the prediction.
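The description names the six indicators without defining them. The Python sketch below uses the definitions that are common in the volatility-forecasting literature, with the realized variance serving as the proxy for the true variance; these formulas, the function name `predictive_scores` and its argument names are assumptions, and the MATLAB file forecasted_volatility_predictive_scores may differ in detail.

```python
import math


def predictive_scores(realized_var, forecast_var):
    """Return the six loss measures for paired variance proxies and forecasts.

    Both inputs are sequences of strictly positive variances of equal length.
    """
    pairs = list(zip(realized_var, forecast_var))
    n = len(pairs)
    return {
        # Squared and absolute errors on the volatility (standard deviation) scale.
        "MSE1": sum((math.sqrt(rv) - math.sqrt(fv)) ** 2 for rv, fv in pairs) / n,
        "MAE1": sum(abs(math.sqrt(rv) - math.sqrt(fv)) for rv, fv in pairs) / n,
        # Squared and absolute errors on the variance scale.
        "MSE2": sum((rv - fv) ** 2 for rv, fv in pairs) / n,
        "MAE2": sum(abs(rv - fv) for rv, fv in pairs) / n,
        # Gaussian quasi-likelihood loss.
        "QLIKE": sum(math.log(fv) + rv / fv for rv, fv in pairs) / n,
        # Squared log ratio of proxy to forecast.
        "R2LOG": sum(math.log(rv / fv) ** 2 for rv, fv in pairs) / n,
    }
```

For example, `predictive_scores([4.0, 1.0], [1.0, 1.0])` gives MSE1 = 0.5, MAE1 = 0.5, MSE2 = 4.5, MAE2 = 1.5 and QLIKE = 2.5. All six are losses, which is consistent with lower values indicating better forecasts.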
In the folder RealRECH_norm, the folder SMC_for_LSTM_RealGARCH contains nine code files. The code files LSTM_RealGARCH_DataAnneal, LSTM_RealGARCH_LikAnneal and LSTM_RealGARCH_LikDataAnneal implement Sequential Monte Carlo (SMC) sampling with likelihood annealing or data annealing. The code files LSTM_RealGARCH_llh and LSTM_RealGARCH_llh_conditional calculate the indicator llh, which we use to compare the in-sample performance of the two models. The code file LSTM_RealGARCH_logPriors is used to calculate other in-sample indicators such as beta and gamma. The code file LSTM_RealGARCH_one_step_forecast performs one-step-ahead out-of-sample volatility prediction and can generate the six out-of-sample indicators: MSE1, MSE2, MAE1, MAE2, QLIKE and R2LOG. The code file LSTM_RealGARCH_residual_analysis computes descriptive statistics, and the code file SMC_LSTM_RealGARCH_run combines the three SMC sampling code files. All of these code files perform the analysis with the LSTM-RealGARCH (RealRECH) model.
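To illustrate what SMC sampling with likelihood annealing does, here is a minimal Python sketch on a toy problem: a standard normal prior on a single parameter theta and observations drawn from N(theta, 1). It is not the authors' MATLAB implementation and does not involve the RealGARCH or RealRECH likelihood. The function name, the linear temperature schedule, the multinomial resampling and the single random-walk Metropolis move per stage are all simplifying assumptions.

```python
import math
import random


def smc_likelihood_annealing(data, n_particles=500, n_temps=10, step=0.5, seed=0):
    """Toy SMC sampler: N(0, 1) prior on theta, data assumed N(theta, 1)."""
    rng = random.Random(seed)

    def log_prior(theta):
        return -0.5 * theta * theta

    def log_lik(theta):
        return -0.5 * sum((y - theta) ** 2 for y in data)

    # Start from draws of the prior (temperature 0).
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    temps = [k / n_temps for k in range(n_temps + 1)]

    for prev, cur in zip(temps, temps[1:]):
        # Reweight: the incremental weight is likelihood ** (cur - prev).
        logw = [(cur - prev) * log_lik(th) for th in particles]
        top = max(logw)
        weights = [math.exp(lw - top) for lw in logw]

        # Resample (multinomial) so that all particles carry equal weight again.
        particles = rng.choices(particles, weights=weights, k=n_particles)

        # Move: one random-walk Metropolis step targeting prior * likelihood ** cur.
        moved = []
        for th in particles:
            prop = th + rng.gauss(0.0, step)
            log_ratio = (log_prior(prop) + cur * log_lik(prop)
                         - log_prior(th) - cur * log_lik(th))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                th = prop
            moved.append(th)
        particles = moved

    # At temperature 1 the particles approximate the posterior.
    return particles
```

In this toy model the posterior is available in closed form (mean equal to the sum of the observations divided by n + 1), so the particle average can be checked against it. Data annealing follows the same reweight, resample and move pattern but adds observations one batch at a time instead of raising the temperature.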
In the folder RealRECH_norm, the folder SMC_for_RealGARCH contains nine code files. The code files RealGARCH_DataAnneal, RealGARCH_LikAnneal and RealGARCH_LikDataAnneal implement Sequential Monte Carlo (SMC) sampling with likelihood annealing or data annealing. The code files RealGARCH_llh and RealGARCH_llh_conditional calculate the indicator llh, which we use to compare the in-sample performance of the two models. The code file RealGARCH_logPriors is used to calculate other in-sample indicators such as beta and gamma. The code file RealGARCH_one_step_forecast performs one-step-ahead out-of-sample volatility prediction and can generate the six out-of-sample indicators: MSE1, MSE2, MAE1, MAE2, QLIKE and R2LOG. The code file RealGARCH_residual_analysis computes descriptive statistics, and the code file SMC_RealGARCH_run combines the three SMC sampling code files. All of these code files perform the analysis with the RealGARCH model.
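The one-step forecast in a RealGARCH model comes from its variance recursion. The Python sketch below assumes the widely used log-linear RealGARCH(1,1) specification, in which log h_t = omega + beta * log h_{t-1} + gamma * log x_{t-1}, with h_t the conditional variance and x_t the realized measure. Whether RealGARCH_one_step_forecast uses exactly this form is an assumption, and the function name, the initial value `h0` and any parameter values passed in are hypothetical.

```python
import math


def realgarch_variance_path(realized, omega, beta, gamma, h0):
    """Filter conditional variances from realized measures x_1..x_T.

    Returns [h_1, ..., h_{T+1}], where h_1 = h0 and the last element is the
    one-step-ahead forecast made after observing x_T.
    """
    log_h = math.log(h0)
    path = [h0]
    for x in realized:
        # Log-linear GARCH equation driven by the lagged realized measure.
        log_h = omega + beta * log_h + gamma * math.log(x)
        path.append(math.exp(log_h))
    return path
```

For example, with omega = 0, beta = 0 and gamma = 1 the recursion simply carries the last realized measure forward, so `realgarch_variance_path([2.0, 3.0], 0.0, 0.0, 1.0, 1.0)` returns values close to 1, 2 and 3. In practice the parameters would come from the estimated posterior rather than being fixed by hand.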
The code files analysis and analysis_llhs contain the code for the in-sample analysis, and the code file run_func contains the code for the out-of-sample prediction. We combined the two in-sample code files with the other in-sample code in the folders SMC_for_LSTM_RealGARCH and SMC_for_RealGARCH to conduct the in-sample analysis, and combined run_func with the other out-of-sample code in the same folders to conduct the out-of-sample analysis. All analyses were conducted in MATLAB.
The high-frequency data used in our research come from the Wind Financial Terminal at Southwestern University of Finance and Economics (SWUFE). SWUFE has purchased access to the Wind Database and has Wind terminals on campus, which allowed us to download the data needed for our research. Currently, there are no free public channels for accessing high-frequency data on Chinese stock indices. Researchers can either use the Wind Financial Terminal for paid access or purchase the data through Chinese exchanges or brokers.
