Estimation of reinforced urn processes under left-truncation and right-censoring

Cite this dataset

Souto Arias, Luis Antonio; Cirillo, Pasquale; Oosterlee, Cornelis W. (2022). Estimation of reinforced urn processes under left-truncation and right-censoring [Dataset]. Dryad.


We propose a nonparametric estimator for bivariate left-truncated and right-censored (LTRC) observations that combines the Expectation-Maximization (EM) algorithm and the Reinforced Urn Process (RUP). The resulting Expectation-Reinforcement (ER) algorithm allows for the inclusion of experts' knowledge in the form of a prior distribution, thus belonging to the class of Bayesian models. This can be relevant in applications where the data is incomplete, due to biases in the sampling process, as in the case of left-truncation and right-censoring. With this new approach, the distribution of the truncation variables is also recovered, granting further insight into those biases, and playing an important role in applications like prevalent cohort studies. The estimators are tested numerically using artificial and empirical datasets and compared with other methodologies such as copula models and the Kaplan-Meier estimator.


The simulated data were generated by the authors, while the empirical Canadian data were obtained from Prof. E. W. Frees, who may be contacted for access to those data. This dataset therefore contains only the synthetic data, not the empirical data.

The synthetic data consist of 10,000 bivariate left-truncated and right-censored observations. Each observation contains the two (possibly censored) observed values (X and Y), the two truncation values (TX and TY), the two censoring indicators deltaX and deltaY (with the convention: 1 for uncensored, 0 for censored), and finally the difference TY-TX. This makes a total of 7 variables per observation. Note that only the first 6 variables are independent; the last is fully determined by TY and TX.
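As a minimal sketch of the record layout described above, the snippet below assembles toy observations from six independent variables and derives the seventh column. The placeholder distributions are illustrative only and are not the ones used to produce the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# hypothetical placeholder values for the six independent variables
TX = rng.uniform(0, 1, n)              # truncation values
TY = rng.uniform(0, 1, n)
X = TX + rng.exponential(1.0, n)       # observed values exceed their truncation points
Y = TY + rng.exponential(1.0, n)
deltaX = rng.integers(0, 2, n)         # censoring indicators: 1 = uncensored, 0 = censored
deltaY = rng.integers(0, 2, n)

# the seventh column is fully determined by TY and TX
data = np.column_stack([X, Y, TX, TY, deltaX, deltaY, TY - TX])
```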

The distributions of each variable, as well as the specific censoring and truncation mechanisms, can be found in the generating script. Due to differences in the pseudo-random sequences, running that script may yield a different dataset than the one we provide, but because the underlying distributions are the same, the statistical properties should match up to numerical error.
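For readers unfamiliar with how left-truncation and right-censoring interact, the sketch below simulates a generic bivariate LTRC sample of the kind described here: a pair enters the sample only if both latent lifetimes exceed their truncation points, and each observed value is then the minimum of the lifetime and an independent censoring time. All distributions (exponential lifetimes, uniform truncation) are placeholders, not those of the actual generating script:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_ltrc(n):
    """Draw n bivariate LTRC observations (placeholder distributions)."""
    out = []
    while len(out) < n:
        # latent lifetimes (exponential is a placeholder choice)
        x, y = rng.exponential(2.0), rng.exponential(2.0)
        # truncation variables
        tx, ty = rng.uniform(0, 1), rng.uniform(0, 1)
        # left truncation: the pair is observed only if both lifetimes
        # exceed their truncation points; otherwise it never enters the sample
        if x <= tx or y <= ty:
            continue
        # right censoring by independent censoring times (also placeholders)
        cx, cy = tx + rng.exponential(3.0), ty + rng.exponential(3.0)
        dx, dy = int(x <= cx), int(y <= cy)   # 1 = uncensored, 0 = censored
        # columns: X, Y, TX, TY, deltaX, deltaY, TY-TX
        out.append((min(x, cx), min(y, cy), tx, ty, dx, dy, ty - tx))
    return np.array(out)

data = sample_ltrc(100)
```

Note that the rejection step is what biases the observed sample relative to the latent population; recovering the truncation distribution, as the ER algorithm does, quantifies exactly this bias.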

The code to generate the simulated data is available in the public repository, as well as in the attached code (specifically, the Python script).

Usage notes

The main routines are implemented in C++, while the data generation and the post-analysis of the results are written in Python. No special libraries are required to run the Python code beyond the standard scientific stack (numpy, scipy, matplotlib, etc.). A small part of the C++ code requires the Boost library, but it amounts to only a few lines and can easily be rewritten to remove that dependency.


H2020-EU.1.3.1. MSCA-ITN- 2018, Award: 813261