Data from: Building trust takes time: Limits to arbitrage for blockchain-based assets

Cite this dataset

Voigt, Stefan; Hautsch, Nikolaus; Scheuch, Christoph (2024). Data from: Building trust takes time: Limits to arbitrage for blockchain-based assets [Dataset]. Dryad. https://doi.org/10.5061/dryad.q2bvq83rn

Abstract

The dataset contains all historical order book snapshots and blockchain network information used to generate the results for the paper "Building Trust takes Time".

A blockchain replaces central counterparties with time-consuming consensus protocols to record the transfer of ownership. This settlement latency slows cross-exchange trading, exposing arbitrageurs to price risk. Off-chain settlement, instead, exposes arbitrageurs to costly default risk. We show with Bitcoin network and order book data that cross-exchange price differences coincide with periods of high settlement latency, asset flows chase arbitrage opportunities, and price differences across exchanges with low default risk are smaller. Blockchain-based trading thus faces a dilemma: reliable consensus protocols require time-consuming settlement latency, leading to arbitrage limits. Circumventing such arbitrage costs is possible only by reinstalling trusted intermediation, which mitigates default risk.

README: Data to Replicate the paper Building Trust Takes Time: Limits to Arbitrage for Blockchain-Based Assets

https://doi.org/10.5061/dryad.q2bvq83rn

We provide all datasets required to replicate the paper "Building Trust takes Time: Limits to Arbitrage for Blockchain-based Assets". A description of the data sources and preprocessing steps is provided in the paper. All code to generate the results is available on https://github.com/voigtstefan/building_trust_takes_time.

Description of the data and file structure

We provide three types of cryptocurrency-related data:

  1. Centralized crypto-exchange (CEX) characteristics, which have been collected manually.
  2. High-frequency order book information. The data has been retrieved by fetching order book snapshots from major centralized crypto exchanges (CEXes) at one-minute intervals from 2018 to 2019. We provide the entire order book history across the exchanges with this dataset.
  3. Corresponding information on the state of the Bitcoin blockchain, for instance, the number of outstanding transactions at every point in time. The data has been used to analyze arbitrage activity across CEXes in relation to the time it takes for validators to execute cross-CEX transactions. Code to replicate the data processing parts is publicly available on GitHub: www.github.com/voigtstefan/building-trust-takes-time

The data contains 14 files.

Exchange characteristics

We provide hand-collected exchange characteristics in the file exchange_characteristics.csv. Columns: exchange, only_crypto (does the exchange only allow crypto transactions?), btc_withdrawal (bitcoin withdrawal fees), maker_fee / taker_fee (trading costs), margin_trading (is margin trading allowed?), us_citizens (are US citizens allowed to use the exchange?), no_of_confirmations (the required number of confirmations before the exchange considers the transaction valid), location/region (region where the headquarters is located), tether (does the exchange trade BTC versus tether or versus USD?), rating_quantitative (exchange rating).

Bitcoin order book data

Next, we collected Bitcoin order book data to investigate price differences across a large sample of CEXes at high frequencies. We gathered order book information from the application programming interfaces (APIs) of the 16 largest CEXes in terms of trading volume in January 2018 that feature BTC versus US Dollar. We retrieved all open buy and sell limit orders for the first 25 order book levels at one-minute intervals from January 1, 2018, to October 31, 2019. We derived the following files from this data:

  1. best_bids_n_asks.rds Contains the collection of minute-level best bid and ask prices. Columns: exchange, ts (timestamp), bid / ask (best prevailing buy or sell price).
  2. spotvolas.rds We used the mid quotes derived from the file best_bids_n_asks.rds to estimate minute-level spot volatility for each CEX. The file contains these estimates. Columns: ts (timestamp), exchange, spotvola (in basis points).
  3. arbitrage_data.rds Contains a collection of price differences across exchanges. Columns: buy_side / sell_side (indicators of the exchange pairs), ts (timestamp), delta (price difference in basis points), max_q (profit-maximizing trading quantity), spotvola (estimated spot volatility), no_of_confirmations (the required number of confirmations before the exchange considers the transaction valid), delta_q (price difference computed using max_q, the optimal trading quantity), dollar_return (arbitrage profit in percent), q (arbitrage profit in Bitcoin), f (fee level), boundary (estimated arbitrage boundary [from arbitrage_boundaries.rds]).
  4. orderbook_data.sqlite Contains a table called "orderbook_data" with the collection of all minute-level order book snapshots for each exchange during the period from January 1, 2018, to October 31, 2019. Columns: exchange, ts (timestamp), side (ask or bid quote of the order book), level (level of the order book, ranging from best price [level = 1] until level = 20), price (quoted price in USD), size (quoted volume in BTC). Venues are active 24/7, thus there are 24 * 60 = 1440 snapshots per CEX and day.
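
As a rough illustration of how the order book table relates to the derived files, the sketch below builds a tiny in-memory SQLite table with the column names listed above, extracts the level-1 quotes, and computes a mid quote and a cross-exchange price difference in basis points. Several things here are assumptions and not taken from the dataset: the exchange names and prices are made up, the side column is assumed to hold the strings 'bid' and 'ask', timestamps are assumed to be stored as text, and the price difference is assumed to be the sell-side bid relative to the buy-side ask. The paper and the repository define the actual computation.

```python
import sqlite3

# Tiny in-memory stand-in for orderbook_data.sqlite.
# Table and column names follow the README; all rows are made-up placeholders.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orderbook_data ("
    "exchange TEXT, ts TEXT, side TEXT, level INTEGER, price REAL, size REAL)"
)
rows = [
    ("exchange_a", "2018-01-01 00:00:00", "bid", 1, 9990.0, 0.5),
    ("exchange_a", "2018-01-01 00:00:00", "ask", 1, 10000.0, 0.4),
    ("exchange_a", "2018-01-01 00:00:00", "ask", 2, 10005.0, 1.0),
    ("exchange_b", "2018-01-01 00:00:00", "bid", 1, 10050.0, 0.3),
    ("exchange_b", "2018-01-01 00:00:00", "bid", 2, 10040.0, 0.8),
    ("exchange_b", "2018-01-01 00:00:00", "ask", 1, 10060.0, 0.2),
]
con.executemany("INSERT INTO orderbook_data VALUES (?, ?, ?, ?, ?, ?)", rows)


def best_quotes(con, ts):
    """Return {exchange: (best_bid, best_ask)} from the level-1 quotes at ts."""
    query = (
        "SELECT exchange, "
        "MAX(CASE WHEN side = 'bid' THEN price END), "
        "MAX(CASE WHEN side = 'ask' THEN price END) "
        "FROM orderbook_data WHERE ts = ? AND level = 1 GROUP BY exchange"
    )
    return {ex: (bid, ask) for ex, bid, ask in con.execute(query, (ts,))}


def delta_bps(quotes, buy_side, sell_side):
    """Assumed definition: sell-side bid relative to buy-side ask, in basis points."""
    ask_buy = quotes[buy_side][1]
    bid_sell = quotes[sell_side][0]
    return (bid_sell - ask_buy) / ask_buy * 1e4


quotes = best_quotes(con, "2018-01-01 00:00:00")
mid_a = sum(quotes["exchange_a"]) / 2  # mid quote of exchange_a
delta = delta_bps(quotes, "exchange_a", "exchange_b")
print(quotes, mid_a, delta)
```

With the real file, the same query could be run after replacing ":memory:" with the path to orderbook_data.sqlite, once the actual encoding of ts and side has been checked.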

Blockchain information

To compute the arbitrage boundaries, we rely on information on the state of the blockchain. To quantify the settlement latency for Bitcoin, we gather transaction-specific information from blockchain.com, a popular provider of Bitcoin network data. We download all blocks validated between January 1, 2018, and October 31, 2019, and extract information about all validated transactions on the blockchain in this period. Each transaction contains a unique identifier, a timestamp of the initial announcement to the network, and, among other details, the fee (per byte) the transaction initiator offers validators to validate the transaction. The following information was aggregated from the mentioned source:

  1. bitcoin_blocks.rds Contains information about fees and executed transactions of each verified block on the Bitcoin blockchain. Columns: height (block number), hash (Bitcoin block hash-ID), time, main_chain (TRUE if the block is recorded in the main chain), ts (formatted timestamp).
  2. latency_hourly.rds Contains information about the latency on the Bitcoin blockchain. Columns: ts (timestamp), mean_latency / median_latency (mean and median latency until validation, in minutes), sd_latency (standard deviation of the latency until validation, in minutes).
  3. latency_duration_model_parameters.rds Contains the estimated time-varying parameters of a duration model to predict the time until validation. Columns: model (exponential / gamma), type (restricted / unrestricted), parameters (intercept / fee / mempool size / log-likelihood), value, convergence (0 if the optimization procedure converged [=no errors]), mse_ins / mse_oos (mean-squared prediction error in- and out-of-sample), date.
  4. mempool_fees.rds Contains block-level fees paid to validators on the Bitcoin blockchain. Columns: ts (timestamp), tx_fee_per_byte (90% quantile of the current transaction fees in the mempool).
  5. mempool_size.rds Contains block-level size of transactions verified on the Bitcoin blockchain. Columns: ts (timestamp), size (size of the mempool in bytes), number (size of the mempool in number of transactions waiting for verification).
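
To make the latency columns concrete, the following sketch computes hourly mean, median, and standard deviation of the time until validation, in minutes, from pairs of announcement and validation timestamps. It is a minimal illustration under stated assumptions: the transaction records are hypothetical, timestamps are taken to be UNIX seconds in UTC, transactions are assigned to the hour of their announcement, and the sample standard deviation is used. The authors' aggregation may differ in any of these choices.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, median, stdev

# Hypothetical records: (announcement time, validation time) in UNIX seconds.
transactions = [
    (1514764800, 1514765400),  # announced 2018-01-01 00:00 UTC, waited 10 minutes
    (1514765100, 1514766300),  # announced 00:05, waited 20 minutes
    (1514766600, 1514768400),  # announced 00:30, waited 30 minutes
    (1514768700, 1514769000),  # announced 01:05, waited 5 minutes
    (1514769300, 1514770200),  # announced 01:15, waited 15 minutes
]


def hourly_latency(transactions):
    """Group latencies (in minutes) by announcement hour and summarize them."""
    by_hour = defaultdict(list)
    for announced, validated in transactions:
        hour = datetime.fromtimestamp(announced, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        by_hour[hour].append((validated - announced) / 60)
    return {
        hour: {
            "mean_latency": mean(values),
            "median_latency": median(values),
            # The sample standard deviation needs at least two observations.
            "sd_latency": stdev(values) if len(values) > 1 else float("nan"),
        }
        for hour, values in by_hour.items()
    }


stats = hourly_latency(transactions)
for hour in sorted(stats):
    print(hour.isoformat(), stats[hour])
```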

Helper files specific to the paper "Building Trust takes Time"

The data above has been used to analyse the relationship between demand for blockchain validation services and cross-exchange price differences. For that purpose, we derived a number of additional files to make replication or extension of the paper more convenient.

Arbitrage boundaries

  1. arbitrage_boundaries.rds Contains the computed minute-level arbitrage boundaries. Columns: exchange, ts (timestamp), spotvola (minute-level volatility), no_of_confirmations (the required number of confirmations before the exchange considers the transaction valid), date, alpha (the parameter alpha of the estimated duration model; consult the paper for details on the estimation), beta_fee / beta_mempool / beta_constant (the parameter vector beta of the estimated duration model; consult the paper for details on the estimation), tx_fee_per_byte_q9 (the 90% quantile of the transaction fees per byte in the current block), unconfirmed_tx (number of unconfirmed transactions in the mempool), boundary_crra_2 (the estimated arbitrage boundary with risk aversion parameter 2), boundary_no_vola (the estimated arbitrage boundary under the assumption that the volatility of the settlement latency is zero), boundary_0blocks (the estimated arbitrage boundary under the assumption that the required number of confirmations is 0).

Cross-CEX flows

Since exchanges are reluctant to provide the identity of their customers, it is virtually impossible to identify actual transactions by arbitrageurs.
However, we take the overall transfer of assets between two different exchanges as a measure of the trading activity of cross-exchange arbitrageurs.
For each exchange, we thus collect a list of addresses likely under the control of the exchanges in our sample.
We gathered 62.6 million unique exchange addresses, which allowed us to identify 3.9 million cross-exchange transactions with an average daily volume of USD 72 million in our sample period.

The data is provided in 2 files.

  1. cross_exchange_flows.rds Identifies cross-exchange Bitcoin transactions for every Bitcoin block. Columns: tx_hash (unique ID for each transaction), ts (timestamp), from / to (originating exchange and receiving exchange), tx_size (size of the transaction in bytes), tx_fee (fees of the transaction in Satoshis), tx_lock_time (lock time for transaction if submitted by originator), block_hash (unique ID for each block), block_time (UNIX timestamp for time of block validation), tx_type (according to blockchain information, output for cross-exchange flows), volume (size of the transaction in Satoshis), tx_address (wallet address of originator [used to identify "from" exchange]).
  2. clean_flows_and_balances_hourly.rds Aggregated hourly flow data and the resulting balance of Bitcoin at each exchange. Columns: timestamp (hour), exchange, net_flow (number of incoming Satoshis net of outgoing Satoshis), balance (resulting balance in Satoshis at each exchange).
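
The step from transaction-level flows to hourly net flows and balances can be sketched as follows. This is only an illustration of the aggregation described above: the flow records and exchange names are hypothetical, hours are represented as sortable text labels, and balances start from zero, so they are relative to an unknown initial holding. How the authors initialize and clean the balances is not stated here.

```python
from collections import defaultdict

SATOSHI_PER_BTC = 100_000_000  # 1 BTC = 10^8 Satoshis

# Hypothetical flows: (hour, from exchange, to exchange, volume in Satoshis).
flows = [
    ("2018-01-01 00:00", "exchange_a", "exchange_b", 150_000_000),
    ("2018-01-01 00:00", "exchange_b", "exchange_a", 50_000_000),
    ("2018-01-01 01:00", "exchange_a", "exchange_b", 25_000_000),
]


def hourly_net_flows(flows):
    """Incoming minus outgoing Satoshis per (hour, exchange)."""
    net = defaultdict(int)
    for hour, source, target, volume in flows:
        net[(hour, target)] += volume
        net[(hour, source)] -= volume
    return dict(net)


def running_balances(net_flows, initial=None):
    """Cumulate net flows over time; balances start at `initial` (default 0)."""
    balance = defaultdict(int, initial or {})
    result = {}
    # ISO-like hour labels sort chronologically as plain strings.
    for hour, exchange in sorted(net_flows):
        balance[exchange] += net_flows[(hour, exchange)]
        result[(hour, exchange)] = balance[exchange]
    return result


net_flow = hourly_net_flows(flows)
balance = running_balances(net_flow)
for key in sorted(balance):
    print(key, net_flow[key], balance[key] / SATOSHI_PER_BTC)
```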

Regression file

The main results of the paper "Building Trust Takes Time" rely on two regressions that combine all of the information above. For the reader's convenience, we offer this file for direct use.

exchange_pair_hourly_regression_sample.rds Contains only columns that have been introduced before. Columns: hour / pair (identifying information for regression), buy_side / sell_side (identifying information of buy and sell market), delta (price difference in basis points), spotvola (spot volatility on the sell-side exchange, average for the hour), spread (bid-ask spread on the sell-side market), latency_median / latency_sd (latency parameters from the duration model), tether / margin_trading / business_accounts / region / rating_categorial / aa_rating sell and buy (exchange characteristics from file exchange_characteristics.csv, separated for buy- and sell-side exchange), balance_sell / balance_buy (balance in BTC on the sell-side / buy-side exchange), flow_volume (net cross-exchange volume during the hour), boundary_margin_sell / boundary_margin_buy (interaction term which is the product of the arbitrage boundary and the availability of margin trading), boundary_business_sell / boundary_business_buy (interaction term which is the product of the arbitrage boundary and the availability of business accounts), latency_variance / latency_variance_std (latency variance and standard deviation), latency_median_std (median latency scaled to exhibit zero sample mean), flow_volume_usd (net flows in USD).

Code/Software

All code to generate the results is available on https://github.com/voigtstefan/building_trust_takes_time.

Funding