Price prediction is notoriously difficult for any asset, but the price spread between two exchanges may be more tractable. Here we look into the price spread between two of the largest Bitcoin exchanges: Bitstamp and BTC-E.

Step 1: Import All Packages

import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy import stats as ss
from datetime import datetime as dt
from dateutil.relativedelta import relativedelta
from matplotlib import pyplot as plt
from matplotlib import dates as mdates
from statsmodels.graphics.api import qqplot

Step 2: Data Reading and Cleaning

Data source: http://api.bitcoincharts.com/v1/csv/

bitstamp = pd.read_csv('bitstampUSD.csv')
btce = pd.read_csv('btceUSD.csv')
print(bitstamp.shape)
print(btce.shape)
(11044600, 3)
(29688118, 3)
bitstamp.head()
1315922016 5.800000000000
0 1315922024 5.83
1 1315922029 5.90
2 1315922034 6.00
3 1315924373 5.95
4 1315924504 5.88

The files have no header row, so pandas reads the first observation as the column names (passing header=None to read_csv would avoid this). The columns are actually a Unix timestamp (10 digits), the price, and the volume.

bitstamp.columns = ['time', 'price', 'volume']
btce.columns = ['time', 'price', 'volume']
bitstamp.time = pd.to_datetime(bitstamp.time, unit='s')
btce.time = pd.to_datetime(btce.time, unit='s')
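As a quick check, unit='s' interprets those 10-digit integers as seconds since the Unix epoch; converting one of the timestamps from the head() output above:

```python
import pandas as pd

# unit='s' interprets the 10-digit integers as seconds since the Unix epoch (UTC).
ts = pd.to_datetime(1315922024, unit='s')
print(ts)  # 2011-09-13 13:53:44
```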

Here, as an illustration, we use only the price itself as the indicator.

bitstamp = bitstamp[['time','price']]
btce = btce[['time','price']]
bitstamp.head()
time price
0 2011-09-13 13:53:44 5.83
1 2011-09-13 13:53:49 5.90
2 2011-09-13 13:53:54 6.00
3 2011-09-13 14:32:53 5.95
4 2011-09-13 14:35:04 5.88

Here I simply add a "date" and an "hour" column, combine them into an hourly key, and compute the hourly average prices.

# use bracket assignment: attribute-style assignment (df.date = ...) sets an
# attribute on the DataFrame object, not a real column
bitstamp['date'] = bitstamp.time.dt.date.apply(lambda d: dt.strftime(d, '%Y-%m-%d'))
btce['date'] = btce.time.dt.date.apply(lambda d: dt.strftime(d, '%Y-%m-%d'))
bitstamp['hour'] = bitstamp.time.dt.hour.apply(lambda h: str(h).zfill(2) + ':00:00')
btce['hour'] = btce.time.dt.hour.apply(lambda h: str(h).zfill(2) + ':00:00')
bitstamp['date_hour'] = bitstamp.date + ' ' + bitstamp.hour
btce['date_hour'] = btce.date + ' ' + btce.hour
# group by the hourly key and average the price within each hour
bitstamp_hour = bitstamp.groupby(bitstamp.date_hour, as_index=True)['price'].mean()
btce_hour = btce.groupby(btce.date_hour, as_index=True)['price'].mean()
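An equivalent (and arguably simpler) way to get hourly averages is resample on a DatetimeIndex; a minimal sketch on a three-trade toy tape built from the values shown above:

```python
import pandas as pd

# Toy trade tape from the head() output above: two trades in the
# 13:00 hour, one in the 14:00 hour.
trades = pd.DataFrame({
    'time': pd.to_datetime([1315922024, 1315922029, 1315924373], unit='s'),
    'price': [5.83, 5.90, 5.95],
})

# resample on a DatetimeIndex replaces the manual date/hour string keys.
hourly = trades.set_index('time')['price'].resample('1H').mean()
print(hourly)
```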

Below I reindex the two datasets onto a common hourly grid so that their indices line up.

hours = pd.Series(pd.date_range(bitstamp_hour.index[0], bitstamp_hour.index[-1], freq='1H')).dt.strftime('%Y-%m-%d %H:%M:%S')
bitstamp_hour = bitstamp_hour.reindex(hours, method='ffill')
hours = pd.Series(pd.date_range(btce_hour.index[0], btce_hour.index[-1], freq='1H')).dt.strftime('%Y-%m-%d %H:%M:%S')
btce_hour = btce_hour.reindex(hours, method='ffill')
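The effect of reindexing with method='ffill' is to carry the last observed hourly price forward over hours with no trades; a toy example:

```python
import pandas as pd

# Two observed hourly prices with a two-hour gap in between.
s = pd.Series([5.0, 6.0],
              index=pd.to_datetime(['2011-09-13 13:00:00', '2011-09-13 16:00:00']))

# Reindex onto the full hourly grid; ffill carries the last price over gaps.
full = s.reindex(pd.date_range(s.index[0], s.index[-1], freq='1H'), method='ffill')
print(full)  # 14:00 and 15:00 are filled with 5.0
```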
dataset = pd.concat([bitstamp_hour, btce_hour], axis=1).dropna()
dataset.columns = ['bitstamp', 'btce']
dataset.index = pd.to_datetime(dataset.index)
dataset.head()
bitstamp btce
2011-09-13 13:00:00 5.9100 5.601000
2011-09-13 14:00:00 5.8675 5.981700
2011-09-13 15:00:00 5.6500 5.563333
2011-09-13 16:00:00 5.6500 5.481429
2011-09-13 17:00:00 5.6500 5.316154

Step 3: Data Analysis

%config InlineBackend.figure_format = 'retina'
plt.close()
fig = plt.figure(figsize=(16,9))
ax = fig.add_subplot(111)
ax.plot(dataset.bitstamp, lw=0.8, c='b', label='bitstamp')
ax.plot(dataset.btce, lw=0.8, c='r', label='btce')
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y\n %b'))
plt.legend(loc='upper left')
plt.tight_layout()
plt.title('historical prices')
plt.show()

We restrict attention to the data between August 2014 and February 2015: everything before January 2015 is the training set, and the following month is held out as the test set.

start_date = pd.to_datetime('2014-08-01 00:00:00')
mid_date = pd.to_datetime('2015-01-01 00:00:00')
end_date = pd.to_datetime('2015-02-01 00:00:00')
train = dataset[dataset.index >= start_date]
train = train[train.index < mid_date]
test = dataset[dataset.index >= mid_date]
test = test[test.index < end_date]
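The same split can also be written with .loc label slicing; note that on a DatetimeIndex .loc slices are inclusive at both ends, which is why the comparisons above exclude mid_date explicitly:

```python
import pandas as pd

s = pd.Series(range(5), index=pd.date_range('2014-12-31 22:00', periods=5, freq='1H'))

# .loc label slices on a DatetimeIndex include BOTH endpoints, so the
# boundary hour must be dropped from one side explicitly.
train_part = s.loc[:'2015-01-01 00:00:00'].iloc[:-1]  # strictly before the boundary
test_part = s.loc['2015-01-01 00:00:00':]
print(len(train_part), len(test_part))  # 2 3
```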
%config InlineBackend.figure_format = 'retina'
plt.close()
fig = plt.figure(figsize=(16,9))
ax = fig.add_subplot(111)
ax.plot(train.bitstamp, lw=0.8, c='b', label='bitstamp')
ax.plot(train.btce, lw=0.8, c='r', label='btce')
ax.plot(test.bitstamp, lw=0.8, c='b')  # no label: avoids duplicate legend entries
ax.plot(test.btce, lw=0.8, c='r')
ax.axvline(x=mid_date, c='k', lw=0.8, ls='--')
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y\n %b'))
plt.legend(loc='upper left')
plt.tight_layout()
plt.title('historical prices')
plt.show()

train_price_spread = (train.bitstamp - train.btce).fillna(method='bfill')
train_price_spread.describe()
count    3672.000000
mean        5.159550
std         4.879880
min       -25.431669
25%         2.789031
50%         4.658016
75%         7.054407
max        49.270572
dtype: float64

Now let's plot the price spread.

%config InlineBackend.figure_format = 'retina'
plt.close()
fig = plt.figure(figsize=(16,9))
ax = fig.add_subplot(111)
ax.plot(train_price_spread, lw=0.8, c='k')
ax.axhline(train_price_spread.mean(), lw=0.8, ls='--', c='b', label=r'$\mu$')
ax.axhline(train_price_spread.std()*3, lw=0.8, ls='--', c='r', label=r'$\pm3\sigma$')
ax.axhline(-train_price_spread.std()*3, lw=0.8, ls='--', c='r')
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y\n %b'))
plt.tight_layout()
plt.legend(loc='upper left')
plt.title('price spread')
plt.show()

sm.stats.diagnostic.acorr_ljungbox(train_price_spread, lags=1)[1][0]  # p-value is tiny, rejecting the null of no autocorrelation, so the spread is not just white noise
0.0

Replace observations beyond the \(3\sigma\) boundaries and interpolate over them; those are treated as outliers.

train_price_spread_2 = train_price_spread.copy()  # copy() so the original series is not modified in place
mu = train_price_spread_2.mean()
sigma = train_price_spread_2.std()
outlier_idx = (train_price_spread_2 - mu).abs() > 3 * sigma
train_price_spread_2[outlier_idx] = np.nan
train_price_spread_2 = train_price_spread_2.interpolate().fillna(method='bfill').fillna(method='ffill')
train_price_spread_2.describe()
count    3672.000000
mean        5.156679
std         3.686989
min        -8.397757
25%         2.806244
50%         4.697575
75%         7.080410
max        19.488954
dtype: float64
%config InlineBackend.figure_format = 'retina'
plt.close()
fig = plt.figure(figsize=(16,9))
ax = fig.add_subplot(111)
ax.plot(train_price_spread_2, lw=0.8, c='k')
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y\n %b'))
plt.tight_layout()
plt.title('price spread (outlier eliminated)')
plt.show()

sm.tsa.adfuller(train_price_spread_2)[1]  # p-value small, so reject H0 of unit root. Stationary.
3.1396679727739154e-06

Smooth the data with a rolling mean over a 24-hour window.

train_price_spread_rolling_mean = train_price_spread_2.rolling(24).mean().fillna(method='bfill').fillna(method='ffill')
train_price_spread_rolling_mean.describe()
count    3672.000000
mean        5.161474
std         3.214953
min        -3.364170
25%         2.917758
50%         4.664176
75%         6.803972
max        16.700286
dtype: float64
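As a reminder of how rolling works, the first window-1 values are NaN until a full window is available (which is why the bfill above is needed):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
r = s.rolling(2).mean()  # needs a full 2-element window, so the first value is NaN
print(r.tolist())  # [nan, 1.5, 2.5, 3.5]
```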
%config InlineBackend.figure_format = 'retina'
plt.close()
fig = plt.figure(figsize=(16,9))
ax = fig.add_subplot(111)
ax.plot(train_price_spread_rolling_mean, lw=0.8, c='k')
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y\n %b'))
plt.tight_layout()
plt.title('price spread rolling mean (window: 24 hours)')
plt.show()

sm.tsa.adfuller(train_price_spread_rolling_mean)[1]  # reject H0 of unit root. Stationary.
8.4737772982965124e-05
%config InlineBackend.figure_format = 'retina'
plt.close()
fig = plt.figure(figsize=(16,9))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(train_price_spread_rolling_mean, lags=20, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(train_price_spread_rolling_mean, lags=20, ax=ax2)
plt.tight_layout()
plt.show()

The ACF decays slowly while the PACF cuts off after lag 3, so we try ARIMA(3,0,0) first; if the fit fails because the process is non-stationary, we fall back to ARIMA(3,1,0).

try:
    train_model = sm.tsa.ARIMA(train_price_spread_rolling_mean, order=(3, 0, 0))
    train_result = train_model.fit()
    print(train_result.summary())
except ValueError:
    print('ARIMA(3,0,0) not stationary, using ARIMA(3,1,0)\n')
    train_model = sm.tsa.ARIMA(train_price_spread_rolling_mean, order=(3, 1, 0))
    train_result = train_model.fit()
    print(train_result.summary())
ARIMA(3,0,0) not stationary, using ARIMA(3,1,0)

                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                    D.y   No. Observations:                 3671
Model:                 ARIMA(3, 1, 0)   Log Likelihood                4041.229
Method:                       css-mle   S.D. of innovations              0.080
Date:                Thu, 20 Apr 2017   AIC                          -8072.458
Time:                        21:04:19   BIC                          -8041.417
Sample:                    08-01-2014   HQIC                         -8061.407
                         - 12-31-2014                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        9.24e-05      0.009      0.010      0.992      -0.018       0.018
ar.L1.D.y      0.7278      0.016     44.224      0.000       0.696       0.760
ar.L2.D.y      0.0560      0.020      2.750      0.006       0.016       0.096
ar.L3.D.y      0.0748      0.016      4.546      0.000       0.043       0.107
                                    Roots                                    
=============================================================================
                 Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.1284           -0.0000j            1.1284           -0.0000
AR.2           -0.9381           -3.3109j            3.4412           -0.2939
AR.3           -0.9381           +3.3109j            3.4412            0.2939
-----------------------------------------------------------------------------
sm.tsa.adfuller(train_price_spread_rolling_mean)[1]  # p-value small, reject H0 of unit root. Stationary. (Don't need to take differences)
8.4737772982965124e-05
sm.stats.durbin_watson(train_result.resid)  # DW statistic close to 2: autocorrelation well eliminated (p=3 is enough)
2.0080126845985315
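The Durbin-Watson statistic is roughly 2(1 - r1), where r1 is the lag-1 autocorrelation of the residuals; for uncorrelated noise it sits near 2:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)

# DW ~ 2(1 - r1); uncorrelated residuals give a statistic near 2,
# while values near 0 (or 4) signal positive (or negative) autocorrelation.
dw = durbin_watson(rng.standard_normal(1000))
print(dw)
```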

Normality test.

ss.normaltest(train_result.resid)
NormaltestResult(statistic=105.33826651992865, pvalue=1.3368603987900652e-23)
%config InlineBackend.figure_format = 'retina'
plt.close()
fig = plt.figure(figsize=(16,9))
ax = fig.add_subplot(111)
fig = qqplot(train_result.resid, line='q', ax=ax, fit=True)
ax.set_xlim([-4,4])
ax.set_ylim([-4,4])
plt.tight_layout()
plt.show() # severe fat tails relative to the normal distribution

Step 4: Prediction

test_price_spread = test.bitstamp - test.btce
test_price_spread_rolling_mean = test_price_spread.rolling(24).mean().fillna(method='bfill').fillna(method='ffill')

Since the fitted model is ARIMA(3,1,0), it can be written as

\[ \Delta y_t=\phi_0 + \phi_1 \Delta y_{t-1} + \phi_2 \Delta y_{t-2} + \phi_3 \Delta y_{t-3} + \epsilon_t \]

which is equivalent to

\[ y_t - y_{t-1}=\phi_0 + \phi_1 y_{t-1} - \phi_1 y_{t-2} + \phi_2 y_{t-2} - \phi_2 y_{t-3} + \phi_3 y_{t-3} - \phi_3 y_{t-4} + \epsilon_t, \]

and this simplifies to

\[ y_t=\phi_0 + (1 + \phi_1) y_{t-1} + (\phi_2-\phi_1) y_{t-2} + (\phi_3-\phi_2) y_{t-3} - \phi_3 y_{t-4} + \epsilon_t. \]

So \(\hat y_t\) is given by

\[ \hat y_t=\hat\phi_0 + (1 + \hat\phi_1) y_{t-1} + (\hat\phi_2-\hat\phi_1) y_{t-2} + (\hat\phi_3-\hat\phi_2) y_{t-3} - \hat\phi_3 y_{t-4}. \]
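A quick sanity check on this expansion: because the left-hand side is a first difference, the weights on \(y_{t-1},\dots,y_{t-4}\) must telescope to 1 for any \(\phi\) (illustrated here with arbitrary coefficients, not the fitted ones):

```python
# Arbitrary illustrative AR coefficients -- NOT the fitted values above.
phi1, phi2, phi3 = 0.7, 0.06, 0.07

# Weights on y_{t-1} .. y_{t-4} from the ARIMA(3,1,0) expansion.
weights = [1 + phi1, phi2 - phi1, phi3 - phi2, -phi3]
print(sum(weights))  # algebraically telescopes to 1
```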

p = train_result.params
phi = [p[0], 1 + p[1], p[2] - p[1], p[3] - p[2], -p[3]]
def yhat(ypast):
    # compute the next y from the four previous values
    y4, y3, y2, y1 = ypast
    return phi[0] + phi[1] * y1 + phi[2] * y2 + phi[3] * y3 + phi[4] * y4
fit = train_price_spread_rolling_mean * np.nan
for i in range(4, len(train_price_spread_rolling_mean)):
    fit[i] = yhat(train_price_spread_rolling_mean[i-4:i].values)
fit = fit.fillna(method='bfill')

pred = test_price_spread_rolling_mean * np.nan
for i in range(4, len(test_price_spread_rolling_mean)):
    pred[i] = yhat(test_price_spread_rolling_mean[i-4:i].values)
pred = pred.fillna(method='bfill')
fcst = train_result.forecast(len(test))
forecast = pd.Series(fcst[0], index=test.index)
conf_interval = pd.DataFrame(fcst[2], index=test.index)
%config InlineBackend.figure_format = 'retina'
plt.close()
fig = plt.figure(figsize=(16,9))
ax = fig.add_subplot(111)
ax.plot(train_price_spread_rolling_mean, lw=1.6, c='k', label='true data')
ax.plot(test_price_spread_rolling_mean, lw=1.6, c='k')
ax.plot(fit, 'rx-', markevery=30, lw=0.8, label='in-sample fitted data')
ax.plot(pred, 'bx-', markevery=30, lw=0.8, label='out-of-sample daily prediction')
ax.plot(forecast, 'b--', lw=0.8, label='out-of-sample one-month forecast')
ax.axvline(mid_date, color='k', ls='--', lw=0.8)
ax.fill_between(conf_interval.index, conf_interval.iloc[:, 0], conf_interval.iloc[:, 1], color='k', alpha=.1)
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y\n %b'))
plt.tight_layout()
plt.legend(loc='upper left')
plt.title('price spread rolling mean (window: 24 hours)')
plt.show()

print('MSE training is',np.mean((train_price_spread_rolling_mean - fit)**2))
MSE training is 0.006472583700667411
print('MSE testing is',np.mean((test_price_spread_rolling_mean - pred)**2))
MSE testing is 0.008814919494937557

Suspiciously good in the MSE sense. Maybe there are some problems... Let's look closer.
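Part of the answer: a 24-hour rolling mean barely moves from one hour to the next, so even a naive persistence forecast (\(\hat y_t = y_{t-1}\)) achieves a tiny one-step MSE. A sketch on synthetic data (purely illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A random-walk "spread" smoothed with a 24-hour rolling mean, as in the post.
spread = pd.Series(np.cumsum(rng.standard_normal(2000)))
smooth = spread.rolling(24).mean().dropna()

# Naive persistence: predict each value by the previous one.
persistence_mse = ((smooth - smooth.shift(1)) ** 2).dropna().mean()
print(persistence_mse, smooth.var())  # persistence MSE is far below the variance
```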

%config InlineBackend.figure_format = 'retina'
plt.close()
fig = plt.figure(figsize=(16,9))
ax = fig.add_subplot(111)
# ax.plot(train_price_spread_rolling_mean, lw=1.6, c='k', label='true data')
ax.plot(test_price_spread_rolling_mean, lw=1.6, c='k')
# ax.plot(fit, 'rx-', markevery=30, lw=0.8, label='in-sample fitted data')
ax.plot(pred, 'bx-', markevery=10, lw=0.8, label='out-of-sample daily prediction')
ax.plot(forecast, 'b--', lw=0.8, label='out-of-sample one-month forecast')
ax.axvline(mid_date, color='k', ls='--', lw=0.8)
ax.fill_between(conf_interval.index, conf_interval.iloc[:, 0], conf_interval.iloc[:, 1], color='k', alpha=.1)
ax.xaxis.set_major_locator(mdates.DayLocator(interval=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y\n %m%d'))
plt.tight_layout()
plt.legend(loc='upper left')
# plt.title('price spread rolling mean (window: 24 hours)')
plt.show()

TODO: Maximum Price Spread?

Instead of the price spread between two exchanges, a more practical (and, if successful, more valuable) indicator would be the maximum price spread across all exchanges that trade BTC against USD.

Also backtesting should be done (when I have time lol).