On many Kaggle public leaderboards, you will see an algorithm called "XGBoost", short for "eXtreme Gradient Boosting". It is an optimized implementation of the gradient boosting framework. The key difference between ordinary boosting models and XGBoost, as far as I can see, is that XGBoost allows for parallel processing without sacrificing accuracy. This makes large-scale machine learning (boosting is essentially an ensemble of many weak learners) feasible on personal computers, even on my 2015 MBP, and empowers people like me to apply it to more practical scenarios, e.g. stock price prediction, which is what we are going to talk about today. Highest tribute to Tianqi Chen and his great paper!

Performance

  • Backtest Interval: 20130101 - 20170101
  • Initial Capital: 1,000,000
  • Annualized Return (Strategy): TBD
  • Annualized Return (Benchmark): TBD
  • Sharpe Ratio: TBD
  • Maximum Drawdown: TBD

(Performance screenshot not available at present. Will be uploaded later.)

Intuition: Long-term vs. Short-term Prediction

I got the inspiration from this paper. The author raised an interesting and convincing point about stock price prediction: the long-term trend is always easier to predict than the short-term one. Again, let's take AAPL for example. Apple Inc. closed at $139.14 per share today, a slight rise of 0.33 percent. What is the expected percentage price change for, say, tomorrow? Let's also assume that you have access to all the historical data you want, regardless of the size of the dataset or other computational issues. What if a chief engineer suddenly decides to resign from the company and the news leaks instantly through social networks? What if Samsung or another mobile phone company launches its newest prototype right after your prediction? What if, just for fun, the new prototype of Samsung (or whatever) should explode during the presentation, on the spot? The randomness is so volatile that our prediction could even be the exact opposite of what actually happens. For long-term prediction, however, things get easier, since the overall relationship between Apple and Samsung is unlikely to change in the near future, and the general performance of AAPL in principle depends on how Apple Inc. behaves in the following months. So GIVEN THAT COOK IS NOT INSANE, BUT A RATHER CAUTIOUS CEO, we can expect a higher accuracy when predicting the average price change of AAPL over the next three months (60 trading days or so).

Notice again that we are not talking about predicting the close of the day three months from now. We are talking about the AVERAGE price change over the following three months. Now let's see what predictor we can use to carry out this idea.
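To make the target concrete, here is a minimal sketch of the "average price change over the next N days" idea in plain Python. The function names `mean_forward_change` and `to_labels` and the toy prices are made up for illustration; the strategy code further down builds its label with pandas and additionally discounts days that lie further in the future.

```python
def mean_forward_change(close, horizon):
    """Average change of the next `horizon` closes relative to today's close.

    Returns one value per day that still has `horizon` future closes available.
    """
    changes = []
    for t in range(len(close) - horizon):
        future = close[t + 1: t + 1 + horizon]
        mean_diff = sum(p - close[t] for p in future) / horizon
        changes.append(mean_diff / close[t])
    return changes


def to_labels(changes, ratio=0.0):
    """Binary label: 1 if the average forward change exceeds `ratio`, else 0."""
    return [int(c > ratio) for c in changes]


# toy closing prices, made up for illustration
close = [100.0, 102.0, 101.0, 99.0, 103.0]
changes = mean_forward_change(close, horizon=2)
labels = to_labels(changes)
print(changes)
print(labels)
```

Note that the last `horizon` days get no label at all, because their future is not known yet. That is exactly why the most recent bars can only be used for prediction, never for training.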

Tool: Technical Indicators & XGBoost Classification

We are using the classification algorithm in XGBoost. Regression is more intuitive but slower; binary classification is simply faster in this case, which allows for efficient backtesting later on. The prediction is based on a window of historical data including open, high, low, close and volume, along with 77 technical indicators that are computed right before the prediction. This is of course far from "all historical data" in the market, not to mention off-market rumors, social and political concerns, etc., but it is enough for illustration here. A good extension of this strategy is therefore to add these left-out indicators. For NLP and similar applications in stock price prediction, you can check my post about turning Mr. Trump's tweets into prediction indicators. I didn't actually write that project because I found a really mature application already on GitHub, but I will definitely rewrite it myself during the summer vacation. If you find the post unchanged even after that, please feel free to mail me and "harshly" urge me about the overdue task.
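As a rough illustration of how "a certain length of historical data" becomes one row of model input, the sketch below stacks the log of the current bar with the logs of the previous bars into a flat feature vector. The helper name `lagged_log_features` and the toy bars are assumptions for this example only; the strategy code does the same kind of thing with pandas `shift` and appends the technical indicators as extra columns. Taking logs assumes every value is strictly positive, so a day with zero volume would need special handling.

```python
import math


def lagged_log_features(bars, back_length):
    """Build one flat feature vector per day from the log of the current bar
    and the logs of the previous back_length - 1 bars.

    `bars` is a list of (open, high, low, close, volume) tuples, oldest first.
    Days without enough history are skipped.
    """
    features = []
    for t in range(back_length - 1, len(bars)):
        vector = []
        for lag in range(back_length):
            vector.extend(math.log(value) for value in bars[t - lag])
        features.append(vector)
    return features


# toy daily bars, made up for illustration
bars = [
    (10.0, 10.4, 9.9, 10.2, 1200.0),
    (10.2, 10.6, 10.1, 10.5, 1500.0),
    (10.5, 10.7, 10.3, 10.4, 900.0),
    (10.4, 10.9, 10.4, 10.8, 1800.0),
]
features = lagged_log_features(bars, back_length=3)
print(len(features), len(features[0]))  # 2 rows, 3 bars * 5 fields = 15 columns
```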

Code

The code mainly consists of three parts:

  • Definition of technical indicators.
  • Two Python classes: classifier and stock.
  • Main program, including configuration parameters for the backtesting framework.

import numpy as np
import pandas as pd
import talib as tb
from xgboost import XGBClassifier
from dateutil.relativedelta import relativedelta
from sklearn.metrics import accuracy_score

# history_bars, order_target_percent and scheduler come from the backtesting
# framework; the __config__ block at the bottom suggests RQAlpha, which is assumed here
from rqalpha.api import *


def technical(df):
    # define pivot variables for easy use (NumPy arrays, as TA-Lib expects)
    open_ = df['open'].values  # trailing underscore avoids shadowing the built-in open()
    close = df['close'].values
    high = df['high'].values
    low = df['low'].values
    volume = df['volume'].values
    # Bollinger Bands: compute once, then use the upper/middle/lower bands
    upper, middle, lower = tb.BBANDS(close, timeperiod=5, nbdevup=2, nbdevdn=2, matype=0)
    # define the technical analysis matrix (one column per indicator after the transpose)
    retn = np.array([
        tb.MA(close, timeperiod=5),  # 1
        tb.MA(close, timeperiod=10),  # 2
        tb.MA(close, timeperiod=20),  # 3
        tb.MA(close, timeperiod=60),  # 4
        tb.MA(close, timeperiod=90),  # 5
        tb.MA(close, timeperiod=120),  # 6

        tb.ADX(high, low, close, timeperiod=20),  # 7
        tb.ADXR(high, low, close, timeperiod=20),  # 8

        tb.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)[0],  # 9
        tb.RSI(close, timeperiod=14),  # 10

        upper,  # 11
        middle,  # 12
        lower,  # 13

        tb.AD(high, low, close, volume),  # 14
        tb.ATR(high, low, close, timeperiod=14),  # 15

        tb.HT_DCPERIOD(close),  # 16

        tb.CDL2CROWS(open_, high, low, close),  # 17
        tb.CDL3BLACKCROWS(open_, high, low, close),  # 18
        tb.CDL3INSIDE(open_, high, low, close),  # 19
        tb.CDL3LINESTRIKE(open_, high, low, close),  # 20
        tb.CDL3OUTSIDE(open_, high, low, close),  # 21
        tb.CDL3STARSINSOUTH(open_, high, low, close),  # 22
        tb.CDL3WHITESOLDIERS(open_, high, low, close),  # 23
        tb.CDLABANDONEDBABY(open_, high, low, close, penetration=0),  # 24
        tb.CDLADVANCEBLOCK(open_, high, low, close),  # 25
        tb.CDLBELTHOLD(open_, high, low, close),  # 26
        tb.CDLBREAKAWAY(open_, high, low, close),  # 27
        tb.CDLCLOSINGMARUBOZU(open_, high, low, close),  # 28
        tb.CDLCONCEALBABYSWALL(open_, high, low, close),  # 29
        tb.CDLCOUNTERATTACK(open_, high, low, close),  # 30
        tb.CDLDARKCLOUDCOVER(open_, high, low, close, penetration=0),  # 31
        tb.CDLDOJI(open_, high, low, close),  # 32
        tb.CDLDOJISTAR(open_, high, low, close),  # 33
        tb.CDLDRAGONFLYDOJI(open_, high, low, close),  # 34
        tb.CDLENGULFING(open_, high, low, close),  # 35
        tb.CDLEVENINGDOJISTAR(open_, high, low, close, penetration=0),  # 36
        tb.CDLEVENINGSTAR(open_, high, low, close, penetration=0),  # 37
        tb.CDLGAPSIDESIDEWHITE(open_, high, low, close),  # 38
        tb.CDLGRAVESTONEDOJI(open_, high, low, close),  # 39
        tb.CDLHAMMER(open_, high, low, close),  # 40
        tb.CDLHANGINGMAN(open_, high, low, close),  # 41
        tb.CDLHARAMI(open_, high, low, close),  # 42
        tb.CDLHARAMICROSS(open_, high, low, close),  # 43
        tb.CDLHIGHWAVE(open_, high, low, close),  # 44
        tb.CDLHIKKAKE(open_, high, low, close),  # 45
        tb.CDLHIKKAKEMOD(open_, high, low, close),  # 46
        tb.CDLHOMINGPIGEON(open_, high, low, close),  # 47
        tb.CDLIDENTICAL3CROWS(open_, high, low, close),  # 48
        tb.CDLINNECK(open_, high, low, close),  # 49
        tb.CDLINVERTEDHAMMER(open_, high, low, close),  # 50
        tb.CDLKICKING(open_, high, low, close),  # 51
        tb.CDLKICKINGBYLENGTH(open_, high, low, close),  # 52
        tb.CDLLADDERBOTTOM(open_, high, low, close),  # 53
        tb.CDLLONGLEGGEDDOJI(open_, high, low, close),  # 54
        tb.CDLLONGLINE(open_, high, low, close),  # 55
        tb.CDLMARUBOZU(open_, high, low, close),  # 56
        tb.CDLMATCHINGLOW(open_, high, low, close),  # 57
        tb.CDLMATHOLD(open_, high, low, close, penetration=0),  # 58
        tb.CDLMORNINGDOJISTAR(open_, high, low, close, penetration=0),  # 59
        tb.CDLMORNINGSTAR(open_, high, low, close, penetration=0),  # 60
        tb.CDLONNECK(open_, high, low, close),  # 61
        tb.CDLPIERCING(open_, high, low, close),  # 62
        tb.CDLRICKSHAWMAN(open_, high, low, close),  # 63
        tb.CDLRISEFALL3METHODS(open_, high, low, close),  # 64
        tb.CDLSEPARATINGLINES(open_, high, low, close),  # 65
        tb.CDLSHOOTINGSTAR(open_, high, low, close),  # 66
        tb.CDLSHORTLINE(open_, high, low, close),  # 67
        tb.CDLSPINNINGTOP(open_, high, low, close),  # 68
        tb.CDLSTALLEDPATTERN(open_, high, low, close),  # 69
        tb.CDLSTICKSANDWICH(open_, high, low, close),  # 70
        tb.CDLTAKURI(open_, high, low, close),  # 71
        tb.CDLTASUKIGAP(open_, high, low, close),  # 72
        tb.CDLTHRUSTING(open_, high, low, close),  # 73
        tb.CDLTRISTAR(open_, high, low, close),  # 74
        tb.CDLUNIQUE3RIVER(open_, high, low, close),  # 75
        tb.CDLUPSIDEGAP2CROWS(open_, high, low, close),  # 76
        tb.CDLXSIDEGAP3METHODS(open_, high, low, close)  # 77
    ]).T
    return retn


class stock:
    def __init__(self, id):
        self.id = id
        self.train = None
        self.test = None

    def load(self, window_length=120, ratio=0, back_length=10, forward_length=60, test_size=0.25, verbose=False):

        # load daily bars through the backtesting framework's history_bars
        open_ = history_bars(self.id, window_length, '1d', 'open')
        close = history_bars(self.id, window_length, '1d', 'close')
        high = history_bars(self.id, window_length, '1d', 'high')
        low = history_bars(self.id, window_length, '1d', 'low')
        volume = history_bars(self.id, window_length, '1d', 'volume')
        df = pd.DataFrame({'open': open_, 'close': close, 'high': high, 'low': low, 'volume': volume}).astype(float)

        # use the technical function to load all the technical data
        tech = technical(df)

        # label: 1 if the discounted mean price change over the next
        # forward_length days, relative to today's close, exceeds ratio
        def label(df, ratio, back_length, forward_length):
            close = df['close']
            r = 0.04  # discount rate
            # -close.diff(-i) is close[t + i] - close[t]
            mean_diff = sum(-close.diff(-i) * np.exp(-r * i)
                            for i in range(1, forward_length + 1)) / forward_length
            mean_p_change = (mean_diff / close).iloc[back_length:-forward_length]
            return (mean_p_change > ratio).astype(int).values

        # features: technical indicators plus the log prices/volumes of the
        # current bar and of the previous back_length - 1 bars
        cols = ['open', 'close', 'high', 'low', 'volume']
        X_shift = [df[cols].shift(i) for i in range(back_length)]
        X = np.concatenate([tech, np.log(pd.concat(X_shift, axis=1).values)], axis=1)
        if forward_length == 0:
            # prediction mode: there are no labels, so return the features only
            return X[back_length:]
        X = X[back_length:-forward_length]
        y = label(df, ratio, back_length, forward_length)

        # chronological train/test split; sizes are based on the usable samples
        # (not on the raw window length) so that the two sets do not overlap
        n_samples = len(X)
        test_len = int(n_samples * test_size)
        train_len = n_samples - test_len
        X_train, X_test = X[:train_len], X[train_len:]
        y_train, y_test = y[:train_len], y[train_len:]

        # update the train and test dataset
        self.train = [X_train, y_train]
        self.test = [X_test, y_test]


class classifier:
    def __init__(self):
        self.model = XGBClassifier(n_estimators=300,
                                   learning_rate=0.1,
                                   max_depth=4,
                                   min_child_weight=4,
                                   subsample=0.6,
                                   colsample_bytree=0.5,
                                   seed=123)

    def train(self, data):
        X_train, y_train = data.train
        X_test, y_test = data.test

        # note: recent XGBoost releases expect eval_metric and
        # early_stopping_rounds on the XGBClassifier constructor instead of fit()
        self.model.fit(X_train, y_train,
                       eval_metric='error',
                       eval_set=[(X_train, y_train), (X_test, y_test)],
                       early_stopping_rounds=20,
                       verbose=False)

        # predict() already returns class labels, so no rounding is needed
        train_acc = accuracy_score(y_train, self.model.predict(X_train))
        test_acc = accuracy_score(y_test, self.model.predict(X_test))

        print('{} train acc: {:.2f}%, test acc: {:.2f}%'.format(data.id, train_acc * 100, test_acc * 100))
        return test_acc

    def predict(self, data):
        return self.model.predict(data)


def close_all(context):
    # sell every open position
    for s in context.portfolio.positions:
        order_target_percent(s, 0)


def modelize(context, bar_dict):
    # retrain and rebalance only on every (FL // 20)-th monthly call;
    # the counter is advanced on every call so the schedule actually cycles
    due = context.flag % (context.FL // 20) == 0
    context.flag += 1
    if not due:
        return None

    print('modelizing')
    print('-' * 49)

    close_all(context)

    context.pool = [
        '600799.XSHG',
        '600745.XSHG',
        '600721.XSHG',
        '000979.XSHE',
        '600703.XSHG',
        '000008.XSHE',
        '600139.XSHG',
        '000545.XSHE',
        '000703.XSHE',
        '600804.XSHG', ]
    context.stock = [stock(s) for s in context.pool]

    print('loading training dataset')
    for i in range(context.N):
        context.stock[i].load(window_length=context.WL,
                              ratio=context.ratio,
                              back_length=context.BL,
                              forward_length=context.FL)
        print(context.stock[i].id + ' loaded')
    print('-' * 49)

    print('training')
    for i in range(context.N):
        context.score[i] = context.clf[i].train(context.stock[i])
    print('-' * 49)

    print('predicting')
    sig = lambda s: '+' if s else '-'
    for i in range(context.N):
        # features for the most recent bars only (forward_length=0 returns X without labels)
        X_new = context.stock[i].load(window_length=context.BL * 2,
                                      ratio=context.ratio,
                                      back_length=context.BL,
                                      forward_length=0)
        # keep the latest prediction only if the model's test accuracy exceeds 90%
        context.signal[i] = context.clf[i].predict(X_new)[-1] * int(context.score[i] > 0.9)
        print('{}: {}'.format(context.stock[i].id, sig(context.signal[i])))
    print('-' * 49)

    # split the capital equally among the stocks with a positive signal
    tot = sum(context.signal)
    if tot:
        for i in range(context.N):
            if context.signal[i]:
                order_target_percent(context.pool[i], 1.0 / tot)
    else:
        close_all(context)


def init(context):
    context.N = 10  # number of stocks in pool
    context.BL = 20  # backward length
    context.FL = 60  # forward length
    context.WL = 240  # total window length
    context.ratio = 0.000  # critical ratio as threshold
    context.flag = 0
    context.pool = []
    context.stock = []
    context.signal = [0] * context.N
    context.score = [0] * context.N
    # one independent model per stock; [classifier()] * N would repeat a single shared instance
    context.clf = [classifier() for _ in range(context.N)]
    scheduler.run_monthly(modelize, tradingday=1)


__config__ = {
    "base": {
        "strategy_type": "stock",
        "start_date": "2011-01-01",
        "end_date": "2017-01-01",
        "frequency": "1d",
        "matching_type": "next_bar",
        "future_starting_cash": 1000000,
        "commission_multiplier": 0.01,
        "benchmark": "000300.XSHG",
    },
    "extra": {
        "log_level": "error",
    },
    "mod": {
        "progress": {
            "enabled": True,
            "priority": 400,
        },
    },
}
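The allocation rule at the end of `modelize` is easier to reason about when pulled out into a small pure function. The sketch below is my own restatement, not part of the strategy: the name `target_weights` and the sample numbers are hypothetical, and the 0.9 accuracy gate is taken from the code above. A stock is held only if its model predicts "up" and that model's test accuracy clears the gate; the capital is then split equally among the stocks that qualify.

```python
def target_weights(signals, scores, min_score=0.9):
    """Equal-weight the stocks whose model predicts 'up' (signal == 1) and
    whose test accuracy exceeds `min_score`; every other stock gets weight 0."""
    active = [int(sig == 1 and score > min_score)
              for sig, score in zip(signals, scores)]
    total = sum(active)
    if total == 0:
        # nothing qualifies: stay fully in cash
        return [0.0] * len(signals)
    return [a / total for a in active]


# made-up predictions and test accuracies for four stocks
weights = target_weights(signals=[1, 0, 1, 1], scores=[0.95, 0.99, 0.50, 0.92])
print(weights)
```

With these inputs the first and last stock each get half of the capital: the second is predicted to fall, and the third is predicted to rise but by a model that did not pass the accuracy gate.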

References:

  • Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
  • Dey, Shubharthi, et al. "Forecasting to Classification: Predicting the direction of stock market price using Xtreme Gradient Boosting."