Strategy: XGBoost Classification + Technical Indicators
2017-03-12
On many Kaggle public leaderboards you will see an algorithm called “XGBoost”, short for “eXtreme Gradient Boosting”. It is an optimized implementation of the gradient boosting framework. The key difference between ordinary boosting implementations and XGBoost, as far as I can see, is that XGBoost supports parallel tree construction without sacrificing accuracy. This makes large-scale machine learning (boosting is, after all, an ensemble of many weak learners) feasible on personal computers (even on my MBP 2015), and empowers people like me to apply it to more practical scenarios, e.g. the stock price prediction we are going to talk about today. Highest tribute to Tianqi Chen and his great paper!
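If you have never touched the library before, here is a minimal sketch of its scikit-learn-style wrapper on a purely made-up dataset (the data and hyper-parameters are illustrative only; `nthread` is what enables the parallel tree construction mentioned above):

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# toy binary classification problem
rng = np.random.RandomState(42)
X = rng.randn(1000, 20)
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# gradient boosting with parallel tree construction across 4 threads
model = XGBClassifier(n_estimators=100, max_depth=3, nthread=4, seed=42)
model.fit(X[:800], y[:800])
print(accuracy_score(y[800:], model.predict(X[800:])))
```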
Performance
| Statistic | Value |
| --- | --- |
| Backtest Interval | 2013-01-01 - 2017-01-01 |
| Initial Capital | 1,000,000 |
| Annualized Return (Strategy) | TBD |
| Annualized Return (Benchmark) | TBD |
| Sharpe Ratio | TBD |
| Maximum Drawdown | TBD |
(Performance screenshot not available at present. Will be uploaded later.)
Intuition: Long-term vs. Short-term Prediction
I got the inspiration from this paper. The author raises an interesting and convincing point about stock price prediction: the long-term trend is always easier to predict than the short-term one. Again, let’s take AAPL as an example. Apple Inc. closed at $139.14 per share today, a slight rise of 0.33 percent. What is the expected price change, in percentage terms, for, say, tomorrow? Assume you have access to all the historical data you want, regardless of the size of the dataset or other computational issues. What if a chief engineer suddenly decides to resign from the company and the news instantly leaks through social networks? What if Samsung or another phone maker launches its newest prototype right after your prediction? What if, just for fun, that new Samsung prototype should explode during the presentation, on the spot? The randomness is so volatile that our prediction could turn out to be the exact opposite of the true story. For long-term prediction, however, things get easier: the overall relationship between Apple and Samsung is unlikely to change in the near future, and the general performance of AAPL depends, in principle, on how Apple Inc. behaves over the following months. So GIVEN THAT COOK IS NOT INSANE, BUT A RATHER CAUTIOUS CEO, we can expect a higher accuracy when predicting the average price change of AAPL over the next three months (60 trading days or so).
Notice again that we are not talking about predicting the close of the day three months from now; we are talking about the AVERAGE price change over the following three months. Now let’s see what predictors we can use to implement this idea.
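To make the target concrete, here is a toy illustration (with made-up closing prices) of the “average price change over the next `forward_length` days”; the actual strategy below uses a discounted version of the same quantity:

```python
import pandas as pd

close = pd.Series([100.0, 101.0, 99.5, 102.0, 103.5, 104.0, 102.5, 105.0])
forward_length = 3

# -close.diff(-i) is close[t+i] - close[t], the i-day forward change
fwd_diff = sum(-close.diff(-i) for i in range(1, forward_length + 1)) / forward_length
mean_p_change = (fwd_diff / close)[:-forward_length]  # as a fraction of today's close
print((mean_p_change > 0).astype(int).values)  # 1 = average rise expected
```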
Tool: Technical Indicators & XGBoost Classification
We are using the classification algorithm in XGBoost. Regression would be more intuitive here, but it is slower, and binary classification is simply faster in this case, which allows for efficient backtesting later on. The prediction is based on a window of historical open, high, low, close and volume data, along with 77 technical indicators extracted from that window right before the prediction. This is of course far from “all historical data” in the market, not to mention off-market rumors, social and political concerns, etc., but it is enough for illustration here; a good amendment to this strategy would surely be to append those left-out indicators. For NLP and similar applications in stock price prediction, you can check my post about turning Mr. Trump’s tweets into prediction indicators. I didn’t actually write that project myself, because I found a really mature application already on GitHub, but I will definitely rewrite it during the summer vacation. If you find the post unchanged even after that, please feel free to mail me and “harshly” urge me about the overdue task.
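As a small taste of what those indicator features look like, here is a hypothetical two-indicator example with TA-Lib on random prices; the `technical()` function below builds 77 such columns the same way:

```python
import numpy as np
import talib as tb

close = np.random.RandomState(1).uniform(95.0, 105.0, size=200)
rsi = tb.RSI(close, timeperiod=14)   # NaN during the 14-day warm-up
ma20 = tb.MA(close, timeperiod=20)   # 20-day simple moving average
print(rsi[-3:], ma20[-3:])
```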
Code
The code mainly consists of three parts:
- Definitions of the technical indicators.
- Two Python classes: `classifier` and `stock`.
- The main program, including configuration parameters for the backtesting framework.
import numpy as np
import pandas as pd
import talib as tb
from xgboost import XGBClassifier
from dateutil.relativedelta import relativedelta
from sklearn.metrics import accuracy_score
def technical(df):
    # pull the raw OHLCV columns out as numpy arrays for TA-Lib
    open = df['open'].values
    close = df['close'].values
    high = df['high'].values
    low = df['low'].values
    volume = df['volume'].values
# define the technical analysis matrix
retn = np.array([
tb.MA(close, timeperiod=5), # 1
tb.MA(close, timeperiod=10), # 2
tb.MA(close, timeperiod=20), # 3
tb.MA(close, timeperiod=60), # 4
tb.MA(close, timeperiod=90), # 5
tb.MA(close, timeperiod=120), # 6
tb.ADX(high, low, close, timeperiod=20), # 7
tb.ADXR(high, low, close, timeperiod=20), # 8
tb.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)[0], # 9
tb.RSI(close, timeperiod=14), # 10
tb.BBANDS(close, timeperiod=5, nbdevup=2, nbdevdn=2, matype=0)[0], # 11
tb.BBANDS(close, timeperiod=5, nbdevup=2, nbdevdn=2, matype=0)[1], # 12
tb.BBANDS(close, timeperiod=5, nbdevup=2, nbdevdn=2, matype=0)[2], # 13
tb.AD(high, low, close, volume), # 14
tb.ATR(high, low, close, timeperiod=14), # 15
tb.HT_DCPERIOD(close), # 16
tb.CDL2CROWS(open, high, low, close), # 17
tb.CDL3BLACKCROWS(open, high, low, close), # 18
tb.CDL3INSIDE(open, high, low, close), # 19
tb.CDL3LINESTRIKE(open, high, low, close), # 20
tb.CDL3OUTSIDE(open, high, low, close), # 21
tb.CDL3STARSINSOUTH(open, high, low, close), # 22
tb.CDL3WHITESOLDIERS(open, high, low, close), # 23
tb.CDLABANDONEDBABY(open, high, low, close, penetration=0), # 24
tb.CDLADVANCEBLOCK(open, high, low, close), # 25
tb.CDLBELTHOLD(open, high, low, close), # 26
tb.CDLBREAKAWAY(open, high, low, close), # 27
tb.CDLCLOSINGMARUBOZU(open, high, low, close), # 28
tb.CDLCONCEALBABYSWALL(open, high, low, close), # 29
tb.CDLCOUNTERATTACK(open, high, low, close), # 30
tb.CDLDARKCLOUDCOVER(open, high, low, close, penetration=0), # 31
tb.CDLDOJI(open, high, low, close), # 32
tb.CDLDOJISTAR(open, high, low, close), # 33
tb.CDLDRAGONFLYDOJI(open, high, low, close), # 34
tb.CDLENGULFING(open, high, low, close), # 35
tb.CDLEVENINGDOJISTAR(open, high, low, close, penetration=0), # 36
tb.CDLEVENINGSTAR(open, high, low, close, penetration=0), # 37
tb.CDLGAPSIDESIDEWHITE(open, high, low, close), # 38
tb.CDLGRAVESTONEDOJI(open, high, low, close), # 39
tb.CDLHAMMER(open, high, low, close), # 40
tb.CDLHANGINGMAN(open, high, low, close), # 41
tb.CDLHARAMI(open, high, low, close), # 42
tb.CDLHARAMICROSS(open, high, low, close), # 43
tb.CDLHIGHWAVE(open, high, low, close), # 44
tb.CDLHIKKAKE(open, high, low, close), # 45
tb.CDLHIKKAKEMOD(open, high, low, close), # 46
tb.CDLHOMINGPIGEON(open, high, low, close), # 47
tb.CDLIDENTICAL3CROWS(open, high, low, close), # 48
tb.CDLINNECK(open, high, low, close), # 49
tb.CDLINVERTEDHAMMER(open, high, low, close), # 50
tb.CDLKICKING(open, high, low, close), # 51
tb.CDLKICKINGBYLENGTH(open, high, low, close), # 52
tb.CDLLADDERBOTTOM(open, high, low, close), # 53
tb.CDLLONGLEGGEDDOJI(open, high, low, close), # 54
tb.CDLLONGLINE(open, high, low, close), # 55
tb.CDLMARUBOZU(open, high, low, close), # 56
tb.CDLMATCHINGLOW(open, high, low, close), # 57
tb.CDLMATHOLD(open, high, low, close, penetration=0), # 58
tb.CDLMORNINGDOJISTAR(open, high, low, close, penetration=0), # 59
tb.CDLMORNINGSTAR(open, high, low, close, penetration=0), # 60
tb.CDLONNECK(open, high, low, close), # 61
tb.CDLPIERCING(open, high, low, close), # 62
tb.CDLRICKSHAWMAN(open, high, low, close), # 63
tb.CDLRISEFALL3METHODS(open, high, low, close), # 64
tb.CDLSEPARATINGLINES(open, high, low, close), # 65
tb.CDLSHOOTINGSTAR(open, high, low, close), # 66
tb.CDLSHORTLINE(open, high, low, close), # 67
tb.CDLSPINNINGTOP(open, high, low, close), # 68
tb.CDLSTALLEDPATTERN(open, high, low, close), # 69
tb.CDLSTICKSANDWICH(open, high, low, close), # 70
tb.CDLTAKURI(open, high, low, close), # 71
tb.CDLTASUKIGAP(open, high, low, close), # 72
tb.CDLTHRUSTING(open, high, low, close), # 73
tb.CDLTRISTAR(open, high, low, close), # 74
tb.CDLUNIQUE3RIVER(open, high, low, close), # 75
tb.CDLUPSIDEGAP2CROWS(open, high, low, close), # 76
tb.CDLXSIDEGAP3METHODS(open, high, low, close) # 77
]).T
return retn
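# a hypothetical sanity check for technical() (not part of the strategy):
# feed it random OHLCV bars and confirm the output is a (window, 77) matrix.
# Early rows are NaN while the longer indicators warm up, which XGBoost
# handles natively.
#
#   rng = np.random.RandomState(0)
#   c = 100 + np.cumsum(rng.randn(150))
#   df = pd.DataFrame({'open': c + 0.2 * rng.randn(150),
#                      'high': c + np.abs(rng.randn(150)),
#                      'low': c - np.abs(rng.randn(150)),
#                      'close': c,
#                      'volume': rng.uniform(1e5, 1e6, 150)})
#   technical(df).shape   # -> (150, 77)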
class stock:
def __init__(self, id):
self.id = id
self.train = None
self.test = None
def load(self, window_length=120, ratio=0, back_length=10, forward_length=60, test_size=0.25, verbose=False):
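        # note: with forward_length > 0 this method fills self.train and
        # self.test in place; with forward_length == 0 it instead returns the
        # feature matrix directly (used for live prediction in modelize below)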
        # load OHLCV history through the backtesting framework's history_bars
open = history_bars(self.id, window_length, '1d', 'open')
close = history_bars(self.id, window_length, '1d', 'close')
high = history_bars(self.id, window_length, '1d', 'high')
low = history_bars(self.id, window_length, '1d', 'low')
volume = history_bars(self.id, window_length, '1d', 'volume')
df = pd.DataFrame({'open': open, 'close': close, 'high': high, 'low': low, 'volume': volume}).astype(float)
        # use the technical function to compute all 77 indicator columns
        tech = technical(df)
        # label: discounted average forward price change over forward_length
        # days, turned into a binary class (1 if it exceeds `ratio`)
        def label(df, ratio, back_length, forward_length):
            close = df['close']
            r = 0.04  # daily discount rate applied to farther-out changes
            # -close.diff(-i) equals close[t+i] - close[t]
            mean_diff = sum([-close.diff(-i) * np.exp(-r * i) for i in range(1, forward_length + 1)]) / forward_length
            mean_p_change = (mean_diff / close)[back_length: -forward_length]
            return (mean_p_change > ratio).astype(int)
        # build the feature matrix: 77 indicators plus log OHLCV for the
        # current day and the previous back_length - 1 days
        X = df[['open', 'close', 'high', 'low', 'volume']]
        X_shift = [X]
        for i in range(1, back_length):
            X_shift.append(df[['open', 'close', 'high', 'low', 'volume']].shift(i))
        if forward_length == 0:
            # prediction mode: no labels can exist yet, return features directly
            return np.concatenate([tech, np.log(pd.concat(X_shift, axis=1).values)], axis=1)[back_length:]
        X = np.concatenate([tech, np.log(pd.concat(X_shift, axis=1).values)], axis=1)[back_length: -forward_length]
y = label(df, ratio, back_length, forward_length)
        # split into train and test sets; compute the lengths on the trimmed
        # feature matrix (len(X), not the raw window) so the sets never overlap
        test_len = int(len(X) * test_size)
        train_len = len(X) - test_len
        X_train, X_test = X[:train_len], X[train_len:]
        y_train, y_test = y[:train_len], y[train_len:]
        # store the train and test datasets
        self.train = [X_train, y_train]
        self.test = [X_test, y_test]
class classifier:
def __init__(self):
self.model = XGBClassifier(n_estimators=300,
learning_rate=0.1,
max_depth=4,
min_child_weight=4,
subsample=0.6,
colsample_bytree=0.5,
seed=123)
def train(self, data):
X_train, y_train = data.train[0], data.train[1]
X_test, y_test = data.test[0], data.test[1]
self.model.fit(X_train, y_train,
eval_metric='error',
eval_set=[(X_train, y_train), (X_test, y_test)],
early_stopping_rounds=20,
verbose=False)
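        # note: early stopping monitors the same held-out split that is scored
        # below, so the reported test accuracy is slightly optimistic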
        # evaluate predictions on both splits
        train_acc = accuracy_score(y_train, self.model.predict(X_train))
        test_acc = accuracy_score(y_test, self.model.predict(X_test))
print('{} train acc: {:.2f}%, test acc: {:.2f}%'.format(data.id, train_acc * 100, test_acc * 100))
return test_acc
def predict(self, data):
return self.model.predict(data)
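# a minimal offline check of the classifier wrapper: any object with .id,
# .train and .test attributes will do (the _FakeStock below is hypothetical
# and not used by the strategy itself):
#
#   class _FakeStock:
#       id = 'TEST'
#       def __init__(self, X, y, split=150):
#           self.train = [X[:split], y[:split]]
#           self.test = [X[split:], y[split:]]
#
#   rng = np.random.RandomState(0)
#   X = rng.randn(200, 10)
#   y = (X[:, 0] > 0).astype(int)
#   classifier().train(_FakeStock(X, y))   # prints train/test accuracy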
def close_all(context):
for s in context.portfolio.positions:
order_target_percent(s, 0)
def modelize(context, bar_dict):
    # rebuild the models only once every FL // 20 months, i.e. once per
    # prediction horizon
    if context.flag % (context.FL // 20) != 0:
        context.flag += 1
        return
    context.flag += 1
    print('modelizing')
    print('-' * 49)
    close_all(context)
context.pool = [
'600799.XSHG',
'600745.XSHG',
'600721.XSHG',
'000979.XSHE',
'600703.XSHG',
'000008.XSHE',
'600139.XSHG',
'000545.XSHE',
'000703.XSHE',
        '600804.XSHG',
    ]
    context.stock = [stock(s) for s in context.pool]
    print('loading training dataset')
for i in range(context.N):
context.stock[i].load(window_length=context.WL,
ratio=context.ratio,
back_length=context.BL,
forward_length=context.FL)
print(context.stock[i].id + ' loaded')
print('-' * 49)
print('training')
for i in range(context.N):
context.score[i] = context.clf[i].train(context.stock[i])
print('-' * 49)
    print('predicting')
    sig = lambda s: '+' if s else '-'
    # act on a positive signal only when that stock's classifier scored
    # above 90% test accuracy
for i in range(context.N):
context.signal[i] = context.clf[i].predict(context.stock[i].load(window_length=context.BL * 2,
ratio=context.ratio,
back_length=context.BL,
forward_length=0))[-1] * \
int(context.score[i] > 0.9)
print('{}: {}'.format(context.stock[i].id, sig(context.signal[i])))
print('-' * 49)
tot = sum(context.signal)
if tot:
for i in range(context.N):
if context.signal[i]:
order_target_percent(context.pool[i], 1 / tot)
else:
close_all(context)
def init(context):
context.N = 10 # number of stocks in pool
context.BL = 20 # backward length
context.FL = 60 # forward length
context.WL = 240 # total window length
context.ratio = 0.000 # critical ratio as threshold
context.flag = 0
context.pool = []
context.stock = []
context.signal = [0] * context.N
context.score = [0] * context.N
    context.clf = [classifier() for _ in range(context.N)]  # one independent model per stock
scheduler.run_monthly(modelize, tradingday=1)
__config__ = {
"base": {
"strategy_type": "stock",
"start_date": "2011-01-01",
"end_date": "2017-01-01",
"frequency": "1d",
"matching_type": "next_bar",
"future_starting_cash": 1000000,
"commission_multiplier": 0.01,
"benchmark": "000300.XSHG",
},
"extra": {
"log_level": "error",
},
"mod": {
"progress": {
"enabled": True,
"priority": 400,
},
},
}
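One environment note: `history_bars`, `order_target_percent`, `scheduler` and the `__config__` dict are not defined in this file; they are injected by the RQAlpha-style backtesting framework the strategy runs on. So the code above only works inside that framework (locally, something like `rqalpha run -f <your_strategy_file>.py` should do), not as a standalone script.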
References:
- Chen, Tianqi, and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
- Dey, Shubharthi, et al. “Forecasting to Classification: Predicting the direction of stock market price using Xtreme Gradient Boosting.”