This is a simple print function, overridden so that you can specify different colors in terminal output.
To use this feature, you'll need to import the customized print function from the ColorPrint package; the GitHub repo is here.
from ColorPrint import print
and the output is as below (in Terminal):
I find this especially useful when you're trying to focus on a command-line-only workflow and don't want to reinvent the wheel over and over again.
This is the first post of my ambitious plan to enumerate as many key points about the C++ language as I can. These notes are for personal review purposes only and shall definitely not be used commercially by anyone. Just please comment below about any missing C++ syntax or features. 👍🏻
Basically there's only one thing that needs attention: for standard libraries we use angle brackets < >, and for local headers we use quotes.
#include <iostream>   // standard library header
#include "Person.h"   // local header
Header files are where we declare the functions and classes we want to use or implement in main files.
In C++, by including the iostream library we can read and write:
int x;
std::cin >> x;
std::cout << x << std::endl;
Normally library names live in namespaces; e.g. for iostream we need the std prefix every time we print something. With using-declarations we can reduce redundancy.
using std::cout;
using std::endl;
cout << "Hello, world!" << endl;
We may also use using namespace std;
which sometimes can cause problems.
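As a minimal sketch of the tradeoff (the function names here are made up for illustration), compare full qualification with a targeted using-declaration:

```cpp
#include <cassert>
#include <string>

// Fully qualified: every use spells out the namespace.
std::string greet() { return "hello"; }

// A using-declaration imports a single name, which reduces redundancy
// without the wholesale import of `using namespace std;`.
using std::string;
string greet2() { return "world"; }
```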
There are a variety of data types in C++. For real numbers we have
| Type | Bytes | Range |
| --- | --- | --- |
| float | \(4\) | \(\pm 3.4E{\pm}38\) |
| double | \(8\) | \(\pm 1.7E{\pm}308\) |

where we should pay attention to the \(\pm\). For general integers we have

| Type | Bytes | Range |
| --- | --- | --- |
| short | \(2\) | \(-2^{15}\) to \(2^{15}-1\) |
| int | \(4\) | \(-2^{31}\) to \(2^{31}-1\) |
| long | \(8\) | \(-2^{63}\) to \(2^{63}-1\) |
and for each type we also have an unsigned version that starts from 0 and covers the same length of range.
We may notice that long
has a smaller range than float
despite the fact that the first data type actually costs more bytes than the latter. This is because the 4 bytes (or 32 bits) of a float
\(V\) are not stored uniformly in RAM, but rather
\[V = (-1)^S \cdot M \cdot 2^E\]
where \(S\) is the first bit, and \(E\) the second through the ninth bits, and \(M\) for the tenth and so forth. So in a sense, because float
is more "sparse", the long
type has a smaller range.
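A quick numeric check of this claim (the helper names are mine; since the width of long is platform-dependent, I use long long for a guaranteed 64-bit integer):

```cpp
#include <cassert>
#include <limits>

// float reaches about 3.4e38 with only 4 bytes because 8 of its 32
// bits encode the exponent E; a 64-bit integer, despite costing more
// bytes, tops out near 9.2e18 because every bit encodes magnitude.
double float_max() { return std::numeric_limits<float>::max(); }
double llong_max() { return static_cast<double>(std::numeric_limits<long long>::max()); }
```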
Apart from other fundamental types like char
and bool
, we can also define our own data types or use types defined in libraries, e.g. std::string
. We may also use type aliases like
typedef double OptionPrice;
We have operators for fundamental types:
| Function | Operator |
| --- | --- |
| assignment | = |
| arithmetic | + - * / |
| comparison | > < <= >= |
| equality/inequality | == != |
| logical | && \|\| |
| modulo | % |
In C++ there's a set of shortcuts as follows:

| Full Operator | Shortcut |
| --- | --- |
| i = i + 1; | i++; i += 1; |
| i = i - 1; | i--; i -= 1; |
| i = i * 2; | i *= 2; |
| i = i / 2; | i /= 2; |
We may also use the prefix and postfix increments in assignment, which are totally different. After

int x = 3;
int y = x++;  // postfix: assign first, then increment

we have \(x = 4\) and \(y = 3\). After

int x = 3;
int y = ++x;  // prefix: increment first, then assign

we have \(x = y = 4\).
A general template for a C++ function:
resType f(argType1 arg1, argType2 arg2, ...) {
    // function body
    return res;
}
Notice we may write multiple functions with the same name but different parameter lists, which we call function overloading. Meanwhile, even without overloading we can still pass an int to a function written for double, because int takes up fewer bytes and the implicit conversion is safe. We call this widening or promotion. In contrast, narrowing can be dangerous and causes a build warning.
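A minimal sketch of both ideas (the function names are mine, for illustration only):

```cpp
#include <cassert>

// Overloading: same name, different parameter lists.
int half(int x) { return x / 2; }
double half(double x) { return x / 2.0; }

// Promotion: with only a double version available, an int argument
// is widened implicitly and safely.
double quarter(double x) { return x / 4.0; }
```

Calling half(3) picks the int overload, half(3.0) the double one, and quarter(8) promotes 8 to 8.0.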
In C++ we have two kinds of comments.

// This is an inline comment
/* This is a
   block comment */
In C++, people usually pass variables into functions in two ways: by value or by reference. The first creates a copy of the variable, and nothing happens to the original one. With the second, anything we do inside the function takes effect on the original variable itself. A reference must be bound to an existing variable at the moment it is declared, so
int x = 1;
int& rx = x;  // bound at declaration: OK
will compile, while the following will not:
int x = 1;
int& rx;  // error: a reference must be initialized
rx = x;
References can be extremely useful, especially when the original variable is a large object and making a copy costs considerable time and memory. However, this is potentially risky when we don't want to mess up with the original object when calling a function. So we need const references.
There are two situations we should take care of when using the const keyword with references. First, we can make a reference to a const variable, and we cannot change its value through the reference:
const int x = 1;
const int& rx = x;  // rx cannot be used to modify x
We may also bind a const reference to a variable even when the original itself is not const:

int y = 1;
const int& ry = y;  // a read-only view of a mutable variable
In this case we avoid making a copy while also keeping the original variable safe from unexpected editing.
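A small sketch contrasting the two kinds of references (all names here are illustrative):

```cpp
#include <cassert>
#include <string>

// const reference: no copy is made and the callee is read-only.
std::size_t length_of(const std::string& s) { return s.size(); }

// plain reference: the callee edits the caller's object.
void shout(std::string& s) { s += "!"; }

// wrapper so the mutation is observable from outside
std::string demo_ref() {
    std::string s = "hi";
    shout(s);
    return s;
}
```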
There is a third way of passing a variable: pointers. Pointers are variables that store the memory addresses of other variables. We declare a pointer by

int* pi; // legal but bad without initialization

which comes with two unique operators: & for the address of a variable, and * for dereferencing a pointer.
int i = 123;
int* pi = &i;
std::cout << *pi << std::endl;
123
You can create a pointer to a piece of dynamic memory for later deletion, in case memory is an issue in your program.
int* p = new int;
*p = 10;
delete p;  // free the dynamic memory when done
You can have pointers to a const variable, i.e. you cannot change its value through the pointer.
const int x = 1;
const int* px = &x;  // *px = 2 would not compile
You can also have const pointers to variables, then you can change the value of the variable but never again the pointer (address) itself.
int x = 1;
int* const px = &x;  // *px = 2 is OK; px = &y is not
You can also have const pointers to const variables.
const int x = 1;
const int* const px = &x;  // neither *px nor px can change
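The three combinations can be put side by side in one sketch (the helper name is mine; the comments mark what would not compile):

```cpp
#include <cassert>

int demo_const_pointers() {
    int x = 1;
    const int c = 5;

    const int* p1 = &c;        // pointer to const: *p1 = 0; would not compile
    int* const p2 = &x;        // const pointer: reseating p2 would not compile
    const int* const p3 = &c;  // const pointer to const: neither may change
    *p2 = 7;                   // allowed: the pointee of p2 is mutable
    return *p1 + *p2 + *p3;    // 5 + 7 + 5
}
```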
Below is a general template for if/else
structures in C++.
if (condition1) {
    // ...
} else if (condition2) {
    // ...
} else {
    // ...
}
When there're multiple conditions, we can also use the switch
keyword.
switch (expression) {
    case value1:
        // ...
        break;
    default:
        // ...
}
One of the most popular loops is while
loop.
while (condition) {
    // ...
}
It also has a variant called the do/while
loop.
do {
    // ...
} while (condition);
which is slightly different from the while loop in its order of execution.
Another form of loop that keeps track of the iterator precisely.
for (initializer; condition; statement1) {
    // ...
}
There is an unwritten rule that we usually write ++i in statement1 because, compared with i++, which needs to make a copy, ++i is more efficient. However, this rule is debatable because modern compilers can surely optimize this away.
A simple but intuitive example of classes is describing a person in C++ (here we assume the type string under the namespace std is used):
class Person {
public:
    string name;
    string email;
    int stu_id;
};
We can also implement member functions in the class, just to make it more convenient:
1  class Person { 
We have three levels of data protection in a class: public, protected and private.
This means we can protect data in the class by declaring them as private while get access to them via public member functions:
class Person {
public:
    string GetEmail();
private:
    string name_;
    string email_;
    int stu_id_;
};
An instance created based on a class is called an object. To create an object, we may need a constructor, a copy constructor and a destructor.
1  class Person { 
Note the trailing underscore in private member names of Person, like name_. According to this coding style we have in Person.h
class Person {
public:
    Person();
    Person(const Person& another_person);
    ~Person();
    string GetEmail();
private:
    string name_;
    string email_;
    int stu_id_;
};
In Person.cpp
we implement the member functions of the class:
string Person::GetEmail() {
    return email_;
}
Just keep in mind that the constructors as well as the destructor should also be implemented:
Person::Person() {
    name_ = "";
    email_ = "";
    stu_id_ = 0;
}
We can also use the colon syntax for constructors:
1  Person::Person() : name_(""), email_(""), stu_id_(0) {} 
A struct is a class with one difference: struct members are public by default, while class members are private by default.
struct Person {
    string name;  // public by default
};
For a newly created class we cannot use person2 = person1
if we want to assign the whole object person1
to person2
. We have to use constructors. What we can do, instead, is to overload these operators (e.g. the assignment operator =
) specifically for the class.
The overloadable operators include + - * / % ^ & | ~ ! = < > <= >= ++ -- << >> == != && || += -= *= /= &= |= ^= %= <<= >>= [] () -> ->* new new[] delete delete[].
The non-overloadable operators are :: .* . ?:.
void Person::operator=(const Person& another_person) {
    name_ = another_person.name_;
    email_ = another_person.email_;
    stu_id_ = another_person.stu_id_;
}
However, such overloading does not support chain assignment like person3 = person2 = person1
. We need to return a reference in order to support that.
Person& Person::operator=(const Person& another_person) {
    name_ = another_person.name_;
    email_ = another_person.email_;
    stu_id_ = another_person.stu_id_;
    return *this;
}
where this
is a pointer pointing to the object itself.
Another concern is self-assignment, which in some cases can be dangerous and in almost every situation is inefficient. To avoid self-assignment we need to detect and skip it.
Person& Person::operator=(const Person& another_person) {
    if (this == &another_person) return *this;  // skip self-assignment
    name_ = another_person.name_;
    email_ = another_person.email_;
    stu_id_ = another_person.stu_id_;
    return *this;
}
In C++, a function can only be defined once. This is called the One Definition Rule (ODR). To avoid multiple inclusion of header files, we use include guards, made by defining a macro at the beginning of each header file.
#ifndef PERSON_H
#define PERSON_H
// header contents
#endif
Here we introduce two of the most useful containers in the C++ Standard Library: std::vector
and std::map
. To initialize an empty vector, we use
std::vector<int> v;
and to initialize with a specific size, we do

std::vector<int> v(10);
On the other hand, map containers are like dict in Python, which allow you to use indices of almost any type, e.g. std::string.
std::map<std::string, int> m;
m["age"] = 18;
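A short sketch exercising both containers (the names and values are arbitrary):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

int demo_containers() {
    std::vector<int> v;        // empty vector
    std::vector<int> w(3, 7);  // size 3, every element 7
    v.push_back(1);

    std::map<std::string, int> ages;  // keyed by string, like a Python dict
    ages["alice"] = 30;
    return static_cast<int>(v.size()) + w[0] + ages["alice"];  // 1 + 7 + 30
}
```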
Data abstraction refers to the separation of interface (public functions of the class) and implementation:
Encapsulation refers to combining data and functions inside a class so that data is only accessed through the functions in the class.
We can declare a function or class as friend s.t. it can access the private and protected members of the declaring class.
class MyClass {
    friend void change_data(MyClass& obj);
private:
    int my_data;
};
and you can implement and use the function change_data globally to change my_data.
Inheritance refers to building new classes based on existing classes, trying to:
A simple example would be
class Student {
public:
    string name;
    string email;
    string major;
};
with meanwhile
class Employee {
public:
    string name;
    string email;
    // plus employee-specific members
};
Apparently a lot of functions and data are repeated. What we're going to do is build a base class and reuse it in two derived classes. Note:
In actual coding, this is what we do:
class Person {
public:
    Person(string name, string email);
protected:
    string name_;
    string email_;
};
with
class Student : public Person {
public:
    Student(string name, string email, string major);
private:
    string major_;
};
and
class Employee : public Person {
    // employee-specific members
};
To initialize a base class, we define constructors just like what we did before:
1  Person::Person(string name, string email) : name_(name), email_(email) {} 
while for derived classes, we need to call the base class constructor
1  Student::Student(string name, string email, string major) : Person(name, email), major_(major) {} 
A derived class can access members in the base class, subject to protection-level restrictions. Protection levels public and private have their regular meanings in an inheritance class hierarchy:
- A derived class cannot access private members of a base class.
- A derived class can access public members of a base class.

A derived class can also access protected members of a base class. If a class has protected members:
- That class can access them.
- A derived class of that class can access them.
- Everyone else cannot access them.
A base class uses the virtual keyword to allow a derived class to override (provide a different implementation for) a member function. If a function is virtual in the base class:
- The base class provides an implementation for that function; we call it the default implementation.
- Derived classes inherit the function interface (declaration) as well as the default implementation.
- A derived class can provide a different implementation for that function (but it does not have to).
class Base1 {
public:
    virtual void Fun1();
    void Fun2();
};
and then functions like Fun1 will be overridable in derived classes. Note that the base class has to implement all its functions, no matter whether they're virtual or not.
If we don't give a default implementation of a virtual function, we call it pure virtual. This is done by appending = 0 to the declaration.
class Base2 {
public:
    virtual void Fun1() = 0;
};
In this case the base class does not need to implement Fun1, and in contrast, the derived class must do so. A class with pure virtual functions is called an abstract class. Note that we cannot instantiate (make an object of) an abstract class; only a derived class that implements every pure virtual function can be instantiated.
There's a slight difference between normal member functions, virtual functions and pure virtual functions during inheritance.
We can use a pointer or a reference to a base class to refer to an object of a derived class; this substitutability is known as the Liskov Substitution Principle (LSP).
Option* option1 = nullptr;
option1 = new EuropeanCall();  // base-class pointer to a derived object
A more direct example may be as follows. Instead of writing separately
1  double Price(EuropeanCall option, ...) { 
we can use polymorphism and write it w.r.t. the base class Option
using a reference or pointer
1  double Price(Option& option, ...) { 
For variables we declare constancy by
1  const int val = 10; 
For constant objects, e.g.
1  class Student { 
when we call
const Student a("Allen", "allen@gmail.com");
a.GetName();
we get a compile error. This is because the compiler does not know that the function GetName is const. To declare that, we need
class Student {
public:
    string GetName() const;  // const member function
};
When we have pure virtual constant member functions, we write like this: virtual type f(...) const = 0
.
A const member function cannot modify data members. The only exception to this rule is mutable data members.
class Student {
public:
    void Ping() const { count_++; }  // OK only because count_ is mutable
private:
    mutable int count_ = 0;
};
The override
keyword serves two purposes:
class base {
public:
    virtual void foo(int x) = 0;
};
In the implementation of the pure virtual function foo in the derived class derived1, we're doing just as told by the base class. In derived2, with the override keyword we'll get an error for changing the function's signature when overriding; without this keyword we'd get at most a warning.
For a non-static member, changing one instance's data affects that instance only; nothing happens to other instances of the same class. A static member function or data member is shared across all instances, so changing it once changes it for all.
class Counter {
public:
    static int count;
    static void Add() { count++; }
};
A regular function is generally
1  int AddOne(int x) {return x + 1;} 
while a function object implementation is
class AddOne {
public:
    int operator()(int x) const { return x + 1; }
};
and for the latter we can use its instances as objects, which still work as functions.
1  vector<int> values{1, 2, 4, 5}; 
where AddOne()
is an unnamed instance of the class AddOne
.
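Putting the pieces together in a runnable sketch (assuming std::transform as the algorithm consuming the functor; the helper name is mine):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A function object: operator() lets instances be called like functions.
class AddOne {
public:
    int operator()(int x) const { return x + 1; }
};

std::vector<int> demo_functor() {
    std::vector<int> values{1, 2, 4, 5};
    // AddOne() is an unnamed instance applied to every element.
    std::transform(values.begin(), values.end(), values.begin(), AddOne());
    return values;
}
```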
In C++ we can also define a function inline with a lambda expression:

auto f = [](int x, int y) { return x + y; };
The []
is called the capture operator and it has rules as follows.
- [=] captures everything by value (read but no write access)
- [&] captures everything by reference (read and write access)
- [=, &x] captures everything by value, except x by reference
- [&, x] captures everything by reference, except x by value

Below we introduce some features in the STL.
Two of the most commonly seen methods are begin() and end(). Among the algorithms we have binary_search, for_each, find_if and sort.
In the STL, algorithms are implemented as functions, and data structures as containers.
1  int main() { 
1  int main() { 
1  bool PersonSortCriterion(const Person& p1, const Person& p2) { 
Combining STL algorithms with lambdas in C++ can be very efficient. We can use a lambda in a loop without defining a function beforehand.
vector<int> v{1, 3, 2, 4, 6};
for_each(v.begin(), v.end(), [](int x) { cout << x << " "; });
We can also use a lambda as a sorting criterion:
std::vector<Person> ppl;
std::sort(ppl.begin(), ppl.end(),
          [](const Person& p1, const Person& p2) { return p1.GetName() < p2.GetName(); });
We can have templates of a function:
1  template <class T> T sum(T a, T b) {return a + b; } 
We can also have templates of a class:
1  template <class T> 
1  int x, y; 
1 

Specifically, for the open modes we have
| Mode | Description |
| --- | --- |
| ios::app | Append to the end |
| ios::ate | Go to the end of the file on opening |
| ios::binary | Open in binary mode |
| ios::in | Open file for reading only |
| ios::out | Open file for writing only |
| ios::nocreate | Fail if you have to create it |
| ios::noreplace | Fail if you have to replace it |
| ios::trunc | Remove all content if the file exists |
It's been months since my last update on cryptocurrency arbitrage strategies. The original version has been completely driven off the market, and thus I decided to develop a new one. The market is primitive and savage in many senses, by which I mean there are supposed to be plenty of inefficiencies and corresponding arbitrage opportunities.
At the top of the page is the backtest PnL of the new strategy 4 from 01/01 up to yesterday, 07/26. I used 1-minute historical orderbook data and 5 spreads for slippage (not sure if that's still too conservative; it needs testing), and benchmarked against the simplest buy-and-hold strategy. It's known that the whole crypto market has experienced a huge slump since late last year, so I guess my trick works quite well. The strategy is now running on my AWS with real money, and I'll update this post whenever any interesting (or frustrating) issue happens.
Cheers.
Update Aug 3:
I changed the screening window length and the performance (of backtest) increased over tenfold. The image on the top has been updated with Sharpe ratios labelled.
It's not hard to write a swap function. The most orthodox way, used in C++ or Java, is with a temporary variable. For example, say we have a = 0 and b = 1, and we'd like to swap the values of these two variables. The pseudocode shall be something as below.
temp = a
a = b
b = temp
However, a more "Pythonic" way to do so is by literally "swapping" the values in place. Specifically, we don't even need to define a function for it, so the title picture is actually nonsense.
a, b = b, a
How is that handled inside Python? Before answering that question, how is "Pythonic" defined? Well, Pythonic means code that doesn't just get the syntax right but that follows the conventions of the Python community and uses the language in the way it is intended to be used (Abien Fred Agarap^{[1]}). Talking about the conventions of the Python community, we won't be able to miss the famous Zen of Python:
1  import this 
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
So our one-line swapping exactly follows these supreme principles: it's beautiful, explicit, simple and perfectly readable. The only remaining question is, what happens when we call a, b = b, a, and what are the technical differences between this lazy trick and the orthodox one?
Well, here is the thing. Just like most other programming languages, Python handles assignment statements in a right-to-left manner. So before it actually assigns the value of a to b and vice versa, Python packages the RHS as a tuple temporarily stored in memory. Then it assigns the values of this tuple to the LHS in order. That's it. As a result, unlike the orthodox swap function, whose temporary variable temp stays in memory until collected manually (if we're in the global environment) or until the function is destroyed, the Pythonic swap briefly occupies extra memory for the tuple yet frees it automatically thanks to Python's garbage collection. That's a tradeoff, and in cases where available memory is critically short, we might prefer the more orthodox swap function.
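The hidden tuple step can be spelled out in a small sketch (the helper name is mine):

```python
def pythonic_swap(a, b):
    """Equivalent of `a, b = b, a`, with the intermediate tuple made explicit."""
    rhs = (b, a)  # Python first packs the right-hand side into a tuple
    a, b = rhs    # then unpacks it into the targets, left to right
    return a, b
```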
Just as a supplement, there is in fact a way to swap in place while avoiding the extra memory. The trick is illustrated as follows.
a = a + b
b = a - b
a = a - b
In the case of large integers, we may also use XOR:
a = a ^ b
b = a ^ b
a = a ^ b
My trading bot just ceased its loyal 24/7 service this morning. It's running on an Amazon EC2 server with Ubuntu 16.04, and I'm sure this time I'm not having an unpaid-bill issue any more. After some time digging I think I've finally figured out the cause of this unexpected strike: asynchronism.
Asynchronism, or in simple terms, timing discrepancy, usually means a tiny difference between the local time on your computer/server and the global NTP time. It can be as undetectable as several milliseconds, but in some applications like trading such discrepancies are reckoned intolerable, and any request sent from such computers/servers is ruthlessly rejected. Computers are just machines, and they cannot stay accurate in time forever. That's why we need (time) synchronization. In fact, EC2 does have such regular synchronization built in, but it seems to happen only once in a rather long period, like days. In order to shorten the synchronization period and avoid similar issues in the future, I'll need the Amazon Time Sync Service.
First we install the chrony
package for synchronization, and open its configuration.
sudo apt install chrony
sudo nano /etc/chrony/chrony.conf
Append the following line to the opened chrony.conf file.
server 169.254.169.123 prefer iburst
Restart the chrony service.
sudo /etc/init.d/chrony restart
[ ok ] Restarting chrony (via systemctl): chrony.service.
Make sure that chrony is successfully synchronizing time from 169.254.169.123.
chronyc sources -v
210 Number of sources = 7
(legend: '^' = server; '*' = current synced, '+' = combined, '-' = not combined, '?' = unreachable)

MS Name/IP address            Stratum Poll Reach LastRx Last sample
===============================================================================
^* 169.254.169.123                3    6    17     12     +15us[  +57us] +/-  320us
^- tbag.heanet.ie                 1    6    17     13   -3488us[-3446us] +/- 1779us
^- ec2123423112.euwest            2    6    17     13    +893us[ +935us] +/- 7710us
^? 2a05:d018:c43:e312:ce77:6      0    6     0    10y      +0ns[   +0ns] +/-    0ns
^? 2a05:d018:d34:9000:d8c6:5      0    6     0    10y      +0ns[   +0ns] +/-    0ns
^? tshirt.heanet.ie               0    6     0    10y      +0ns[   +0ns] +/-    0ns
^? bray.walcz.net                 0    6     0    10y      +0ns[   +0ns] +/-    0ns
where ^*
denotes the preferred time source.
Finally, check the synchronization report.
chronyc tracking
Reference ID    : 169.254.169.123 (169.254.169.123)
Stratum         : 4
Ref time (UTC)  : Thu Jul 12 16:41:57 2018
System time     : 0.000000011 seconds slow of NTP time
Last offset     : +0.000041659 seconds
RMS offset      : 0.000041659 seconds
Frequency       : 10.141 ppm slow
Residual freq   : +7.557 ppm
Skew            : 2.329 ppm
Root delay      : 0.000544 seconds
Root dispersion : 0.000631 seconds
Update interval : 2.0 seconds
Leap status     : Normal
In conclusion, the server now synchronizes time with the assigned source every 2.0 seconds, and we should never encounter similar issues again.
This was the last photo taken before we left Giethoorn, a small yet heavenly village. Hundreds of land fragments are surrounded by tiny rivers and connected by wooden bridges barely longer than a car. Speaking of cars, the village is car-free, and people commute by boat or bike. We also loved the thatched-roof houses, which I suppose had been standing there for centuries, along with the wheat fields and the huge reed marshes.
The photo is probably my favorite shot of the past two years, if it beats the foggy-morning one taken in Hallstatt, Austria, at the foot of the Alps.
This is a note on Linear Discriminant Analysis (LDA) and an original Regularized Matrix Discriminant Analysis (RMDA) method proposed by Jie Su et al., 2018. Both methods are suitable for efficient multiclass classification, while the latter is a state-of-the-art version of the classical LDA method s.t. data in matrix form can be classified without destroying the original structure.
The plain idea behind discriminant analysis is to find the optimal partition (or projection, for higher-dimensional problems) s.t. entities within the same class are distributed as compactly as possible and entities between classes are distributed as sparsely as possible. To derive closed-form solutions we put various conditions on the covariance matrices of the input data. When we assume the covariances \(\boldsymbol{\Sigma}\_k\) are equal for all classes \(k\in\{1,2,\ldots,K\}\), we're following the framework of Linear Discriminant Analysis (LDA).
As shown above, when we consider a 2-dimensional binary classification problem, LDA is equivalent to finding the optimal direction vector \(\boldsymbol{w}\) s.t. the ratio of \(\boldsymbol{w}^T\boldsymbol{S}\_b\boldsymbol{w}\) (the between-class covariance of the projections) to \(\boldsymbol{w}^T\boldsymbol{S}\_w\boldsymbol{w}\) (the sum of within-class covariances of the projections) is maximized. Specifically, we define
\[\boldsymbol{S}_b = (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)^T\]
and
\[\boldsymbol{S}_w = \sum_{\boldsymbol{x}\in X_0}(\boldsymbol{x} - \boldsymbol{\mu}_0)(\boldsymbol{x} - \boldsymbol{\mu}_0)^T + \sum_{\boldsymbol{x}\in X_1}(\boldsymbol{x} - \boldsymbol{\mu}_1)(\boldsymbol{x} - \boldsymbol{\mu}_1)^T.\]
Therefore, the objective of this maximization problem is
\[J = \frac{\boldsymbol{w}^T\boldsymbol{S}_b\boldsymbol{w}}{\boldsymbol{w}^T\boldsymbol{S}_w\boldsymbol{w}}\]
which is also called the generalized Rayleigh quotient.
The objective is homogeneous in \(\boldsymbol{w}\), so the problem can be equivalently written as
\[\begin{align}\min_{\boldsymbol{w}}\quad &-\boldsymbol{w}^T\boldsymbol{S}_b\boldsymbol{w}\\\\\text{s.t.}\quad &\boldsymbol{w}^T\boldsymbol{S}_w\boldsymbol{w} = 1\end{align}\]
which, by the method of Lagrange multipliers, gives the solution
\[\boldsymbol{w} = \boldsymbol{S}_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)\]
and the final prediction for new data \(\boldsymbol{x}\) is based on the scale of \(\boldsymbol{w}^T\boldsymbol{x}\).
For multiclass classification, the solution is similar. Here we propose the score function below without derivation:
\[\delta_k = \boldsymbol{x}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \log\pi_k\]
where \(\boldsymbol{\mu}\_k\) is the sample mean of all data within class \(k\), and \(\pi_k\) is the proportion of all data belonging to this class. By comparing these \(K\) scores we take the class with the highest value as the prediction.
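A minimal numpy sketch of this score (the function and argument names are mine, not from the paper):

```python
import numpy as np

def lda_scores(x, means, cov, priors):
    """delta_k = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k;
    the predicted class is the argmax over k."""
    inv = np.linalg.inv(cov)
    return np.array([x @ inv @ mu - 0.5 * mu @ inv @ mu + np.log(pi)
                     for mu, pi in zip(means, priors)])
```

With an identity covariance and balanced priors, a point sitting on one class mean scores highest for that class.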
We first load necessary packages.
1  %config InlineBackend.figure_format = 'retina' 
Now we define a new class called LDA
with a predict
(in fact also predict_prob
) method.
1  class LDA: 
Then we define three classes of 2D input \(\boldsymbol{X}\) and pass them to the classifier. The original as well as the predicted distributions are plotted, with the accuracy printed below.
1  np.random.seed(2) 
Training accuracy: 95.67%
For data with an inherent matrix form, like the electroencephalogram (EEG) data introduced in Jie Su (2018), the classical LDA is not the most appropriate solution since it forcibly requires vector input. To use LDA for classification on such datasets we have to vectorize the matrices and potentially lose some critical structural information. The authors of this paper invented the new method called Regularized Matrix Discriminant Analysis (RMDA) that naturally takes matrix input. Furthermore, noticing that inverting the large matrix \(\boldsymbol{S}_w\) in high dimensions can be computationally burdensome, they adopted the Alternating Direction Method of Multipliers (ADMM) to iteratively optimize the objective instead of the widely used Singular Value Decomposition (SVD). A graphical comparison of RMDA with LDA is as follows.
The algorithm is implemented below. Notice that here I skipped the Gradient Descent (GD) approach for the minimization during iterations and opted for the minimize function in scipy.optimize. I did so to make the structure simpler without hurting the understanding of the whole algorithm. For a more detailed illustration please refer to the original paper.
Again we first define the class RMDA
. The predict
method now takes a matrix.
1  class RMDA: 
Then we train the model and print the final accuracy.
1  np.random.seed(2) 
Optimization converged successfully.
Training accuracy: 87.00%
Further analysis and debugging are still expected. Any correction in the comments is also welcome. 😇
This is the fifth post on optimal order execution. Based on Almgren and Chriss (2000), today we attempt to estimate the market impact coefficient \(\eta\). Specifically, for highfrequency transaction data, we have the approximation \(dS = \eta\cdot dQ\) and thus can easily estimate it by the method of Ordinary Least Squares (OLS), using the message book data provided by LOBSTER.
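Before touching the data, the OLS estimator through the origin for \(dS = \eta\cdot dQ\) can be written in closed form (the helper name is mine):

```python
import numpy as np

def estimate_eta(dQ, dS):
    """Least-squares slope through the origin: eta_hat = sum(dQ*dS) / sum(dQ**2)."""
    dQ = np.asarray(dQ, dtype=float)
    dS = np.asarray(dS, dtype=float)
    return float(dQ @ dS) / float(dQ @ dQ)
```

This is exactly what the `dS ~ dQ + 0` formula below asks statsmodels to fit.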
We first explore the message book of Apple Inc. (symbol: AAPL
) from 09:30 to 16:00 on June 21, 2012.
1  import pandas as pd 
According to the instructions by LOBSTER, the columns of the message book are defined as follows:

- Type: 1 means submission of a new limit order; 2 means cancellation (partial deletion of a limit order); 3 means deletion (total deletion of a limit order); 4 means execution of a visible limit order; 5 means execution of a hidden limit order; 7 means a trading halt indicator (detailed information below)
- Direction: -1 means a sell limit order; 1 means a buy limit order

message = pd.read_csv('data/AAPL_20120621_34200000_57600000_message_1.csv', header=None)
| | time | type | id | size | price | direction |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 34200.004241 | 1 | 16113575 | 18 | 585.33 | 1 |
| 1 | 34200.025552 | 1 | 16120456 | 18 | 585.91 | 1 |
| 2 | 34200.201743 | 3 | 16120456 | 18 | 585.91 | 1 |
| 3 | 34200.201781 | 3 | 16120480 | 18 | 585.92 | 1 |
| 4 | 34200.205573 | 1 | 16167159 | 18 | 585.36 | 1 |
message_place = message[message.type == 1]
Index(['time_x', 'type_x', 'id', 'size_x', 'price_x', 'direction_x', 'time_y', 'type_y', 'size_y', 'price_y', 'direction_y'], dtype='object')
1  df = message_temp[['id', 'time_x', 'time_y', 'size_y', 'price_x', 'direction_x']] 
(15099, 7)
Here I defined a function impact to calculate the market impact (reflected in the price deviation), such that for each successful execution, we calculate the price change after the same duration as the order's.
1  def impact(idx): 
1  df['impact'] = [impact(i) for i in df.index] 
| | 0 | 1 | 2 | 3 | ... | 2459 | 2460 | 2461 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dQ | 1.0 | 10.0 | 9.00 | 40.00 | ... | 1.00 | 100.00 | 100.00 |
| dS | 0.2 | 0.2 | 0.03 | 0.19 | ... | 0.08 | 0.08 | 0.03 |
1  fig = plt.figure(figsize=(14, 6)) 
1  res = sm.ols(formula='dS ~ dQ + 0', data=df_reg).fit() 
OLS Regression Results (Dep. Variable: dS)

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| dQ | 0.0005 | 2.37e-05 | 20.680 | 0.000 | 0.000 | 0.001 |

R-squared: 0.148 (adj. 0.148); F-statistic: 427.7 (Prob 1.01e-87); No. Observations: 2459; Omnibus: 2646.045; Durbin-Watson: 1.287; Jarque-Bera (JB): 323199.873; Skew: 5.154; Kurtosis: 58.210; Cond. No. 1.00.
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Apparently there are several outliers that result in a low \(R^2\). Here we remove the outliers lying outside three standard deviations.
1  df_reg_no = df_reg[((df_reg.dQ  df_reg.dQ.mean()).abs() < df_reg.dQ.std() * 3) & 
1  fig = plt.figure(figsize=(14, 6)) 
1  res = sm.ols(formula='dS ~ dQ + 0', data=df_reg_no).fit() 
OLS Regression Results (Dep. Variable: dS)

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| dQ | 0.0006 | 1.97e-05 | 31.710 | 0.000 | 0.001 | 0.001 |

R-squared: 0.296 (adj. 0.295); F-statistic: 1005. (Prob 1.45e-184); No. Observations: 2397; Omnibus: 356.596; Durbin-Watson: 1.108; Jarque-Bera (JB): 567.767; Skew: 1.012; Prob(JB): 5.14e-124; Kurtosis: 4.259; Cond. No. 1.00.
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
So we conclude \(\hat{\eta}_{\text{AAPL}}=0.0006\) for the underlying time span. But what about other companies? The coefficients are expected to vary widely, which is exactly the worst case we would like to avoid.
We first define a function estimate
to automate what we've done above.
def estimate(symbol):
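The body of `estimate` is truncated above. Below is a minimal sketch of its regression core, assuming the per-symbol data has already been loaded into a DataFrame with columns `dQ` and `dS` (the post's `estimate(symbol)` presumably also wraps the data retrieval; `estimate_core` is my own name):

```python
import statsmodels.formula.api as smf

def estimate_core(df_reg, n_std=3):
    """Fit dS ~ dQ + 0 after removing n_std-sigma outliers in dQ and dS.

    df_reg: DataFrame with columns dQ (order-flow imbalance) and dS
    (price change); returns the fitted statsmodels results object.
    """
    mask = ((df_reg.dQ - df_reg.dQ.mean()).abs() < df_reg.dQ.std() * n_std) & \
           ((df_reg.dS - df_reg.dS.mean()).abs() < df_reg.dS.std() * n_std)
    return smf.ols(formula='dS ~ dQ + 0', data=df_reg[mask]).fit()
```

The fitted coefficient is then read off as `res.params['dQ']`, which is the \(\hat{\eta}\) reported for each ticker below.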
The estimation for Microsoft Corp. (symbol: MSFT
) is as follows.
estimate('MSFT')
OLS Regression Results (Date: Sat, 12 May 2018, Time: 14:04:51)

Dep. Variable: dS, No. Observations: 1859, Df Residuals: 1858, Df Model: 1
R-squared: 0.229 (adj. 0.228), F-statistic: 550.7, Prob (F-statistic): 7.20e-107
Log-Likelihood: 5732.8, AIC: -1.146e+04, BIC: -1.146e+04

|      | coef      | std err  | t      | P>\|t\| | [0.025  | 0.975]   |
| ---- | --------- | -------- | ------ | ------- | ------- | -------- |
| dQ   | 1.859e-05 | 7.92e-07 | 23.467 | 0.000   | 1.7e-05 | 2.01e-05 |

Omnibus: 201.842, Durbin-Watson: 0.778, Jarque-Bera (JB): 381.770, Prob(JB): 1.26e-83, Skew: 0.703, Kurtosis: 4.719, Cond. No.: 1.00
The estimation for Amazon.com, Inc. (symbol: AMZN
) is as follows.
estimate('AMZN')
OLS Regression Results (Date: Sat, 12 May 2018, Time: 14:06:56)

Dep. Variable: dS, No. Observations: 791, Df Residuals: 790, Df Model: 1
R-squared: 0.294 (adj. 0.293), F-statistic: 328.9, Prob (F-statistic): 1.02e-61
Log-Likelihood: 809.19, AIC: -1616, BIC: -1612

|      | coef   | std err  | t      | P>\|t\| | [0.025 | 0.975] |
| ---- | ------ | -------- | ------ | ------- | ------ | ------ |
| dQ   | 0.0007 | 3.74e-05 | 18.136 | 0.000   | 0.001  | 0.001  |

Omnibus: 141.501, Durbin-Watson: 1.022, Jarque-Bera (JB): 250.801, Prob(JB): 3.46e-55, Skew: 1.083, Kurtosis: 4.709, Cond. No.: 1.00
The estimation for Alphabet Inc. (symbol: GOOG
) is as follows.
estimate('GOOG')
OLS Regression Results (Date: Sat, 12 May 2018, Time: 14:07:20)

Dep. Variable: dS, No. Observations: 450, Df Residuals: 449, Df Model: 1
R-squared: 0.419 (adj. 0.418), F-statistic: 324.2, Prob (F-statistic): 5.96e-55
Log-Likelihood: 169.55, AIC: -337.1, BIC: -333.0

|      | coef   | std err  | t      | P>\|t\| | [0.025 | 0.975] |
| ---- | ------ | -------- | ------ | ------- | ------ | ------ |
| dQ   | 0.0017 | 9.57e-05 | 18.005 | 0.000   | 0.002  | 0.002  |

Omnibus: 48.913, Durbin-Watson: 1.331, Jarque-Bera (JB): 61.896, Prob(JB): 3.63e-14, Skew: 0.864, Kurtosis: 3.563, Cond. No.: 1.00
The estimation for Intel Corp. (symbol: INTC
) is as follows.
estimate('INTC')
OLS Regression Results (Date: Sat, 12 May 2018, Time: 14:08:47)

Dep. Variable: dS, No. Observations: 1429, Df Residuals: 1428, Df Model: 1
R-squared: 0.237 (adj. 0.237), F-statistic: 444.2, Prob (F-statistic): 4.52e-86
Log-Likelihood: 4480.8, AIC: -8960, BIC: -8954

|      | coef      | std err  | t      | P>\|t\| | [0.025   | 0.975]   |
| ---- | --------- | -------- | ------ | ------- | -------- | -------- |
| dQ   | 2.275e-05 | 1.08e-06 | 21.076 | 0.000   | 2.06e-05 | 2.49e-05 |

Omnibus: 164.136, Durbin-Watson: 0.716, Jarque-Bera (JB): 284.351, Prob(JB): 1.79e-62, Skew: 0.762, Kurtosis: 4.566, Cond. No.: 1.00
In sum, the market-impact coefficients are generally significant but do not lead to high \(R^2\) values, which suggests the linear assumption may be too strong. It is also noteworthy that \(\hat{\eta}\) varies widely across companies (let alone industries or equity types), which means we cannot use one estimate as a general benchmark in production.
Today we implement the order placement strategy of Almgren and Chriss (2000), s.t. for a given order size \(Q\) we can estimate the probability of performing the paper's optimal strategy within a time horizon \(T\).
It is tolerable^{[1]} in HFT to assume that the stock price evolves according to the discrete-time arithmetic Brownian motion:
\[\begin{cases}dS(t) = \mu dt + \sigma dW(t),\\\\dQ(t) = \dot{Q}(t)dt\end{cases}\]where \(Q(t)\) is the quantity of stock we still need to order at time \(t\). Now let \(\eta\) denote the linear coefficient for temporary market impact, and let \(\lambda\) denote the penalty coefficient for risks. To minimize the cost function
\[C = \eta \int_0^T \dot{Q}^2(t) dt + \lambda\sigma\int_0^T Q(t) dt\]
we have the unique solution given by
\[Q^*(t) = Q\cdot \left(1 - \frac{t}{T^*}\right)^2\]
where \(Q\equiv Q(0)\) is the total and initial quantity to execute, and the optimal liquidation horizon \(T^*\) is given by
\[T^* = \sqrt{\frac{4Q\eta}{\lambda\sigma}}.\]
Here, \(\eta\) and \(\lambda\) are exogenous parameters and \(\sigma\) is estimated from the price time series (see the previous post) within \(K\) time units, given by
\[\hat{\sigma}^2 = \frac{\sum_{i=1}^n (\Delta_i - \hat{\mu}_{\Delta})^2}{(n-1)\tau}\]
where \(\\{\Delta_i\\}\) are the first-order differences of the stock price sampled with period \(\tau\), \(n\equiv\lfloor K / \tau\rfloor\) is the length of the array, and
\[\hat{\mu}_{\Delta} = \frac{\sum_{i=1}^n \Delta_i}{n}.\]
Notice that \(\hat{\sigma}^2\) is proved asymptotically normal with variance
\[Var(\hat{\sigma}^2) = \frac{2\sigma^4}{n}.\]
Now that we know
\[\hat{\sigma}^2 \equiv \frac{16Q^2\eta^2}{\lambda^2 \hat{T}^4} \overset{d}{\to}\mathcal{N}\left(\sigma^2, \frac{2\sigma^4}{n}\right)\]
which yields
\[\frac{16Q^2\eta^2}{\lambda^2\hat{\sigma}^2\hat{T}^4}\overset{d}{\to}\mathcal{N}\left(1, \frac{2}{n}\right),\]
to keep consistency of parameters, with \(n\equiv \lfloor K/\tau\rfloor \to\infty\) we can also write
\[\frac{16Q^2\eta^2}{\lambda^2\hat{\sigma}^2\hat{T}^4}\overset{d}{\to}\mathcal{N}\left(1, \frac{2\tau}{K}\right).\]
with which we can estimate the probability of successful strategy performance. Specifically, the execution strategy is given above, and the expected cost of trading is
\[C^* = \eta \int_0^{T^*} \left(\frac{2Q}{T^*}\left(1 - \frac{t}{T^*}\right)\right)^2 dt + \lambda\sigma\int_0^{T^*} Q\cdot \left(1 - \frac{t}{T^*}\right)^2 dt = \frac{4\eta Q^2}{3T^*} + \frac{\lambda \sigma QT^*}{3} = \frac{4}{3}\sqrt{\eta\lambda\sigma Q^3}.\]
import numpy as np
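The computation cell above is truncated. Below is a minimal sketch of how such numbers can be produced; all parameter values (\(Q\), \(\eta\), \(\lambda\), \(\hat{\sigma}^2\), \(\tau\), \(K\)) are my own placeholders rather than the post's exact inputs, and reading the asymptotic normal result above as the success criterion is my interpretation:

```python
import numpy as np
from scipy.stats import norm

# Placeholder inputs -- illustrative, not the post's exact values
Q, eta, lam = 10.0, 0.05, 0.1
sigma2 = 0.0036        # estimated price variance (from the previous post)
tau, K = 1.0, 100.0    # sampling period and estimation window length

sigma = np.sqrt(sigma2)
T_star = np.sqrt(4 * Q * eta / (lam * sigma))           # optimal horizon T*
C_star = 4.0 / 3.0 * np.sqrt(eta * lam * sigma * Q**3)  # expected optimal cost C*

def success_prob(T):
    """P(the optimal horizon fits within T), using the asymptotic
    16 Q^2 eta^2 / (lam^2 sigma^2 T^4) ~ N(1, 2 tau / K)."""
    stat = 16 * Q**2 * eta**2 / (lam**2 * sigma2 * T**4)
    return norm.cdf((1.0 - stat) / np.sqrt(2.0 * tau / K))

print(C_star, T_star, success_prob(1.1 * T_star))
```

At \(T = \hat{T}\) the statistic is exactly 1, so `success_prob` returns 0.5 there and increases as more time is allotted.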
(1.465147881156472, 0.8431842483948604)
which means there is a probability of 84.3% that we can perform our order placement strategy of size 10 within 3.6405 time units, at a minimized trading cost of 1.47.
How to estimate the parameters of a geometric Brownian motion (GBM)? It seems rather simple but actually took me quite some time to solve it. The most intuitive way is by using the method of moments.
First let us consider a simpler case, an arithmetic Brownian motion (ABM). The evolution is given by
\[dS = \mu dt + \sigma dW.\]
By integrating both sides over \((t,t+T]\) we have
\[\Delta \equiv S(t+T) - S(t) = \mu T + \sigma W(T)\]
which follows a normal distribution with mean \(\mu T\) and variance \(\sigma^2 T\). (Unlike the GBM case below, no \(-\sigma^2/2\) correction appears here, since we difference \(S\) itself rather than \(\ln S\).) That is to say, given \(T\) and i.i.d. observations \(\\{\Delta_1,\Delta_2,\ldots,\Delta_n\\}\) for different \(t\) values^{[1]}, with sample mean
\[\hat{\mu}_{\Delta} = \frac{\sum_{i=1}^n\Delta_i}{n}\overset{p}{\to}\mu T\]
and modified sample variance
\[\hat{\sigma}_{\Delta}^2 = \frac{\sum_{i=1}^n (\Delta_i - \hat{\mu}_{\Delta})^2}{n-1} \overset{p}{\to} \sigma^2 T,\]
we have an unbiased estimator for \(\mu\)
\[\hat{\mu} = \frac{\hat{\mu}_{\Delta}}{T}\]
and for \(\sigma^2\) we have
\[\hat{\sigma}^2 = \frac{\hat{\sigma}_{\Delta}^2}{T}.\]
Now we prove the consistency. First we consider the variance of \(\hat{\mu}_{\Delta}\)
\[Var(\hat{\mu}_{\Delta}) = \frac{Var(\Delta_1)}{n} = \frac{\sigma^2 T}{n}\]
and, using the normality of the \(\Delta_i\), the variance of \(\hat{\sigma}_{\Delta}^2\)
\[Var(\hat{\sigma}_{\Delta}^2) = E(\hat{\sigma}_{\Delta}^4) - E(\hat{\sigma}_{\Delta}^2)^2 = \frac{2\sigma^4T^2}{n-1} \approx \frac{2\sigma^4T^2}{n}.\]
The variance of \(\hat{\mu}\) is therefore given by
\[Var(\hat{\mu}) = \frac{Var(\hat{\mu}_{\Delta})}{T^2} = \frac{\sigma^2}{nT}\]
and the variance of \(\hat{\sigma}^2\) is given by
\[Var(\hat{\sigma}^2) = \frac{Var(\hat{\sigma}_{\Delta}^2)}{T^2} \approx \frac{2\sigma^4}{n}.\]
So the two estimators are also both consistent. It should be noticed that there exists a certain "trade-off" between the efficiency of \(\hat{\mu}_{\Delta}\) and \(\hat{\sigma}_{\Delta}^2\) as we vary the value of \(T\).
For a general GBM with drift \(\mu\) and diffusion \(\sigma\), we have the SDE
\[\frac{dS}{S} = \mu dt + \sigma dW,\]
so by Itô's lemma we can integrate^{[2]} both sides over \((t,t+T]\) for any \(t\) and get
\[\Delta \equiv \ln S(t+T) - \ln S(t) = \left(\mu - \frac{\sigma^2}{2}\right) T + \sigma W(T).\]
The rest of the derivation carries over, except that now \(\hat{\mu}_{\Delta}\overset{p}{\to}(\mu-\sigma^2/2)T\), so the unbiased drift estimator becomes
\[\hat{\mu} = \frac{2\hat{\mu}_{\Delta} + \hat{\sigma}_{\Delta}^2}{2T},\quad Var(\hat{\mu}) = \frac{4Var(\hat{\mu}_{\Delta}) + Var(\hat{\sigma}_{\Delta}^2)}{4T^2} = \frac{\sigma^2(2 + \sigma^2 T)}{2nT},\]
while \(\hat{\sigma}^2 = \hat{\sigma}_{\Delta}^2/T\) is unchanged.
Now we numerically validate these estimators with a Monte Carlo simulation.
import numpy as np
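The validation cell is truncated above. Here is a minimal sketch of the experiment for the GBM case; the parameter values (\(\mu\), \(\sigma^2\), \(T\), \(n\), number of replications) are chosen by me purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, T, n = 0.002, 0.0036, 50.0, 200  # illustrative parameters
sigma = np.sqrt(sigma2)

def estimate_gbm(deltas, T):
    """Method-of-moments estimators from n log-increments over horizon T."""
    m, s2 = deltas.mean(), deltas.var(ddof=1)
    return (2 * m + s2) / (2 * T), s2 / T    # (mu_hat, sigma2_hat)

# Each replication draws n i.i.d. increments
# ln S(t+T) - ln S(t) ~ N((mu - sigma2/2) T, sigma2 T)
reps = 2000
draws = rng.normal((mu - sigma2 / 2) * T, sigma * np.sqrt(T), size=(reps, n))
ests = np.array([estimate_gbm(d, T) for d in draws])
print(ests.mean(axis=0))   # should approach (mu, sigma2)
```

Averaging the estimates over replications and comparing their empirical variances to the closed-form variances above produces a comparison of the kind tabulated below.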
| Statistics      | Monte Carlo  | Method of moments | P-value  |
| --------------- | ------------ | ----------------- | -------- |
| E(mu_hat)       | 1.994533e-03 | 2.000000e-03      | 0.222191 |
| Var(mu_hat)     | 4.010866e-07 | 3.924000e-07      |          |
| E(sigma2_hat)   | 3.596733e-03 | 3.600000e-03      | 0.201573 |
| Var(sigma2_hat) | 1.308537e-07 | 1.296000e-07      |          |
Now we may safely apply this estimation in applications.
Here I'm trying to write something partly based on Cont's first model in the previous post. I plan to skip the Laplace transform and go for Monte Carlo simulation. Also, I'm trying to abandon the assumption of uniform order sizes. To implement that, I need to shift from a Markov chain, which is supported on discrete spaces, to some other stochastic process that is estimable. Moreover, although I considered supervised learning for this problem, I ultimately gave it up. This is because my model is inherently designed for high-frequency trading, and training for several minutes each time would be intolerable.
import smm
I need smm
for multivariate stochastic processes, and scipy.optimize
for maximum likelihood estimation.
def retrieve_data(date):
|   | time | ask_price_1 | ask_price_10 | ask_price_100 | ask_price_101 | ask_price_102 | ask_price_103 | ask_price_104 | ask_price_105 | ask_price_106 | … | bid_vol_90 | bid_vol_91 | bid_vol_92 | bid_vol_93 | bid_vol_94 | bid_vol_95 | bid_vol_96 | bid_vol_97 | bid_vol_98 | bid_vol_99 |
| - | ---- | ----------- | ------------ | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | - | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| 1 | 2018-01-29 00:00:06.951631+08:00 | 12688.00 | 12663.58 | 12391.48 | 12390.00 | 12389.96 | 12388.00 | 12384.22 | 12381.39 | 12380.00 | … | 6.0 | 15.0 | 1.0 | 460.0 | 4.0 | 121.0 | 5.0 | 1.0 | 5.0 | 120.0 |
| 2 | 2018-01-29 00:00:07.792882+08:00 | 12676.93 | 12657.04 | 12391.48 | 12390.00 | 12389.96 | 12388.00 | 12384.22 | 12381.39 | 12380.00 | … | 1.0 | 400.0 | 363.0 | 5.0 | 6.0 | 15.0 | 1.0 | 460.0 | 4.0 | 121.0 |
| 3 | 2018-01-29 00:00:08.702945+08:00 | 12643.27 | 12617.26 | 12361.27 | 12360.00 | 12359.38 | 12358.06 | 12356.22 | 12355.44 | 12354.17 | … | 6.0 | 15.0 | 1.0 | 460.0 | 4.0 | 121.0 | 5.0 | 1.0 | 5.0 | 120.0 |
| 4 | 2018-01-29 00:00:10.998615+08:00 | 12666.00 | 12642.73 | 12380.00 | 12377.00 | 12374.99 | 12369.73 | 12366.43 | 12365.84 | 12361.45 | … | 460.0 | 4.0 | 121.0 | 5.0 | 1.0 | 5.0 | 120.0 | 150.0 | 12.0 | 97.0 |
| 5 | 2018-01-29 00:00:11.742304+08:00 | 12674.00 | 12643.27 | 12384.22 | 12381.39 | 12380.00 | 12377.00 | 12374.99 | 12369.73 | 12366.43 | … | 4.0 | 121.0 | 5.0 | 60.0 | 1.0 | 5.0 | 120.0 | 150.0 | 12.0 | 97.0 |
Larger index means smaller values for both bid and ask prices. This is unconventional, so I reindexed the variables s.t. bid_1
and ask_1
correspond to the best quoted prices on each side.
def rename_index(s):
variables = list(data.columns[1:])
I dropped the time
variable simply because I don't know how to use it. Normally there are two ways to handle uneven time grids: resampling and ignoring, and I chose the latter.
def plot_lob(n, t, theme='w'):
Now we make a plot of the order book within the past 10 steps, including 20 bid levels and 20 ask levels.
n, t = 20, 10
Not sure if it tells any critical information. Let's make another plot. This time \(t=500\) and we only consider the best bid and ask orders.
fig = plt.figure(figsize=(12, 6))
price = data[[f'bid_price_{i}' for i in range(n, 0, -1)] + [f'ask_price_{i}' for i in range(1, n + 1)]]
A simple idea would be to input the prices and volumes of the current order book and predict future mid prices. Ideally, we would also get a rough expectation of the minimum time for the mid price to cross a certain level, or of the time needed before my order gets executed successfully.
change = []
The calculation of change
took over 10 minutes. I don't think it's gonna be useful in real work. However, it's not so bad an idea to save it somewhere locally in case I need it later.
change = pd.DataFrame(np.array(change).astype(int), columns=vol.columns)
change = pd.read_csv(f'data/change_{date}.csv', index_col=0)
After some research, I decided to fit the data in change
to Student's t-distribution, the Skellam distribution, and a two-sided Weibull distribution. I'll elaborate below on why I chose each distribution and how to estimate it.
First is the t-distribution. It is well known for its leptokurtosis, which makes it a better alternative to the Normal distribution for many financial time series. The PDF and CDF of the t-distribution involve the Gamma function and thus would be computationally troublesome when we want to compute maximum likelihood estimates of the parameters. However, notice that for any r.v. \(X\sim t(\nu,\mu,\sigma)\) we have the relationships
\[\text{Var}(X) = \begin{cases}\frac{\sigma^2\nu}{\nu - 2} & \text{for }\nu > 2,\\\infty & \text{for }1 < \nu \le 2,\\\text{undefined} & \text{otherwise}\end{cases}\]
and
\[\text{Kur}_+(X) = \begin{cases}\frac{6}{\nu - 4} & \text{for }\nu > 4,\\\infty & \text{for }2 < \nu \le 4,\\\text{undefined} & \text{otherwise}\end{cases}\]
where \(\text{Kur}_+\equiv \text{Kur} - 3\) is the excess kurtosis. Hence we can simply use moment estimation for the t-distribution via the empirical variance and kurtosis.
Second, the Skellam distribution. This is mainly due to the original model used in Cont's paper, where he assumes Poisson order arrivals uniformly over time. Here I slightly improve the model s.t. bid and ask orders are modelled at the same time and represented by the r.v. \(S\equiv P_a - P_b\) where \(P_a\sim Pois(\lambda_a)\) and \(P_b\sim Pois(\lambda_b)\). This is therefore a discrete distribution with two parameters. scipy.stats
has its PMF implemented and all I need to do is numerically maximize the likelihood.
For the twosided Weibull distribution, it is given by
\[Y \sim \begin{cases}-\text{Weibull}(\lambda_1, k_1) & \text{if } Y < 0,\\\text{Weibull}(\lambda_2, k_2) & \text{otherwise}\end{cases}\]
where shape parameters \(k_{1,2} > 0\) and scale parameters \(\lambda_{1,2} > 0\).
Therefore, the pdf is
\[f(y \mid \lambda_1, k_1, \lambda_2, k_2) = \begin{cases}\left(\frac{-y}{\lambda_1}\right)^{k_1 - 1}\exp\left(-\left(\frac{-y}{\lambda_1}\right)^{k_1}\right) & \text{if } y < 0,\\\left(\frac{y}{\lambda_2}\right)^{k_2 - 1}\exp\left(-\left(\frac{y}{\lambda_2}\right)^{k_2}\right) & \text{otherwise}\end{cases}\]
and to normalize the integration to \(1\), we also have
\[\frac{\lambda_1}{k_1} + \frac{\lambda_2}{k_2} = 1 \Rightarrow \lambda_2 = k_2 (1 - \lambda_1 / k_1)\]
which means there're in fact only three parameters to estimate.
Now we rewrite the log-likelihood as
\[\begin{align\*}LL = \sum_{i=1}^n \log f(y_i) = \sum_{i=1}^n &\left((k_1-1)(\log^\*(-y_i) - \log^\*(\lambda_1)) - (-y_i / \lambda_1)^{k_1}\right)\mathbb{I}_{y_i < 0} + \\ &\left((k_2-1)(\log^\*(y_i) - \log^\*(\lambda_2)) - (y_i / \lambda_2)^{k_2}\right)\mathbb{I}_{y_i \ge 0}\end{align\*}\]
where we have the special \(\log^*(y)\equiv 0\) if \(y\le0\).
i = 15  # take ask_15 for example
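The fitting cell is truncated above. Below is a sketch of the two estimations that survived: the moment-based t fit and a Skellam MLE via scipy (the function names are my own, and the log-parametrization in the MLE is one convenient way to keep the rates positive):

```python
import numpy as np
from scipy import optimize, stats

def fit_t_moments(x):
    """Moment fit for a location-scale t: nu from excess kurtosis,
    scale from the relation Var = scale^2 * nu / (nu - 2)."""
    x = np.asarray(x)
    kur = stats.kurtosis(x)                    # Fisher (excess) kurtosis
    nu = 6.0 / kur + 4.0 if kur > 0 else np.inf
    var = x.var()
    scale2 = var * (nu - 2.0) / nu if np.isfinite(nu) else var
    return x.mean(), np.sqrt(scale2), nu

def fit_skellam_mle(x):
    """MLE for integer-valued increments x ~ Skellam(lambda_a, lambda_b)."""
    def nll(p):
        la, lb = np.exp(p)                     # log-parametrize: rates stay > 0
        return -stats.skellam.logpmf(x, la, lb).sum()
    res = optimize.minimize(nll, x0=[0.0, 0.0], method='Nelder-Mead')
    return np.exp(res.x)
```

`stats.skellam` supplies the PMF, so the Skellam fit really is just a numerical likelihood maximization, as described above.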
As coded above, I ultimately did not include the two-sided Weibull distribution because its optimization did not converge. In conclusion, for changes of order sizes (denoted by \(x\)), we use a modified t-distribution with
\[\hat{\mu} = \bar{x},\quad \hat{\sigma} = 0.3 \cdot \sqrt{\widehat{\text{Var}}(x)} + 0.7 \cdot \sqrt{2 - \frac{6}{6 + 2\,\widehat{\text{Kur}}_+(x)}}\]
and
\[\hat{\nu} = \frac{6}{\widehat{\text{Kur}}_+(x)} + 4\]
where
\[\widehat{\text{Kur}}_+(x) = \widehat{\text{Kur}}(x) - 3\]
while
\[\widehat{\text{Kur}}(x) = \hat{m}_4(x) / \hat{m}_2^2(x)\]
and
\[\hat{m}_4 = \sum_{i=1}^n (x_i - \bar{x})^4 / n,\quad \hat{m}_2 = \sum_{i=1}^n (x_i - \bar{x})^2 / n.\]
Now, when we assume independence across different buckets of order book, we can estimate the parameters of tdistributions as below.
params = np.zeros([2 * n, 3])
array([[ 5.1589201 ,  0.52536232, 11.05729   ],
       [ 5.86412495,  0.61454545, 12.08484143],
       [ 5.82376701,  4.61231884, 11.67543236],
       [ 6.28819815,  0.7173913 , 10.85941723],
       [ 6.89178374,  1.59927798, 11.25140225],
       [ 6.14231284,  2.29856115, 12.46686452],
       [ 6.4347771 ,  2.22302158, 13.73785226],
       [ 6.17737187,  0.67753623, 12.19098061],
       [ 5.9250571 ,  1.68231047, 12.54472066],
       [ 5.16886809,  0.69090909, 11.94199489],
       ...
       [ 5.94772822,  3.18181818, 12.4415555 ],
       [ 6.5157695 ,  4.62181818, 13.67098387],
       [ 6.69385395,  0.66304348, 13.63770319],
       [ 4.99329442,  1.11510791, 11.63780506],
       [ 5.04144977,  1.91756272, 11.20026029],
       [ 5.47054269,  4.34163701, 10.66971035],
       [ 5.11684414,  2.35460993,  9.98656422],
       [ 4.89130697,  1.07092199, 11.5511127 ],
       [ 5.31202782,  0.58865248, 11.01769165],
       [ 5.17908162,  2.16961131, 10.81368767]])
When we do not ignore the correlation across all buckets, a multivariate tdistribution must be considered. Similar to multivariate Normal distributions, here we need to estimate a covariance matrix, a vector of expectations and a vector of degrees of freedom. Notice the degrees of freedom do not vary significantly across the rows in params
. Since the degrees of freedom do not vary significantly across its rows, to accelerate computation I set a unified degree of freedom for all buckets, namely \(df = 7\). Using the Expectation-Maximization (EM) algorithm introduced by D. Peel and G. J. McLachlan (2000), I wrote the model below to estimate this distribution.
class MVT:
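The `MVT` class body is truncated above. Here is a minimal functional sketch of the EM iteration it implements, following the Peel-McLachlan E/M steps with a fixed degree of freedom; the function name and interface are my own:

```python
import numpy as np

def fit_mvt(X, df=7, n_iter=200, tol=1e-8):
    """EM for a multivariate t with fixed degrees of freedom df.

    X: (n, p) data matrix; returns (location vector, scale matrix)."""
    n, p = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        diff = X - mu
        prec = np.linalg.inv(cov)
        maha = np.einsum('ij,jk,ik->i', diff, prec, diff)  # Mahalanobis^2
        w = (df + p) / (df + maha)        # E-step: latent scale weights
        mu_new = w @ X / w.sum()          # M-step: weighted location
        diff = X - mu_new
        cov = (w[:, None] * diff).T @ diff / n
        if np.abs(mu_new - mu).max() < tol:
            mu = mu_new
            break
        mu = mu_new
    return mu, cov
```

Observations far from the center get down-weighted by `w`, which is what makes the fit robust to the heavy tails that motivated the t-distribution in the first place.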
Now the distribution for order size movement is estimated. We can simulate the trajectory and rebuild the order book several steps into the future. Specifically, notice the predicted movement may well change the shape of the order book while, by practical observation, the order book retains its "V" shape most of the time. Therefore, I separately re-sort both halves of the order book every time they are updated by a predicted order-size movement (or "co-movement", since it is a vector).
n_steps = 20
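The simulation cell is truncated above. Below is a self-contained sketch of the trajectory simulation just described; the starting depth profile, the scale, and the identity covariance of the stand-in multivariate t sampler are all my own illustrative assumptions (the real model would draw from the EM-fitted distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_steps = 20, 20

# Illustrative "V"-shaped starting book: depths grow away from the mid.
# Column order follows the post: bid_20 ... bid_1, then ask_1 ... ask_20.
book = np.concatenate([np.linspace(400.0, 50.0, n),   # bids, far -> near
                       np.linspace(50.0, 400.0, n)])  # asks, near -> far

def draw_comovement(df=7, scale=5.0):
    """Stand-in for the fitted multivariate t (identity scale matrix)."""
    g = rng.chisquare(df) / df
    return rng.standard_normal(2 * n) / np.sqrt(g) * scale

traj = [book.copy()]
for _ in range(n_steps):
    book = np.maximum(book + draw_comovement(), 0.0)  # sizes stay non-negative
    book[:n] = np.sort(book[:n])[::-1]  # re-sort bids: deep away from the mid
    book[n:] = np.sort(book[n:])        # re-sort asks: shallow near the mid
    traj.append(book.copy())
traj = np.array(traj)                   # (n_steps + 1, 2n) trajectory
```

Each row of `traj` is one simulated future snapshot of the book, which is exactly what the sketch plot below visualizes.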
Below is a simple sketch of this order book trajectory where I assign stronger color to the traces that are closer to the best (bid/ask) prices.
fig = plt.figure(figsize=(12, 6))
It can be seen from the figure that stronger traces are located more to the bottom, which validates our intuition since trades around the current price are more active than those to the left or the right of the order book.
With this prediction procedure implemented, we can estimate the probability of our order (placed at the price bucket order_idx
with size order_size
) being executed within n_steps
.
n_steps = 10
0.861
So a limit buy order at bid_8
(\(20 - 12 = 8\)) with size 100 can be executed within 10 steps with a probability of 86.1%. Moreover, we can even make a 3D surface plot to get a comprehensive idea of the whole distribution.
def evolve(order_idx, order_size, n_steps=10, n_sim=1000):
Today, I'll continue introducing papers about optimal order execution. In this post I'll mainly walk through six papers by Rama Cont, published between 2010 and 2018. Professor Cont is renowned for his in-depth research in stochastic analysis, stochastic processes and mathematical modeling in quantitative finance. He has written dozens of papers on order book dynamics, building rigorous mathematical models.
In this classic paper, the authors tried to model a real-world order book as a discrete-time Markov chain. The order book is evenly divided into several buckets of prices, where order sizes are recalculated s.t. positive sizes represent ask orders and negative sizes represent bid orders. Let's denote this order book by \(\boldsymbol{x}\in\mathbb{Z}^n\). Also, let \(\boldsymbol{x}_{p\pm 1} \equiv \boldsymbol{x} \pm \boldsymbol{e}^p\) where \(\boldsymbol{e}^p\in\mathbb{Z}^n\) is the \(p\)th base vector. Denote the best ask and bid prices by \(p^a\) and \(p^b\). By assuming unit-sized orders^{[1]} and conditioning on the inflow of new orders, the Markov state transitions can be described as below:
Furthermore, the authors assumed stationary Poisson arrivals for these inflows in each bucket. Arrival rate for limit orders \(\lambda(p)\) is an increasing function when \(p\) is smaller than the current price, and is decreasing when \(p\) is larger than the current price. Arrival rate for market orders is assumed to be constant \(\mu\), and arrival rate for order cancellations should by assumption be proportional to the current order size in the underlying bucket of the book.
Therefore, we have
The empirical performance of onestep ahead prediction is illustrated below.
It is easy to recognize that the underlying random walk is a birth-death process. Hence, we may opt for Laplace transforms to calculate the first-passage times of our model, i.e. the time before our order is successfully executed given that the mid price hasn't moved.
In this paper, Cont tried to model ultra-high-frequency (UHF) order books using fluid and diffusion models.
| Regime               | Time scale                 | Issues                           |
| -------------------- | -------------------------- | -------------------------------- |
| Ultra-high frequency | \(\sim 10^{-3}\)-\(1\) s   | Microstructure, latency          |
| High frequency       | \(\sim 10\)-\(10^2\) s     | Optimal execution                |
| Daily                | \(\sim 10^3\)-\(10^4\) s   | Trading strategies, option hedging |
By going from UHF to even more idealized data, where we assume tick size \(\to 0\), order arrival frequency \(\to\infty\) and order size \(\to 0\), we may apply various asymptotic theorems to analyze the order book dynamics in this extreme case. Different combinations of scaling assumptions are possible for the same process and might lead to very different limits. Specifically, when we assume that the variance vanishes asymptotically, the limit process is deterministic and often given by a PDE or ODE. This functional equivalent of the Law of Large Numbers is called the "fluid" or "hydrodynamic" limit, e.g.
\[\lambda_n^i\sim n\lambda^i,\quad \left(\frac{N_1^n - N_2^n}{n}, t\ge 0\right) \overset{n\to\infty}{\to} ((\lambda^1 - \lambda^2)t, t\ge 0).\]
Other scaling assumption can lead to a totally different limit, e.g. "random" or "diffusion" limit:
\[\lambda_n^i\sim n\lambda, \quad \lambda_n^1 - \lambda_n^2 = \sigma^2\sqrt{n},\quad \left(\frac{N_1^n - N_2^n}{\sqrt{n}}\right)\overset{n\to\infty}{\to}\sigma W.\]
Similar to the first paper, here Cont and de Larrard modelled the order book as a Markov chain where limit orders, market orders and order cancellations arrive following stationary Poisson processes. Specifically, in this paper the arrival rate of limit orders is constant, an assumption made in the hope of deriving closed-form results analytically. Between consecutive price changes, \(q_t^a\) and \(q_t^b\) are independent birth-death processes with birth rate \(\lambda\) and death rate \(\mu+\theta\). Define \(\sigma^a\) as the first-passage time of \(q_t^a\), and similarly \(\sigma^b\). Then, the time duration before the next price move is given by \(\tau = \min\\{\sigma^a, \sigma^b\\}\).
Conditional Laplace transform of \(\sigma^a\) solves
\[\mathcal{L}(s, x) = \text{E}(\exp(-s\sigma^a)\mid q_0^a = x) = \frac{\lambda \mathcal{L}(s, x+1) + (\mu+\theta)\mathcal{L}(s,x-1)}{\lambda+\mu+\theta+s}\]
which eventually gives
\[\mathcal{L}(s, x) = \left(\frac{(\lambda + \mu + \theta + s) - \sqrt{(\lambda + \mu + \theta + s)^2 - 4 \lambda (\mu + \theta)}}{2\lambda}\right)^x.\]
The distribution of \(\tau\) conditional on the current queue length is
\[\text{P}(\tau > t\mid q_0^a = x, q_0^b = y)= \text{P}(\sigma^a > t\mid q_0^a = x) \text{P}(\sigma^b > t\mid q_0^b = y)= \int_t^{\infty} \hat{\mathcal{L}}(u, x) du\int_t^{\infty} \hat{\mathcal{L}}(u, y) du\]
where the inverse Laplace transform \(\hat{\mathcal{L}}\) is given by
\[\hat{\mathcal{L}}(t, x) = \frac{x}{t} \sqrt{\left(\frac{\mu + \theta}{\lambda}\right)^x} I_x\left(2t\sqrt{\lambda(\theta + \mu)}\right)\exp(-t(\lambda + \mu + \theta)).\]
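This closed form is straightforward to evaluate numerically. A sketch with scipy, using illustrative rates I picked so that \(\lambda < \mu + \theta\) (the queues then deplete with probability one); the scaled Bessel function `ive` keeps the computation stable for large \(t\):

```python
import numpy as np
from scipy.special import ive
from scipy.integrate import quad

lam, mu, theta = 0.5, 0.6, 0.4   # illustrative rates (assumed), lam < mu + theta

def fp_density(t, x):
    """First-passage density of a queue of length x: the inverse Laplace
    transform above.  ive(x, z) = iv(x, z) * exp(-z), so we re-attach a
    combined exponent that is always <= 0 (no overflow)."""
    z = 2.0 * t * np.sqrt(lam * (mu + theta))
    return (x / t) * ((mu + theta) / lam) ** (x / 2.0) \
        * ive(x, z) * np.exp(z - t * (lam + mu + theta))

def prob_no_move(t, x, y):
    """P(tau > t | q0^a = x, q0^b = y): neither queue depletes before t."""
    pa = quad(fp_density, t, np.inf, args=(x,))[0]
    pb = quad(fp_density, t, np.inf, args=(y,))[0]
    return pa * pb
```

For \(x = 1\) this reduces to the classical M/M/1 busy-period density, and the density integrates to one whenever the death rate dominates the birth rate.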
In the case of heavy-traffic queueing systems, the order books tend to show "diffusive" dynamics. In the most extreme scenario \(\lambda = \mu + \theta\), we have
\[\left(\frac{s_t(n\log n)}{\sqrt{n}}\right)_{t\ge 0} \overset{d}{\to} \sqrt{\frac{\pi \lambda \delta^2}{D(f)}}B\]
where \(B\) is a Brownian motion, and
\[D(f)\equiv \left(\int_{\mathbb{R}_+^2} xy\ dF(x, y)\right)^{1/2}\]
is the geometric mean of the bid and ask queue lengths. This directly gives a diffusion process with variance
\[\sigma^2 = \delta^2 \frac{\pi\lambda}{D(f)}.\]
Interestingly, the formula does not require checking the stock price to estimate the volatility. Instead, all information it needs is from order flow statistics: arrival rate \(\lambda\), and average order size \(D(f)\).
Deep learning is leading the fashion in academia. At the just-finished 6th Imperial-ETH Workshop in Mathematical Finance, Cont introduced the Long Short-Term Memory (LSTM) network built with Sirignano, which is claimed to have defeated a range of other well-studied mathematical models (see below for the network structure). The model takes historical order book states as input and predicts the next price moves. Specifically, they used historical data from approximately 1,000 stocks traded on NASDAQ and trained the network asynchronously on over 500 GPUs. Results show a significant improvement in prediction accuracy from introducing long-term memory into the model and, moreover, a tendency toward universal effectiveness even for stocks out of sample.
# create virtualenv myenv
It is supported by the package hexo-filter-flowchart:
npm install --save hexo-filter-flowchart
You can configure your flowchart layout in the site's _config.yml
:
flowchart:
Raw code like
s=>start: log in; counter := 0
in a flow
code block, produces a nice flowchart as below:
I'm recently reading several papers on modelling stock order behavior and the corresponding optimal strategies. Compared with classical quantitative trading strategies, serious research on order optimization has a short history, as its major importance dazzles in high-frequency trading (HFT), one of the youngest children of finance. I'll wrap up the main ideas and methodologies below, paper by paper. Further irregular updates to this post are expected. Specifically, the models below try to answer this question: how much time should I expect before my limit order gets executed?
This is the lecture note for the course MFE230X at UC Berkeley. The note starts from the motivation of the sources of trading costs and a range of liquidity measures, including static ones like (realized) spreads, TWAP, volume, VWAP and POV, together with a dynamic measure, IS. With stock price \(S(t)\) following some predetermined evolution model and control variable \(Q(t)\), the quantity yet to trade up to time \(t\) (same direction, boundary conditions \(Q(0)=Q\) and \(Q(T)=0\)), the authors compared several strategies with the objective of minimizing the cost function over time. Specifically, it's proved that the following four strategies are both statically and dynamically optimal.
This is a model with market impact combined with "urgency in execution". (quoted from the lecture notes) The model assumes that stock price follows a diffusion process. The market impact \(dS(t) \equiv \eta Q'(t)\) on stock price, by assumption, decays instantaneously. Therefore, the cost function is
\[C_0 = \eta \int_0^T Q'(t)^2 dt.\]
Using EulerLagrange equation and boundary conditions, the optimal strategy is
\[Q(t) = Q\left(1  \frac{t}{T}\right).\]
Next, they added a penalty term for cost variance^{[1]}
\[\text{Var}(C_0) = \text{Var}\left(\int_0^T Q(t)dS(t)\right) = \sigma^2 \int_0^T Q(t)^2dt\]
and the new riskadjusted cost function is thus
\[C_1 = C_0 + \lambda \text{Var}(C_0) = \eta \int_0^T Q'(t)^2 dt + \lambda \sigma^2 \int_0^T Q(t)^2 dt\]
which gives the optimum, with \(\kappa\equiv\sqrt{\lambda\sigma^2/\eta}\)
\[Q(t) = Q\frac{\sinh\kappa(Tt)}{\sinh(\kappa T)}.\]
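A quick numeric check of this solution: with \(\kappa = \sqrt{\lambda\sigma^2/\eta}\), the schedule satisfies the Euler-Lagrange ODE \(Q'' = \kappa^2 Q\) together with the boundary conditions \(Q(0)=Q\), \(Q(T)=0\). A minimal sketch:

```python
import numpy as np

def schedule(t, Q, T, kappa):
    """Risk-averse optimal inventory Q(t) = Q sinh(kappa (T - t)) / sinh(kappa T)."""
    return Q * np.sinh(kappa * (T - t)) / np.sinh(kappa * T)

# Verify boundary conditions and the ODE Q'' = kappa^2 Q by finite differences
Q, T, kappa, h = 100.0, 1.0, 2.0, 1e-4
q = lambda t: schedule(t, Q, T, kappa)
second_deriv = (q(0.5 + h) - 2 * q(0.5) + q(0.5 - h)) / h**2
```

Larger \(\kappa\) (more risk aversion or volatility, less impact cost) front-loads the liquidation, which is the qualitative message of the model.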
This statically optimal solution is dynamically optimal and optimal liquidation time is \(\infty\).
Alternatively, we can also penalize average VaR instead of variance, and the cost function now becomes
\[C_2 = \eta \int_0^T Q'(t)^2 dt + \lambda\sigma\int_0^T Q(t)dt.\]
Now the optimal strategy is
\[Q(t) = Q\left(1 - \frac{t}{T}\right)^2\]
where
\[T = \sqrt{\frac{4Q\eta}{\lambda \sigma}}.\]
The solution is again also dynamically optimal. Note that now we have finite liquidation time.
They assumed arithmetic Brownian motion (ABM)^{[2]} and took time-averaged VaR as the risk term, ceteris paribus. The solution is
\[Q_{ABM}'(t) = -\frac{2Q}{T}\left(1 - \frac{t}{T}\right)\]
and the optimal liquidation time is
\[T = \sqrt{\frac{4Q}{\lambda S(0)}}.\]
Additionally, the solution under GBM assumption is
\[Q_{GBM}'(t) = -\frac{Q_{GBM}(t)}{T - t} - \frac{QS(t)}{T^2S(0)}(T - t).\]
It is worth noting that the two settings are not significantly different when \(\sigma^2T\ll 1\). Another interesting result is that in AlmgrenChriss style models, we can prove VWAP is always the optimal strategy.
In the paper by Obizhaeva and Wang, the price process is modelled with a resilience term:
\[S(t) = S(0) + \eta \int_0^t\exp(-\rho(t-s))Q'(s)ds + \sigma\int_0^t dZ(s)\]
which, instead of market impact proportional to actual quantity on the book, assumes impact linear in the rate of trading, or more specifically, the exponentially discounted quantity finished over the time. The expected cost is thus given by
\[C = \eta \int_0^T Q'(t)dt \int_0^t \exp(-\rho(t-s))Q'(s)ds.\]
To find the statically optimal policy, we have EulerLagrange:
\[\frac{d}{dt}\frac{\partial C}{\partial Q'} = 0 \Rightarrow \frac{\partial C}{\partial Q'} = K\]
where \(K\) is a constant. Functionally differentiating \(C\) w.r.t. \(Q'\) we obtain the Fredholm integral equation:
\[\frac{\partial C}{\partial Q'}(t) = \int_0^T\exp(-\rho|t-s|) Q'(s) ds = K\]
which gives a candidate bucket policy
\[Q'(t) = \frac{K}{2}(\delta(t) + \rho + \delta(t-T)).\]
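Reading the \(\delta\)-functions as discrete blocks, this policy trades a block at \(t=0\), a constant rate on \((0,T)\), and a block at \(t=T\); matching the total executed quantity to \(Q\) pins down the constant \(K\). A small sketch of that bookkeeping:

```python
def ow_schedule(Q, T, rho):
    """Obizhaeva-Wang-style policy: equal blocks at both endpoints plus a
    constant rate in between; the total executed quantity equals Q."""
    K = 2.0 * Q / (2.0 + rho * T)  # from 2*(K/2) + rho*T*(K/2) = Q
    block = K / 2.0                # impulse traded at t = 0 and at t = T
    rate = rho * K / 2.0           # continuous trading rate on (0, T)
    return block, rate
```

As the resilience \(\rho\) grows the continuous part dominates, while \(\rho \to 0\) degenerates to splitting the order into two equal blocks at the endpoints.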
In the paper by Alfonsi, Fruth and Schied^{[3]}, the aforementioned model by Obizhaeva and Wang was further analyzed. A more general policy was proposed which handles any shape of the order density function while keeping the original assumption of an exponential recovery scheme. Specifically, their models split the continuous ordering process into \(N\) smaller consecutive buckets (so in total \(N + 1\) orders). Here I'll only cover the optimal strategy from the first model in their paper.
With some distribution \(F\) for the ordering process given, suppose we have a function \(h:\mathbb{R}\to\mathbb{R}^+\) with
\[h(x) = F^{-1}(x) - \exp(-\rho T / N) F^{-1}(\exp(-\rho T / N)x)\]
is onetoone. Then there exists a unique optimal strategy \(\left\{Q_0, Q_1, \ldots, Q_N\right\}\), where the initial order is the unique solution of
\[F^{-1}(Q - NQ_0(1 - \exp(-\rho T / N))) = \frac{h(Q_0)}{1 - \exp(-\rho T / N)},\]
the intermediate orders are given by
\[Q_1 = Q_2 = \cdots = Q_{N - 1} = Q_0 (1 - \exp(-\rho T / N))\]
and the final order is determined by
\[Q_N = Q - Q_0 - (N - 1)Q_0(1 - \exp(-\rho T / N)).\]
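A small sketch of assembling this schedule, assuming \(Q_0\) has already been obtained from the implicit equation above (the value used below is purely hypothetical):

```python
import math

# Illustrative parameters; Q0 is a hypothetical solution of the
# implicit equation, not computed here.
Q, N, rho, T = 100_000.0, 10, 0.5, 10.0
a = 1 - math.exp(-rho * T / N)   # the recurring factor 1 - e^{-rho T / N}

Q0 = 12_000.0                    # hypothetical initial order
mids = [Q0 * a] * (N - 1)        # Q_1 = ... = Q_{N-1}
QN = Q - Q0 - (N - 1) * Q0 * a   # final order

schedule = [Q0] + mids + [QN]    # N + 1 orders summing to Q by construction
print(len(schedule), sum(schedule))
```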
Different from the papers above, here we introduce nonlinear transient market impact^{[4]}. There is a famous square-root liquidity model by Grinold and Kahn:
\[\Delta S = \text{Spread Cost} + \alpha \sigma \sqrt{\frac{\text{Trade Size}}{\text{Daily Volume}}}.\]
The square-root price process, therefore, is
\[S(t) = S(0) + \frac{3}{4}\sigma\int_0^t\sqrt{\frac{Q'(s)}{V}}\frac{ds}{\sqrt{t-s}} + \sigma\int_0^t dZ(s)\]
where we've assumed a kernel \(G(t)=t^{-1/2}\). Using this market impact model, we may numerically compare different execution schedules under a more empirically tested framework. However, due to the concavity of \(\sqrt{Q'(t)/V}\), there is no optimal solution for this model.
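As an illustration of such a numerical comparison, the rough sketch below evaluates the expected impact cost of two schedules under the square-root model, discretizing the singular kernel naively at grid midpoints. All parameter values are made up.

```python
import numpy as np

# Midpoint time grid; sigma, V, T, n are illustrative only.
sigma, V, T, n = 0.02, 1e6, 1.0, 400
t = (np.arange(n) + 0.5) * T / n
dt = T / n

def impact_cost(rate):
    """Approximate expected impact cost of a trading-rate profile Q'(t),
    given on the grid, under S-impact (3/4) sigma sqrt(Q'/V) * (t-s)^{-1/2}."""
    cost = 0.0
    for i in range(1, n):
        decay = 1.0 / np.sqrt(t[i] - t[:i])
        impact = 0.75 * sigma * np.sum(np.sqrt(rate[:i] / V) * decay) * dt
        cost += rate[i] * impact * dt
    return cost

Q = 1e5
vwap = np.full(n, Q / T)            # constant-rate (VWAP) schedule
front = 2 * (Q / T) * (1 - t / T)   # linearly decreasing (front-loaded) rate
print(impact_cost(vwap), impact_cost(front))
```

Both profiles integrate to the same quantity \(Q\), so the printed numbers compare like for like.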
I've been using the well-developed NexT theme for this Hexo blog since I built it. The old theme was fantastic, with a broad header, strong local search, a neat animated two-column layout and various comment systems including valine^{[1]} and gitment^{[2]}. However, it's a double-edged sword to have so many features bundled, and more importantly, presented right within such a simple layout. Eventually, I realized that I don't actually need the spacious header or the animated sidebar (I do, though, need local search, and we'll return to that below). After exhaustively browsing the Hexo theme list, I eventually chose Apollo.
Pull from the master branch on GitHub.
1  npm install --save hexo-renderer-jade hexo-generator-feed hexo-generator-sitemap hexo-browsersync hexo-generator-archive 
Then, in the _config.yml under Hexo's root directory, set
1  theme: apollo 
and match your logo, favicon and Google Analytics API in the _config.yml under theme/Apollo.
There's no Gitment support in the original Apollo, so I had to embed it manually. First, apply here for a new OAuth Application. Make sure the callback URL is the URL of the blog. You'll get a client ID and client secret after successfully finishing this step. Next, in the theme's _config.yml, add
1  gitment: 
In layout/partial/comment.jade, add
1  if theme.gitment.enable 
You can now log into your GitHub account and initialize the comments on any page or post.
The powerful local search is probably my favorite feature of the NexT theme, but Apollo does not support it. I referred to this repo, which already implemented the feature using Tipue Search. To enable local search, go to the root directory of your blog and run
1  npm install hexo-generator-tipuesearch-json --save 
In the index.md you just generated, add
1  <form id="searchform" style="text-align:center;"> 
There are also several things I customized, e.g. the color palette, font families and post width. They are not trivial if you're really keen on turning your blog into a feast for the eyes, but I won't discuss in detail which properties I changed (it's pointless to remember all of them). They're all in source/css/apollo.css and all you need is some basic knowledge of HTML and CSS. However, I do want to recommend the CSS Format package for Sublime Text. It allows you to switch painlessly between compact, compressed, expanded and even more CSS styles. It literally saved my day.
The strategy is running on Amazon EC2, with the portfolio value pulled dynamically to plot this using plotly. The timezone is set to my local time, i.e. Europe/Amsterdam. Cheers!
Update April 11:
The strategy went offline before I realized my AWS bill was overdue and the server was down because of that... Performance excluding and including losses due to holding costs, forced execution, etc. (which I would call the "true performance", though it is actually summarized from two accounts) is compared below.
| | Gross Performance | Adjusted Performance |
| --- | --- | --- |
| Timespan in Days | 78.72 | 78.72 |
| Cumulative Returns | +1386.17% | +651.53% |
| Annualized Returns | +6431.49% | +3022.94% |
| Sharpe Ratio | 7.35 | 7.06 |
| Maximum Drawdown | | 8.05% |
Update June 3:
Since the mass slump this January, the strategy has experienced a remarkable decline in performance, possibly due to market panic and the corresponding flood of orders. Now that it's no longer profitable, I'm sharing part of the code in the main file below. Feel free to leave suggestions in the comments!
1  # Author: Allen Frostline / Version 0.1.8 
Coinbase has just launched BCH as an alternative to BTC because of the high transaction fees and annoying delays of the latter. As a result, the USD/BCH exchange rate rocketed tenfold within 24 hours. Together with BTC, mainstream altcoins like ETH and ETC have faced a significant slump since a week ago. On the contrary, XRP, thanks to rumored endorsement from big companies like Amazon, has appreciated by almost 100% over the past two weeks.
This is the cryptocurrency market. Notorious for its incredible risks, with comparatively high volatility and liquidity (within an exchange; for the special case of inter-exchange liquidity of BTC markets, you can check out this previous post of mine), the huge market is also globally desirable for any quant trader: well-maintained APIs, low fees, free information including 3rd-level order books, and various order types allowing for any kind of derivative you'd like to try. Earlier this year, I took a brief look into this market, and its varied properties did attract me. Different from then, when I mainly focused on the inefficiency of the market as a whole, I'd like to write some real strategies this time, as basically all my grad school applications are finished and there's a wonderful holiday waiting for me. I'll have enough time to build my own wheel, and to develop and test my lovely strategies.
Some code has already been written since yesterday, but I don't intend to present it here, as it's highly unstable and needs many more tests and much more polish. However, I do have a framework for now, which I'd like to write down here so that I can proceed in a more orderly way. This is gonna be a series, as I've already put in the title, and hopefully I can finish everything below by 2018.
This part is actually already finished. It is a framework (for now) specifically designed for the Poloniex API, which requires private keys so that the bot can send real orders. Since there's no market order in Poloniex's API, I had to write my own based on limit orders and order book information. It is still a command-line version, which of course could easily be embedded into a web-based or desktop application, but I don't think that's of any practical necessity now. ANSI colors are enough for efficient yet enjoyable monitoring. The only issue now is the speed of the system. However, almost 95% of the delays come from web data requests rather than the strategies themselves, so rewriting the whole system in C++ wouldn't help either.
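The market-order emulation can be sketched roughly as follows: walk the ask side of the book until the target size is covered, then send a limit order at the deepest touched price level. The book format and all numbers below are made up for illustration.

```python
# Fabricated ask side of an order book: (price, size) per level.
asks = [(100.0, 0.5), (100.1, 1.2), (100.3, 2.0), (100.7, 5.0)]

def marketable_limit_price(book, target):
    """Return the limit price needed to fill `target` against `book`."""
    filled = 0.0
    for price, size in book:
        filled += size
        if filled >= target:
            return price
    raise ValueError('not enough depth for this order size')

print(marketable_limit_price(asks, 3.0))  # fills through three levels
```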
The first strategy is almost done and is currently under non-trading testing. It is based on a Long Short-Term Memory neural network with an uncertain number of sigmoid activation layers. Although I named it forecasting, it is in fact not a regression. Rather, I one-hot encoded the optimal weighted portfolio, determined ex post by the historical Sharpe ratio on a minute basis. By doing so, I translated the multivariate regression problem (which is, in my experience, fairly hard for most models) into a categorical classification problem. By using a sigmoid activation layer for the outputs, I can interpret the classification probabilities as the predicted weight vector and validate it using loss functions like MSE or cross-entropy. The result is interesting, and I'll keep it for the next post.
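The labelling idea can be sketched like this: mark, for each minute, the asset with the best trailing Sharpe ratio and one-hot encode it as the classification target. The random returns and the window length below are purely illustrative.

```python
import numpy as np

# Fabricated minute returns: 300 minutes x 4 assets.
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=(300, 4))
window = 60  # trailing window for the Sharpe ratio

labels = []
for i in range(window, len(returns)):
    r = returns[i - window:i]
    sharpe = r.mean(axis=0) / r.std(axis=0)  # per-asset trailing Sharpe
    labels.append(sharpe.argmax())           # best asset = class label

onehot = np.eye(returns.shape[1])[labels]    # one row per minute, rows sum to 1
print(onehot.shape)
```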
I got this inspiration from the so-called "triangular arbitrage" in foreign exchange markets. However transparent information is across the whole market, traders cannot access and digest everything at the very same time, so price discrepancies occur from time to time. By trading USD for EUR, then for CNY, and then back for USD, arbitrage opportunities may exist thanks to such discrepancies in between. However, as this is quite a classic strategy, thousands of traders are of course already doing this sort of thing in the cryptocurrency market, and thus for paths of length 4 (including both the beginning and ending currencies), arbitrage opportunities are too transient to be caught. Based on this idea, by extending the trading path's length to a larger but bounded, uncertain number, I expect to find more such opportunities at the cost of higher slippage. I'm considering whether theories about Markov chains and graph theory can be applied here, but have not yet figured out any concrete idea.
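One concrete graph-theoretic angle (a sketch, not the strategy itself): take \(-\log\) of every exchange rate so that a profitable cycle of any length becomes a negative-weight cycle, then detect it with Bellman-Ford. The rates below are fabricated so that a length-4 arbitrage exists.

```python
import math

# Fabricated exchange rates containing an arbitrage:
# USD -> EUR -> CNY -> USD multiplies to 0.9 * 8.0 * 0.145 = 1.044 > 1.
rates = {
    ('USD', 'EUR'): 0.9, ('EUR', 'CNY'): 8.0,
    ('CNY', 'USD'): 0.145, ('EUR', 'USD'): 1.1,
}
nodes = {c for pair in rates for c in pair}
edges = [(a, b, -math.log(r)) for (a, b), r in rates.items()]

def has_arbitrage(nodes, edges):
    # Bellman-Ford from a virtual source at distance 0 to every node.
    dist = {v: 0.0 for v in nodes}
    for _ in range(len(nodes) - 1):
        for a, b, w in edges:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
    # A further relaxation succeeds iff a negative-weight cycle exists.
    return any(dist[a] + w < dist[b] for a, b, w in edges)

print(has_arbitrage(nodes, edges))  # True for the rates above
```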
As the name implies, this is merely the combination of strategies 1 and 2. The insight behind it is also quite simple: by basing our transactions on a coin, e.g. BTC, we're binding our excess returns to the absolute performance of the coin itself. There are two solutions to this dilemma:
I cannot open a margin account, period.
Merry Christmas!
This is part of the preliminary data analysis for a course project. Data is collected from the National Bureau of Statistics of China, in units of thousand tons. In the beginning, I start with a comprehensive table called data, where for each province there's a 20-by-31 matrix, i.e. transport to each province w.r.t. different years. Therefore, it's convenient to extract the actual transport matrix for each year using the groupby method of Pandas.
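The groupby idea can be sketched on a toy table (column names and numbers are made up, not the actual dataset): a long table with a Year column is split into one flow matrix per year.

```python
import pandas as pd

# Fabricated long-format transport table.
df = pd.DataFrame({
    'Year': [2010, 2010, 2011, 2011],
    'From': ['A', 'B', 'A', 'B'],
    'To':   ['B', 'A', 'B', 'A'],
    'Tons': [10, 20, 30, 40],
})

# One origin-by-destination matrix per year.
matrices = {
    year: g.pivot(index='From', columns='To', values='Tons').fillna(0)
    for year, g in df.groupby('Year')
}
print(matrices[2011])
```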
The most important package in this project is networkx.
1  import warnings 
There are two functions: draw_networkx_edges_with_arrows for drawing directed edges, and plot_network for drawing whole networks.
1  def draw_networkx_edges_with_arrows(G, pos, width, edge_color, alpha=0.5, ax=None): 
As I mentioned in the beginning, the transport matrices here are extracted by groupby. For further usage, we save them by setting the parameter save in plot_network to True.
1  rail = data.ix[:,6:-1].fillna(0).astype(int).groupby(data.Year) 
Key package here is imageio
. I've set the limit for maximum pixel to 1e10 in case there's any overflow due to large figures in the above steps.
1  import imageio 
Well, I have to say it looks gorgeous.
In this research we try to use as much information on a stock as we can on Ricequant to train a robust binary classifier for expected returns on a rolling basis. As an extra, we create a brand-new accuracy metric based on behavioral economics for model training, which enhanced the fitting of the models (in the language of classical metrics, e.g. accuracy or precision scores) by 3 to 5 times. The advantage of this new metric will be covered in the corresponding section.
First we import the necessary packages we're going to use later.
1  %config InlineBackend.figure_format = 'retina' 
Global configurations.
1  pool = index_components('000050.XSHG') 
Load the raw data and encapsulate into a Pandas panel.
1  today = datetime.today() 
Some further data investigation.
Unshifted data for training:
1  for f in range(17,78): X_.ix[:,f] = X_.ix[:,f].astype('category') 
Shift the data so that they correspond to
\[y_i = clf(X_i).\]
1  X = X_.ix[:1,:] 
(60, 78) (60,)
1    32
0    28
dtype: int64
1  y.describe() 
count    60.000000
mean      0.533333
std       0.503098
min       0.000000
25%       0.000000
50%       1.000000
75%       1.000000
max       1.000000
dtype: float64
1  X.describe(include=['number']) 
0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  

count  60.000000  60.000000  60.000000  60.000000  60.000000  60.000000  60.000000  60.000000  60.000000  60.000000  60.000000  60.000000  60.000000  60.000000  6.000000e+01  60.000000  60.000000 
mean  0.976212  25.990400  25.656650  24.940308  22.879636  21.834747  20.977319  39.063571  35.434052  0.852667  67.581394  26.807436  25.990400  25.173364  1.039211e+09  0.723869  24.547501 
std  0.015776  2.827868  2.770969  2.614787  1.668471  1.437906  1.220115  7.954479  7.286649  0.354867  7.811732  3.090838  2.827868  2.616494  1.263014e+08  0.168018  4.961348 
min  0.948008  22.640000  22.085000  21.392000  20.650167  19.762876  19.285863  25.863478  26.096620  0.413193  54.834433  22.887871  22.640000  22.083778  8.750972e+08  0.527858  16.936824 
25%  0.965606  23.293000  23.043000  22.981875  21.499667  20.572173  19.949996  32.360567  27.708563  0.528046  61.687466  23.800034  23.293000  22.799033  9.242619e+08  0.578824  20.770453 
50%  0.974243  25.244000  24.591000  24.007250  22.387833  21.624111  20.712315  39.207117  34.467175  0.749777  66.749465  26.030462  25.244000  24.397322  1.038170e+09  0.656475  23.760415 
75%  0.987142  29.210500  28.633250  27.278500  24.174250  23.017472  21.933203  46.661330  41.646060  1.179799  73.758827  30.086332  29.210500  27.848188  1.159543e+09  0.852259  27.627880 
max  1.007653  30.286000  29.820000  29.691500  26.243333  24.528111  23.411667  49.526365  46.601986  1.427553  82.767827  31.203224  30.286000  29.442553  1.301344e+09  1.065128  35.042981 
1  X.describe(include=['category']) 
17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  

count  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0  60.0 
unique  1.0  1.0  1.0  2.0  2.0  1.0  1.0  1.0  2.0  3.0  1.0  3.0  1.0  1.0  2.0  2.0  1.0  1.0  3.0  1.0  1.0  1.0  2.0  2.0  2.0  3.0  3.0  3.0  4.0  1.0  2.0  1.0  1.0  2.0  1.0  1.0  1.0  2.0  3.0  3.0  2.0  1.0  1.0  1.0  1.0  1.0  2.0  1.0  2.0  2.0  3.0  3.0  1.0  1.0  1.0  1.0  2.0  1.0  1.0  1.0  1.0 
top  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
freq  60.0  60.0  60.0  59.0  57.0  60.0  60.0  60.0  59.0  47.0  60.0  55.0  60.0  60.0  59.0  51.0  60.0  60.0  53.0  60.0  60.0  60.0  59.0  59.0  59.0  55.0  58.0  49.0  51.0  60.0  59.0  60.0  60.0  59.0  60.0  60.0  60.0  51.0  48.0  57.0  57.0  60.0  60.0  60.0  60.0  60.0  55.0  60.0  59.0  59.0  51.0  41.0  60.0  60.0  60.0  60.0  59.0  60.0  60.0  60.0  60.0 
1  unbalance = sum(y==1)/len(y) 
0.53333333333333333
1  data = X 
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', 'y'], dtype='object')
Quick look at the distribution of y.
1  if verbose: 
First, we can see that the target variable is distributed quite equally, so we won't take any action to deal with an imbalanced dataset. We now present the continuous data using boxplots (described in the following image).
Boxplot of y against continuous variables.
1  if verbose: 
Pairplot of all continuous variables.
1  if verbose: 
Dummy encoding for categorical variables.
1  for i in range(17,78): 
Drop columns that contain only one value.
1  mask = data.std() == 0 
1  X = data_valid.drop('y', axis=1) 
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '19#0.0', '19#100.0', '21#100.0', '21#0.0', '24#100.0', '24#0.0', '26#100.0', '26#0.0', '26#100.0', '28#0.0', '28#100.0', '29#100.0', '29#0.0', '30#100.0', '30#0.0', '30#100.0', '31#200.0', '31#100.0', '31#0.0', '31#100.0', '32#0.0', '32#100.0', '36#100.0', '36#0.0', '36#100.0', '37#0.0', '37#100.0', '40#0.0', '40#100.0', '41#0.0', '41#100.0', '42#100.0', '42#0.0', '42#100.0', '45#100.0', '45#0.0', '49#0', '49#1', '51#0', '51#1', '53#0', '53#1', '54#0', '54#1', '55#0', '55#1', '56#0', '56#1', '57#0', '57#1', '58#0', '58#1', '59#0', '59#1', '60#0', '60#1', '61#0', '61#1', '62#0', '62#1', '64#0', '64#1', '65#0', '65#1', '66#0', '66#1', '68#0', '68#1', '69#0', '69#1', '70#0', '70#1', '72#0', '72#1', '75#0', '75#1', '76#0', '76#1']
Variance Ranking
1  vt = VarianceThreshold().fit(X) 
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '16', '31#0.0', '36#0.0', '42#0.0', '65#0', '70#0', '70#1']
Random Forest
1  model = RandomForestClassifier() 
['16', '2', '13', '14', '9', '8', '0', '1', '15', '10', '7', '6', '68#0', '70#1', '4', '64#0', '11', '30#0.0', '3', '5']
Chi2 Test
1  X_minmax = MinMaxScaler([0,1]).fit_transform(X) 
['42#100.0', '64#1', '36#100.0', '70#0', '68#1', '40#100.0', '31#100.0', '60#1', '76#1', '51#0', '30#100.0', '32#100.0', '24#100.0', '21#100.0', '53#0', '30#100.0', '57#1', '58#1', '59#1', '62#1']
Recursive Feature Elimination (RFE) with a logistic regression model.
1  rfe = RFE(LogisticRegression(),20) 
['1', '2', '3', '4', '5', '6', '7', '8', '10', '11', '12', '13', '14', '16', '36#0.0', '40#0.0', '42#0.0', '68#0', '70#0', '70#1']
The final selection of features is the union of all previous sets.
1  features = np.hstack([feat_var_threshold,feat_imp_20,feat_scored_20,feat_rfe_20]) 
Final features (46 in total): 0, 1, 10, 11, 12, 13, 14, 15, 16, 2, 21#100.0, 24#100.0, 3, 30#100.0, 30#0.0, 30#100.0, 31#100.0, 31#0.0, 32#100.0, 36#100.0, 36#0.0, 4, 40#0.0, 40#100.0, 42#100.0, 42#0.0, 5, 51#0, 53#0, 57#1, 58#1, 59#1, 6, 60#1, 62#1, 64#0, 64#1, 65#0, 68#0, 68#1, 7, 70#0, 70#1, 76#1, 8, 9
1. Split the training and testing data (ratio: 3:1).
1  data_clean = data_valid.ix[:,features.tolist()+['y']] 
Clean dataset shape: (60, 47)
Train features shape: (45, 46)
Test features shape: (15, 46)
Train label shape: (45,)
Test label shape: (15,)
2. PCA visualization of training data
PCA plot
1  if verbose: 
Lmplot
1  if verbose: 
3. A New Accuracy Metric Based on Utility and RiskAversion
Instead of using existing accuracy or error metrics, e.g. accuracy scores and log loss, we come up with our own metric that suits this scenario better. According to classical utility theory, the utility of the expected net return of a transaction should at least follow these properties:
Mathematically, therefore, we know a well-behaved utility function \(U(x)\) has:
However, since the late 20th century, this setting of utility has been widely criticized, mainly by behavioral economists. They conducted a huge number of empirical experiments and showed how poorly such utility models work when the variation of risk aversion is considered. Risk aversion was originally introduced to capture a human's aversion to uncertainty. In classical economics, there is a range of measures to depict such aversion. One of the most famous is the Arrow–Pratt measure of absolute risk aversion (ARA), which is defined based on the utility function:
\[ARA = -\frac{U^{\prime\prime}(x)}{U^{\prime}(x)}.\]
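As a quick worked example (added here for illustration), the exponential utility has constant absolute risk aversion, which is why it is called CARA:

```latex
U(x) = -e^{-ax}
\quad\Rightarrow\quad
U'(x) = a e^{-ax},\qquad
U''(x) = -a^2 e^{-ax},\qquad
-\frac{U''(x)}{U'(x)} = a.
```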
The Arrow–Pratt measure of absolute risk aversion is successful not only because it captures the concavity of the utility, but also because it can be specialized into many cases, mainly w.r.t. different classical utility functions like exponential or hyperbolic absolute utility. However, it is not in line with common sense, as pointed out by Daniel Kahneman and Amos Tversky in their prospect theory in 1979. The theory has been further developed since 1992 and is now accepted as a psychologically more realistic model of uncertainty perception.
Different from the classical expected utility theory, the prospect theory specifies the utility in the following four implications:
Considering that the notion of "most people" is typically based on the fact that most investors are more or less risk-averse, we simplify the model with the following assumptions:
while for the loss aversion implication, we don't take it into consideration, as its influence turned out to be minuscule compared with the loss of model simplicity.
Therefore, with the previous four assumptions, we can come up with a nice utility function w.r.t. prediction accuracy:
\[U(x) = sgn(x - 1/2)|2x - 1|^{2^{logit\left(\frac{r+1}{2}\right)}}\]
where \(sgn(\cdot)\) is the sign function and \(logit(\cdot)\) is the logit function, which is also known as the inverse of the sigmoidal "logistic" function:
\[logit(x)=\ln\left(\frac{x}{1x}\right).\]
It is easy to validate that our utility satisfies this configuration and, because of its monotonicity and continuity on \([0,1]\), is a well-behaved accuracy metric for the learning algorithms that follow.
Utility Curve
1  f = lambda x, r: (2*(x>0.5)-1)*abs(2*x-1)**(2**logit((r+1)/2)) 
1  if verbose: 
1  def custom_score(y_true, y_pred): 
1  seed = 7 
First let's have a quick spotcheck.
1  # Some basic models 
LR: (0.117) +/- (0.436)
LDA: (0.518) +/- (0.163)
KNN: (0.255) +/- (0.263)
DT: (0.255) +/- (0.263)
GNB: (0.106) +/- (0.393)
SVC: (0.117) +/- (0.436)
Let's first look at ensemble results.
1. Bagging (Bootstrap Aggregation)
Prediction of a bagging model is the average of all submodels.
Bagged Decision Trees
Bagged Decision Trees perform best when the variance in the dataset is large.
1  cart = DecisionTreeClassifier() 
(0.158) +/- (0.312)
Random Forest
Random forest is a famous extension of bagged decision trees. It is usually more precise but slower, especially for a large number of leaves.
1  num_trees = 100 
(0.082) +/- (0.295)
Extra Trees
Extra randomness is introduced in search of further precision.
1  num_trees = 100 
(0.015) +/- (0.484)
2. Boosting
Boosting ensembles a sequence of weak learners for better performance.
AdaBoost
AdaBoost simply takes the weighted average of the results of a series of weak learners, updating the weight vector in each iteration.
1  model = AdaBoostClassifier(n_estimators=100, random_state=seed) 
(0.291) +/- (0.387)
Stochastic Gradient Boosting
Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT), is a generalization of boosting. It uses arbitrary differentiable loss functions and so is often more accurate and effective.
1  model = GradientBoostingClassifier(n_estimators=100, random_state=seed) 
(0.291) +/- (0.387)
Extreme Gradient Boosting
A (usually) more efficient gradient boosting algorithm by Tianqi Chen.
1  model = XGBClassifier(n_estimators=100, seed=seed) 
(0.189) +/- (0.459)
3. Hyperparameter tuning
As a matter of fact, hyperparameter tuning can matter a lot here, and thus to actually determine which models are best, we need to run grid search with cross-validation on the training dataset for the best scores and the corresponding model configurations.
1  estimator_list = [] 
Logistic Regression
1  lr_grid = GridSearchCV(estimator = LogisticRegression(random_state=seed), 
0.3726887283268559
{'penalty': 'l1', 'C': 1}
Linear Discriminant Analysis
1  lda_grid = GridSearchCV(estimator = LinearDiscriminantAnalysis(), 
0.5622889460982887
{'n_components': None, 'solver': 'svd'}
Decision Tree
1  dt_grid = GridSearchCV(estimator = DecisionTreeClassifier(random_state=seed), 
0.34852862485121184
{'criterion': 'gini', 'max_depth': None, 'max_features': None}
KNearest Neighbors
1  knn_grid = GridSearchCV(estimator = KNeighborsClassifier(), 
0.30539557863515217
{'leaf_size': 2, 'algorithm': 'ball_tree', 'n_neighbors': 5, 'p': 1}
Random Forest
1  rf_grid = GridSearchCV(estimator = RandomForestClassifier(warm_start=True, random_state=seed), 
0.30113313283861066
{'bootstrap': True, 'max_depth': 5, 'n_estimators': 100, 'max_features': None, 'criterion': 'entropy'}
Extra Trees
1  ext_grid = GridSearchCV(estimator = ExtraTreesClassifier(warm_start=True, random_state=seed), 
0.2601389681995039
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 100, 'max_features': 20, 'criterion': 'entropy'}
AdaBoost
1  ada_grid = GridSearchCV(estimator = AdaBoostClassifier(random_state=seed), 
0.30113313283861066
{'n_estimators': 200, 'algorithm': 'SAMME', 'learning_rate': 0.1}
Gradient Boosting
1  gbm_grid = GridSearchCV(estimator = GradientBoostingClassifier(warm_start=True, random_state=seed), 
0.4376019643046027
{'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.01, 'max_features': None}
Extreme Gradient Boosting
1  xgb_grid = GridSearchCV(estimator = XGBClassifier(nthread=1, seed=seed), 
0.556562176665963
{'gamma': 0, 'min_child_weight': 1, 'max_depth': 5, 'learning_rate': 0.01, 'n_estimators': 200}
Support Vector Classification
1  svc_grid = GridSearchCV(estimator = SVC(probability=True, class_weight='balanced'), 
0.3422932503617775
{'gamma': 0.01, 'C': 0.1}
4. Voting ensemble
1  best_score_list_rounded = [round(s,3) for s in best_score_list] 
| model | LR | LDA | DT | KNN | RF | EXT | ADA | GBM | XGB | SVC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| score | 0.373 | 0.562 | 0.349 | 0.305 | 0.301 | 0.26 | 0.301 | 0.438 | 0.557 | 0.342 |
| rank | 3 | 0 | 4 | 6 | 7 | 9 | 7 | 2 | 1 | 5 |
1  # Create sub models 
(0.516) +/- (0.200)
It is clear that the ensemble model further enhanced the performance of the separate models. Now we try to make actual predictions and see if the results are robust.
5. Make predictions
1  model = ensemble 
1  print('Unbalance of the data: {:.3f}'.format(unbalance)) 
Unbalance of the data: 0.533
Now apart from the utility, we can check our prediction based on some other metrics, e.g.:
Accuracy
which is defined by
\[\begin{align*}Accuracy&=\frac{True\ Positive+True\ Negative}{Total\ Population}\\&=\frac{True\ Positive+True\ Negative}{True\ Positive+False\ Positive+True\ Negative+False\ Negative}\\&=\frac{1}{n}\sum_{i=1}^n\mathbb{1}(\hat{y}_i=y_i)\end{align*}\]
and is bounded within \([0,1]\), where \(1\) indicates perfect prediction. This is also called total accuracy, and it gives the percentage of correct guesses overall.
1  ac = accuracy_score(y_test, y_pred) 
Accuracy: 0.667
Precision
which is defined by
\[Precision = \frac{True\ Positive}{True\ Positive+False\ Positive}=\frac{\sum_{i=1}^n\mathbb{1}(\hat{y}_i=1\mid y_i=1)}{\sum_{i=1}^n[\mathbb{1}(\hat{y}_i=1\mid y_i=1)+\mathbb{1}(\hat{y}_i=1\mid y_i=0)]}.\]
Similar to accuracy, this is also bounded and indicates perfect prediction when the value is 1. However, precision gives intuition about the percentage of correct guesses among all your positive guesses, so in this case the probability that your actual transaction is in the right direction.
1  pc = precision_score(y_test, y_pred) 
Precision: 0.818
Recall
which is defined by
\[Recall = \frac{True\ Positive}{True\ Positive+False\ Negative}=\frac{\sum_{i=1}^n\mathbb{1}(\hat{y}_i=1\mid y_i=1)}{\sum_{i=1}^n[\mathbb{1}(\hat{y}_i=1\mid y_i=1)+\mathbb{1}(\hat{y}_i=0\mid y_i=1)]}.\]
Recall is also bounded and indicates perfect prediction when it's 1, but different from precision, it gives intuition about the percentage of actual signals being predicted, i.e. in this case, the probability that you catch an actual appreciation.
1  re = recall_score(y_test, y_pred) 
Recall: 0.750
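A tiny worked example of the three metrics above, computed directly from the confusion-matrix counts on fabricated labels (not the notebook's data):

```python
# Fabricated ground truth and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)   # (3 + 3) / 8 = 0.75
precision = tp / (tp + fp)          # 3 / 4 = 0.75
recall = tp / (tp + fn)             # 3 / 4 = 0.75
print(accuracy, precision, recall)
```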
Although not included in this notebook, we think it is very important and encouraging to mention what these scores mean compared with those obtained when other scoring functions are used in grid search. As a matter of fact, the average accuracy given by the ensemble model using accuracy or log loss directly is much lower than the figures above. In general, the model prediction accuracy has been improved from 15%-25% to 60%-80%, i.e. 2 to 5 times. The effect of introducing this utility-like scoring function for hyperparameter tuning is substantial, though of course it needs further theoretical support.
Lastly, let's check this utility value for the testing dataset.
1  cs = custom_score(y_test, y_pred) 
Utility: 0.287
which is thus robust (even higher, in fact) out of sample.
The ensemble model I've just shown is quite naive, I would say, and far from "good". The metric is still more or less arbitrary and the algorithm is rather slow (so I cut the window length to 3 from the initial 60, which was intended to train a model on 5 years of data), and on an unprofessional platform like Ricequant we're not able to run through the whole market and search for the best portfolio, i.e. the most predictable stocks. In the backtest strategy I implemented based on this research, I only chose 5 stocks arbitrarily from the index components of 000050.XSHG, and this can have an unpredictable downside effect on the model performance. A more desirable idea would be to set up a local backtest environment and implement this process with the help of GPUs and faster languages like C++. Moreover, overfitting is possible, very possible, and thus whether this makes a good strategy needs much more validation work.