I have recently been playing a billiards game in which you can enter a series of exciting tournaments. In each tournament you pay an entrance fee of, for example, \(\$500\), to potentially win a prize of, say, \(\$2500\). There are various kinds of tournaments, with entrance fees ranging from \(\$100\) up to over \(\$10000\). After hundreds of games, my winning rate stabilized around \(58\%\), which is actually pretty good as it significantly beats random draws. A natural question therefore came to my mind: is there an optimal strategy?
Well, I think so. I'll list two strategies below and explore their potential optimality. We can reasonably model these tournaments as repeated betting with a fixed physical probability \(p\) of winning and odds^{[1]} of \((d-1):1\) against ourselves. Given that tournament entrance fees are sufficiently finely spaced, we may model the fraction of our balance wagered as a real variable and maximize our long-run profitability. Without loss of generality, let's assume an initial balance of \(M_0=1\) and that money in this world is infinitely divisible. The problem then becomes the determination of the optimal \(x\in[0,1]\) s.t. the expected return is maximized. Nonetheless, depending on the interpretation of this problem we have several solutions; some are intriguing while others may be frustrating.
Let's first take a look at potential values of \(x\) and the corresponding balance trajectories \(M_t\). For any \(0 \le x \le 1\), we have probability \(p\) of getting the wagered \(x\)-fraction of our balance back \(d\)-fold and probability \(1-p\) of losing it, that is \[\text{E}(M_{t+1}\mid\mathcal{F}_t) = (1-x)M_t + p\cdot xdM_t + (1-p)\cdot 0 =[1 + (pd-1)x] M_t,\] which indicates \(M_t\) is a submartingale^{[2]}, as in this particular case \(p=0.58\), \(d=5\) and thus \(pd=2.9 > 1\). So the fraction maximizing expected return is \(x^* = 1\), which is rather aggressive and yields a ruin probability of \(1-p^n\) within the first \(n\) bets. Simulation supports our worries: not once did we survive \(10\) bets in this tournament, and the maximum balance we ever attained was less than a million.
If we consider \(\log M_t\) instead, then \[\begin{align*}\text{E}(\log M_{t+1}\mid \mathcal{F}_t) &=p\cdot \log[(1-x)M_t + xdM_t] +(1-p)\cdot \log[(1-x)M_t + 0]\\ &=p\cdot \log[(1+(d-1)x)M_t] +(1-p)\cdot \log[(1-x)M_t].\end{align*}\] The first-order condition is \[\frac{\partial}{\partial x}\text{E}(\log M_{t+1}\mid \mathcal{F}_t) =\frac{p(d-1)}{1+(d-1)x}-\frac{1-p}{1-x} = 0 \quad\Rightarrow\quad x^* = \frac{pd-1}{d-1}=0.475,\] which is more conservative and therefore should survive longer than the previous strategy. Simulation gives the following trajectories: even the worst run beat the best we got when \(x=1\).
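The two strategies can be compared with a quick Monte-Carlo sketch (my own code, using the parameter values from above; the function and variable names are mine):

```python
import random

def simulate(x, p=0.58, d=5, steps=500, m0=1.0, seed=0):
    """Simulate one balance trajectory, betting a fixed fraction x each round."""
    rng = random.Random(seed)
    m = m0
    for _ in range(steps):
        stake = x * m
        if rng.random() < p:
            m += (d - 1) * stake   # win: the stake is returned d-fold
        else:
            m -= stake             # loss: the stake is gone
        if m == 0.0:
            break                  # ruined, no coming back
    return m

kelly = (0.58 * 5 - 1) / (5 - 1)   # x* = (pd - 1)/(d - 1) = 0.475
```

With \(x=1\) a single loss wipes the balance out, while the Kelly fraction multiplies the balance by a strictly positive factor every round and so can never hit zero.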
According to Doob's martingale inequality^{[3]}, the probability of our balance ever attaining a value no less than \(C = 1\times10^{60}\) within \(T=500\) steps satisfies \[\text{P}\left(\sup_{t \le T}M_t\ge C\right) \le \frac{\text{E}(M_T)}{C} = \frac{M_0}{C} \prod_{t=0}^{T-1}\frac{\text{E}(M_{t+1}\mid\mathcal{F}_t)}{M_t} =\frac{[1+(pd-1)x]^T}{C} \approx 4.6\times10^{79} \gg 1.\] Since this upper bound far exceeds one, Doob's inequality is vacuous here: it does not cap the probability of exceeding \(1\times10^{60}\) within \(500\) steps below one (simulation puts that probability around \(0.31\)). To put it differently, the bound leaves open the possibility of a strategy significantly better than the one given by the Kelly criterion.
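The orders of magnitude above are easy to check numerically (my own arithmetic, using the Kelly fraction \(x^*=0.475\) and \(M_0=1\)):

```python
import math

p, d, T, x = 0.58, 5, 500, 0.475
C = 1e60

growth = 1 + (p * d - 1) * x             # one-step expected growth factor ~1.9025
log10_EMT = T * math.log10(growth)       # log10 of E(M_T), roughly 139.7
log10_bound = log10_EMT - math.log10(C)  # log10 of the Doob bound, roughly 79.7
```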
What is it, then? Or does it actually exist? I have no idea yet, but perhaps exploratory approaches like machine learning will give us some hints; and perhaps the optimal strategy is not static but dynamic.
I recently sold my Nvidia GTX 1080 eGPU^{[1]} after two months of waiting in vain for a compatible Nvidia video driver for macOS 10.14 (Mojave). Whether it's Apple's or Nvidia's fault, I don't care any more. Right away, I ordered an AMD Radeon RX Vega 64 on Newegg. The card arrived two days later and looked sexy at first sight. It was plug-and-play as expected and performed just as well as its predecessor, whether for serious gaming, video editing or anything else. I would have given it a 9.5/10 had I not discovered another issue a couple of days later — wow, there is no CUDA on this card!
Of course there isn't, because CUDA was developed by Nvidia, which has invested great effort in building a more user-friendly deep-learning environment. By comparison, AMD (yes!) used to intentionally avoid head-to-head competition against the world's largest GPU maker, and instead kept making gaming cards with better cost-to-performance ratios. ROCm, an open-source HPC/Hyperscale-class platform for GPU computing that supports cards other than Nvidia's, has narrowed this gap considerably. However, ROCm still does not officially support macOS, so you have to boot into Linux to utilize the computational power of your AMD card, even though you can already game smoothly on your Mac. Sad it is, AMD 😰.
There are, however, several solutions if, like me, you really have to run your code on a Mac and would like to accelerate those Renaissance-length training times with a GPU. The method I adopted uses a framework called PlaidML, and I'd like to walk you through how I installed it and configured my GPU with it.
pip3 install plaidml-keras plaidbench
After installation, we can set up the intended device for computing by running:
plaidml-setup
PlaidML Setup (0.3.5)

Thanks for using PlaidML!

Some Notes:
  * Bugs and other issues: https://github.com/plaidml/plaidml
  * Questions: https://stackoverflow.com/questions/tagged/plaidml
  * Say hello: https://groups.google.com/forum/#!forum/plaidml-dev
  * PlaidML is licensed under the GNU AGPLv3

Default Config Devices:
   No devices.

Experimental Config Devices:
   llvm_cpu.0 : CPU (LLVM)
   opencl_intel_intel(r)_iris(tm)_plus_graphics_655.0 : Intel Inc. Intel(R) Iris(TM) Plus Graphics 655 (OpenCL)
   opencl_cpu.0 : Intel CPU (OpenCL)
   opencl_amd_amd_radeon_rx_vega_64_compute_engine.0 : AMD AMD Radeon RX Vega 64 Compute Engine (OpenCL)
   metal_intel(r)_iris(tm)_plus_graphics_655.0 : Intel(R) Iris(TM) Plus Graphics 655 (Metal)
   metal_amd_radeon_rx_vega_64.0 : AMD Radeon RX Vega 64 (Metal)

Using experimental devices can cause poor performance, crashes, and other nastiness.

Enable experimental device support? (y,n)[n]:
Of course we enter y. Before choosing device 4 (OpenCL with AMD) or 6 (Metal with AMD), I'd like to benchmark the default device, CPU (LLVM). The test command (using MobileNet as an example) is
plaidbench keras mobilenet
and the result shows^{[2]}
Running 1024 examples with mobilenet, batch size 1
INFO:plaidml:Opening device "llvm_cpu.0"
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.6/mobilenet_1_0_224_tf.h5
17227776/17225924 [==============================] - 2s 0us/step
Model loaded.
Compiling network...
Warming up ...
Main timing
Example finished, elapsed: 3.0688607692718506 (compile), 61.17863607406616 (execution), 0.059744761791080236 (execution per example)
Correctness: PASS, max_error: 1.7511049009044655e-05, max_abs_error: 6.556510925292969e-07, fail_ratio: 0.0
Now we run the setup again and choose 4 (OpenCL with AMD). The result is
Running 1024 examples with mobilenet, batch size 1
INFO:plaidml:Opening device "opencl_amd_amd_radeon_rx_vega_64_compute_engine.0"
Model loaded.
Compiling network...
Warming up ...
Main timing
Example finished, elapsed: 2.6935510635375977 (compile), 13.741217851638794 (execution), 0.01341915805824101 (execution per example)
Correctness: PASS, max_error: 1.7511049009044655e-05, max_abs_error: 1.1995434761047363e-06, fail_ratio: 0.0
Finally we run the test against the expected most powerful device, i.e. device 6 (Metal with AMD).
Running 1024 examples with mobilenet, batch size 1
INFO:plaidml:Opening device "metal_amd_radeon_rx_vega_64.0"
Model loaded.
Compiling network...
Warming up ...
Main timing
Example finished, elapsed: 2.243159055709839 (compile), 7.515545129776001 (execution), 0.007339399540796876 (execution per example)
Correctness: PASS, max_error: 1.7974503862205893e-05, max_abs_error: 1.0952353477478027e-06, fail_ratio: 0.0
In conclusion, by utilizing Metal on my Mac together with the external AMD GPU, the training runtime dropped by roughly 87.7% (from 61.18s to 7.52s of execution for 1024 MobileNet examples), and I'm personally quite satisfied with that.
It's been more than two years since my last trip to the Arctic Circle, back when I was still studying in the Netherlands. Our adventurous hike in Abisko, amid the endless mountains of northern Europe, still comes back in my dreams. This time we went to Fairbanks, Alaska — for the aurora, and for another Arctic experience.
We spent five days and six nights in Fairbanks. Apart from the two simple dinners we had on the hike and a bowl of beef noodles at the Arctic Circle camp, the Pump House ended up as the restaurant of choice for our best meals. It is a fine-dining restaurant and probably the best in town, as nearly every local you meet gives this little house a thumbs-up. Eating there was definitely among the most enjoyable parts of our stay in Fairbanks — warming, relaxing, and exciting to the taste buds.
In fact, at the beginning we had also planned to try Turtle Club a bit farther out, since it is also highly rated on Yelp. Our guide had likewise recommended a local (Chinese-run) buffet called AK Buffet, and as it happened one group lunch took us there — a great disappointment, which left us secretly glad we hadn't wasted an earlier meal on it. The end result is as stated above: we ate at the Pump House for three straight days and ordered our way through nearly every recommended dish on the menu. Over those three days, the oysters and steak fell short of expectations, but the seafood lived up to its name. The dish I recommend most is their Seafood Chowder, either on its own or served in a whole bread bowl. The broth is rich, and unlike the usual seafood chowder it carries, beyond the creaminess, a savory depth reminiscent of chicken soup. With a full spoonful, that savoriness and the textures of the assorted seafood hit the palate at once, melting a whole day's fatigue into warmth, comfort and happiness. That is about everything a bowl of seafood chowder means to me. A similar feeling came with their Seafood Risotto: springy shrimp, firm scallops, plump crab. Better still, the rice was cooked to just the right point, neither cloyingly creamy nor underdone — on par with, or even above, the best risotto I had in Europe. Beyond those two, their much-touted King Salmon was merely decent; by comparison the Alaskan Halibut was arguably seared to greater succulence (at least better than the Alaskan Cod on the same page). Finally, I can hardly avoid mentioning their Steamed King Crab — freshly caught Alaskan king crab^{[1]}, steamed, the preparation that best preserves the flavor of seafood, with no seasoning at all, brought to the table before the residual heat has even faded. Cracked open with the dedicated pliers, the meat is warm, soft and springy, with the distinctive aroma of steamed seafood. A whole crab leg in hand, more than a mouthful of meat at a time: that is a satisfaction no place outside Alaska can offer.
By the entrance stands a taxidermied brown bear^{[2]}, taller than a person and adorably goofy; a pity that on all three visits we forgot to photograph it. Above the table we took twice hangs a huge taxidermied moose, rather imposing at first sight, yet quite atmospheric once you've seen it before. The Pump House sits on the bank of the Chena River, the largest river in Fairbanks; the daytime view is said to be lovely, and the outdoor seating open in summer said to be another kind of charm — neither of which we got to see. To close, here is a photo of the entrance found online (since all our visits were after dark, we took none ourselves).
I'm trying to write a chatroom in this post, using the socket package^{[1]} in Python.
The general structure of this problem can be divided into three parts. In the simplest case we have two clients, namely client0 and client1, and a server. Except that the server provides the interface, everything else is the same among these three classes: they inherit from the class socket.socket and have two methods, sending and recving. The two methods loop infinitely so that all requests are handled unattended. Meanwhile, to keep these two functions from interrupting each other, we run them simultaneously using the threading package. The two clients connect to different ports of the same host, and the server listens to both, also in an infinite loop.
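Since the full listings are collapsed below, here is a minimal self-contained sketch of the client side as just described (the class name, host and port are my own placeholders, not from the original code):

```python
import socket
import threading

class ChatClient(socket.socket):
    """A minimal chat endpoint inheriting from socket.socket."""

    def __init__(self, host="127.0.0.1", port=50007):
        super().__init__(socket.AF_INET, socket.SOCK_STREAM)
        self.addr = (host, port)

    def sending(self):
        # Loop forever, shipping console input to the peer.
        while True:
            self.sendall(input("> ").encode())

    def recving(self):
        # Loop forever, printing whatever arrives.
        while True:
            data = self.recv(1024)
            if not data:
                break
            print(data.decode())

    def run(self):
        self.connect(self.addr)
        # Receive on a daemon thread so the two loops don't block each other.
        threading.Thread(target=self.recving, daemon=True).start()
        self.sending()
```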
The code for server.py:

import socket
The code for client0.py:

import socket
The code for client1.py:

import socket
Start server.py first and then the two clients. The terminal screenshot is as below.
Again, this is just a very simple, toy-like chatroom, and there's a lot more you could implement if you wanted to: quitting schemes, front-end delivery, and broadcasting in multi-client setups. However, I'm sure starting from this point won't hurt. Enjoy coding!
socketserver or something else. They may be convenient, but also sometimes redundant. ↩︎

Today is Black Friday, and we were out shopping. Compared with the frenzied, chaotic Singles' Day in China a few days ago, the holiday atmosphere in Chicago feels somehow warmer. Christmas colors have at some point risen over the street corners; strings of white lights sway along both sides of the road, bend after bend, to where the eye can no longer follow. Indeed, online promotions have not dampened Americans' enthusiasm for shopping in person, and Michigan Ave was packed. Most of those brushing past were couples arm in arm, drifting and nestling through the corners of the Windy City like reindeer on some dreamed snowfield. Now and then a family would pass, a father or mother gently holding a child's hand, the child's innocent gaze meeting mine by chance. Besides all that, we came across a Black musician who had spent the whole day playing jazz at a subway entrance. He sat on his stool playing tune after tune without pause, never really exchanging a word with passersby. People came and went, and the voice of his clarinet dissolved into the night.
The greatest joy of shopping lies in the strolling itself; the buying comes second, and the strolling in turn is made up of eating, drinking, playing and enjoying. Eat, drink, play, enjoy: the order of the words hints at their standing in people's hearts. It's the same for me — eating must come first. Many things made me happy while shopping today, but if I had to name my top three sources of happiness, the two great meals would be on the list. Considering the distance to the AMC, and since Gyu-Kaku was once again fully booked, I finally picked Niu Japanese Fusion Lounge, with its seven-hundred-odd reviews (right below the AMC! With a location this unfairly good, I'm bound to come back). Their specialty is all kinds of fresh maki and nigiri, well reviewed on Yelp and OpenTable, but in the end I resolutely went for sashimi and ramen — if a Japanese restaurant's sushi is excellent and the price acceptable, its sashimi is bound to bring me even more explosive happiness; and among all sashimi, after months without a taste of fish, my pick was always going to be salmon, and salmon only. In fact this 4.5-star little place, where you don't even have to queue, fully lived up to my expectations. The thick-cut salmon was plump and fresh, the flesh firm, the grain clear, the knife work clean. Up close there was hardly any fishy smell, only a faint freshness. In the mouth, the thick pieces slid between the teeth like orange jelly, then melted away like butter. The $22 serving had a full nine thick-cut slices, yet was gone in a blink, leaving me sitting there with chopsticks raised, looking around in wistful loss. Which, thinking back, just proves how good it was.
Almost the minute the salmon was finished, the steaming ramen arrived... It was not good. Don't order it, period.
Earlier, at noon, I finally tried the roast duck at Lao Sze Chuan that I had long been eyeing. Whether because the place is simply that popular or because of the Thanksgiving holiday (or both), the line was extremely long. But the front desk wrote everyone's name and number down on paper, effectively letting you take a number and leave, which was quite nice. As in China, the duck can be ordered whole or by the half. Half is plenty for one or two people; for a party of three or four, I'd recommend a whole one. The duck is roasted and carved to order: the carver stands in the middle of the entry hall, methodically slicing away, and once a portion is ready a helper brings it to a table along with four condiments and a serving of freshly steamed pancakes. To my genuine delight, a few minutes later they also brought a complimentary broth simmered from the duck. Sunk in roast-duck happiness, I nearly shed tears over the savory soup. The duck was neither fat nor lean, the crispy skin neither oily nor dry, the broth neither salty nor heavy — Lao Sze Chuan was worth the trip.
So much good food today — I'm happy.
While shopping I came across an enormous Christmas tree, twinkling all over — that felt happy too.
After dinner we watched A Cool Fish (《无名之辈》); I laughed until my jaw nearly dislocated, and by the end I almost cried. Thinking it over — still happiness.
And that's good enough, I suppose.
This is a simple print function, overwritten so that you can specify different colors in Terminal output. To use this feature, you'll need to import the customized print function from the ColorPrint package; the GitHub repo is here.
from ColorPrint import print
and the output is as below (in Terminal):
I find this especially useful when you're trying to focus on a command-line-only workflow and don't want to reinvent the wheel over and over again.
This is the first post of my ambitious plan to enumerate as many key points of the C++ language as I can. These notes are for personal reviewing purposes only and shall definitely not be used commercially by anyone. Please comment below about any missing C++ syntax or features. 👍🏻
Basically there's only one thing that needs attention: for standard libraries we use angle brackets (< and >), and for local libraries we use quotes.
1 
These are files where we declare the functions and classes we want to use or implement in main files.
In C++, by loading the iostream library we can read and write by
1 

Normally libraries come with namespaces, e.g. for iostream we need the std:: prefix every time we need to print something. With namespace declarations we can reduce this redundancy.
1 

We may also use using namespace std;, which sometimes can cause problems (name collisions).
There are a variety of data types in C++. For real numbers we have
Type  Bytes  Range 

float  \(4\)  \(\pm 3.4E\pm38\) 
double  \(8\)  \(\pm 1.7E\pm308\) 
where we should pay enough attention to the \(\pm\). For general integers we have
Type  Bytes  Range 

short  \(2\)  \(-2^{15}\) to \(2^{15}-1\) 
int  \(4\)  \(-2^{31}\) to \(2^{31}-1\) 
long  \(8\)  \(-2^{63}\) to \(2^{63}-1\) 
and for each type we also have an unsigned version that starts from 0 and covers the same length of range.
We may notice that long has a smaller range than float, despite the fact that the former actually costs more bytes than the latter. This is because the 4 bytes (or 32 bits) of a float \(V\) are not stored uniformly in RAM, but rather as
\[V = (-1)^S \cdot M \cdot 2^E\]
where \(S\) is the first bit, \(E\) the second through ninth bits, and \(M\) the tenth bit onward. So in a sense, because float is more "sparse", the long type has a smaller range.
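The bit layout can be verified directly (a small sketch of mine, not from the notes; note the exponent field is stored with a bias of 127, so 1.0f shows 127 there):

```cpp
#include <cstdint>
#include <cstring>

// Split a float into its IEEE 754 fields: sign (1 bit),
// exponent (8 bits), mantissa (23 bits).
struct FloatParts {
    std::uint32_t sign, exponent, mantissa;
};

FloatParts Decompose(float v) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);  // reinterpret the 32 bits safely
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```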
Apart from other fundamental types like char and bool, we can also define our own data types or use types defined in libraries, e.g. std::string. We may also define type aliases like
typedef double OptionPrice;
We have operators for fundamental types:

Function  Operator 

assignment  = 
arithmetic  + - * / 
comparison  > < <= >= 
equality/non-equality  == != 
logical  && || ! 
modulo  % 
In C++ there's a set of shortcuts as follows:

Full Operator  Shortcut 

i = i + 1;  i++; i += 1; 
i = i - 1;  i--; i -= 1; 
i = i * 2;  i *= 2; 
i = i / 2;  i /= 2; 
We may also use prefix and postfix increments in assignment, which are totally different. After

int x = 3;
int y = x++;

we have \(x = 4\) and \(y = 3\). After

int x = 3;
int y = ++x;

we have \(x = 4\) and \(y = 4\).
A general template for a C++ function:
1  resType f(argType1 arg1, argType2 arg2, ...) { 
Notice we may write multiple functions with the same name but different parameter lists, which we call function overloading. Meanwhile, even without overloading we can still call a function taking double on an int, because int takes up fewer bytes and the implicit conversion is safe; we call this widening or promotion. In contrast, narrowing can be dangerous and causes a build warning.
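A tiny sketch of overloading and promotion (the function names are mine):

```cpp
// Two overloads distinguished by their parameter type.
int Twice(int x) { return 2 * x; }
double Twice(double x) { return 2.0 * x; }

// Only a double version exists; calling Half(3) promotes the int
// argument to double (widening, safe).
double Half(double x) { return x / 2.0; }
```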
In C++ we have two kinds of comments.
1  // This is inline comment 
In C++, people usually pass variables into functions in one of two ways: by value or by reference. The first creates a copy of the variable, and nothing happens to the original. With the second, anything we do in the function takes effect on the original variable itself. The original variable must already be declared when we create a reference, so
1  int x = 1; 
will compile, while below will not:
1  int x = 1; 
References can be extremely useful, especially when the original variable is a large object and copying it costs considerable time and memory. However, they are risky when we don't want to mess with the original object when calling a function. So we need const references.
There are two situations we should take care of when using the const keyword with references. First, we can make a reference to a const variable, and we cannot change its value:
1  const int x = 1; 
We may also bind a const reference to a variable even when the original itself is not const:
1  int y = 1; 
In this case we avoid making a copy while also keeping the original variable safe from unexpected editing.
There is a third way of passing a variable: pointers. Pointers are variables that store the memory address of another variable. We declare a pointer by

int* pi; // legal but bad without initialization

which comes with two unique operators: & for the address of a variable, and * for the dereference of a pointer.
1  int i = 123; 
123
You can create a pointer to a piece of dynamic memory for later deletion, in case memory becomes an issue in your program.
1  int *p = new int; 
You can have pointers to a const variable, i.e. you cannot change its value through the pointer.
1  const int x = 1; 
You can also have const pointers to variables: then you can change the value of the variable, but never the pointer (address) itself.
1  int x = 1; 
You can also have const pointers to const variables.
1  const int x = 1; 
Below is a general template for if/else structures in C++.
1  if (condition1) { 
When there are multiple conditions, we can also use the switch keyword.
1  switch (expression) { 
One of the most popular loops is the while loop.
1  while (condition) { 
It also has a variant called the do/while loop,
1  do { 
which differs slightly from the while loop in the order of evaluation.
Another form of loop keeps precise track of the iterator.
1  for (initializer; condition; statement1) { 
There is an unwritten rule that we usually write ++i in statement1 because, compared with i++, which needs to make a copy, ++i is more efficient. However, this argument is debatable, since modern compilers can surely optimize this away.
A simple but intuitive example of a class is describing a person in C++ (here we assume the type string under the namespace std is used):
1  class Person { 
We can also implement member functions in the class, just to make it more convenient:
1  class Person { 
We have three levels of data protection in a class:
This means we can protect data in the class by declaring them private, while accessing them via public member functions:
1  class Person { 
An instance created based on a class is called an object. To create an object, we may need a constructor, a copy constructor and a destructor.
1  class Person { 
Class names are capitalized, like Person, and member data carry a trailing underscore, like name_. According to this coding style we have in Person.h
1 

In Person.cpp we implement the member functions of the class:
1  string Person::GetEmail() { 
Just keep in mind that the constructors as well as the destructor should also be implemented:
1  Person::Person() { 
We can also use the colon syntax for constructors:
1  Person::Person() : name_(""), email_(""), stu_id_(0) {} 
A struct is a class with one difference: struct members are public by default, while class members are private by default.
1  struct Person { 
For a newly created class we cannot simply rely on person2 = person1 to assign the whole object person1 to person2 the way we want; we would have to go through constructors. What we can do instead is overload these operators (e.g. the assignment operator =) specifically for the class.
The overloadable operators include + - * / % ^ & | ~ ! = < > <= >= ++ -- << >> == != && || += -= *= /= &= |= ^= %= <<= >>= [] () -> ->* new new[] delete delete[].
The non-overloadable operators are :: .* . ?: .
1  void Person::operator=(const Person& another_person) { 
However, such overloading does not support chain assignment like person3 = person2 = person1. We need to return a reference in order to support that.
1  Person& Person::operator=(const Person& another_person) { 
where this is a pointer pointing to the object itself.
Another concern is self-assignment, which in some cases can be dangerous and in almost every case is inefficient. To avoid self-assignment we need to detect and skip it.
1  Person& Person::operator=(const Person& another_person) { 
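Putting the pieces together, a chain-friendly, self-assignment-safe operator= for the Person class might look like this (a sketch of mine; the members are public here only to keep it short):

```cpp
#include <string>

class Person {
public:
    Person() : name_(""), email_(""), stu_id_(0) {}
    Person(std::string name, std::string email, int id)
        : name_(name), email_(email), stu_id_(id) {}

    Person& operator=(const Person& another_person) {
        if (this != &another_person) {       // detect and skip self-assignment
            name_ = another_person.name_;
            email_ = another_person.email_;
            stu_id_ = another_person.stu_id_;
        }
        return *this;                        // reference enables chain assignment
    }

    std::string name_;
    std::string email_;
    int stu_id_;
};
```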
In C++, a function can only be defined once. This is called the One Definition Rule (ODR). To avoid multiple inclusion of header files, we use include guards. This is done by defining a macro at the beginning of each header file.
1 

Here we introduce two of the most useful containers in the C++ Standard Library: std::vector and std::map. To initialize an empty vector, we use
1 

and to initialize with a specific size, we do
1 

On the other hand, map containers are like dict in Python, which allows you to use keys of any type, e.g. std::string.
1 

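A small sketch combining the two containers (word-frequency counting; the names are mine):

```cpp
#include <map>
#include <string>
#include <vector>

// Count how often each word appears: the vector holds the input,
// the map holds the counts, keyed by std::string.
std::map<std::string, int> CountWords(const std::vector<std::string>& words) {
    std::map<std::string, int> freq;
    for (const auto& w : words) {
        ++freq[w];  // operator[] default-constructs the count to 0 on first use
    }
    return freq;
}
```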
Data abstraction refers to the separation of interface (public functions of the class) and implementation:
Encapsulation refers to combining data and functions inside a class so that data is only accessed through the functions in the class.
We can declare a function or class as a friend s.t. it can access the private and protected members of the class.
1  class MyClass { 
and you can implement and use this function change_data globally to change my_data.
Inheritance refers to building on existing classes in order to:
A simple example would be
1  class Student { 
with meanwhile
1  class Employee { 
Apparently a lot of functions and data are repeated. What we do instead is build a base class and reuse it in two derived classes. Note:
In actual coding, this is what we do:
1  class Person { 
with
1  class Student : public Person { 
and
1  class Employee : public Person { 
To initialize a base class, we define constructors just like what we did before:
1  Person::Person(string name, string email) : name_(name), email_(email) {} 
while for derived classes, we need to call the base class constructor
1  Student::Student(string name, string email, string major) : Person(name, email), major_(major) {} 
A derived class can access members in the base class, subject to protection-level restrictions. Protection levels public and private have their regular meanings in an inheritance class hierarchy:
- A derived class cannot access private members of a base class.
- A derived class can access public members of a base class.
A derived class can also access protected members of a base class. If a class has protected members:
- that class can access them;
- a derived class of that class can access them;
- everyone else cannot access them.
A base class uses the virtual keyword to allow a derived class to override (provide a different implementation of) a member function. If a function is virtual in the base class:
- the base class provides an implementation for that function, which we call the default implementation;
- a derived class inherits the function interface (definition) as well as the default implementation;
- a derived class can provide a different implementation for that function (but it does not have to).
1  class Base1 { 
and then functions like Fun1 can be overridden in derived classes. Note that the base class has to implement all of its functions, whether they're virtual or not.
If we don't give a default implementation of a virtual function, we call it pure virtual. This is done by appending = 0 to its declaration.
1  class Base2 { 
In this case the base class does not need to implement Fun1, and in contrast the derived class must do so. A class with pure virtual functions is called an abstract class. Note that we cannot instantiate (make an object of) an abstract class; a derived class becomes instantiable only once every pure virtual function is implemented.
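A minimal abstract-class sketch (the class names are my own):

```cpp
#include <string>

// Abstract base: Area() is pure virtual, so Shape itself cannot be instantiated.
class Shape {
public:
    virtual ~Shape() {}
    virtual double Area() const = 0;                      // pure virtual
    virtual std::string Name() const { return "shape"; }  // default implementation
};

class Square : public Shape {
public:
    explicit Square(double side) : side_(side) {}
    double Area() const { return side_ * side_; }         // must be provided
    std::string Name() const { return "square"; }         // optional override
private:
    double side_;
};
```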
There's a slight difference between normal member functions, virtual functions and pure virtual functions during inheritance.
We can use a pointer or a reference to a base-class object to refer to an object of a derived class; that derived objects are substitutable for base objects in this way is the Liskov Substitution Principle (LSP).
1  Option* option1 = nullptr; 
A more direct example is as follows. Instead of writing separately
1  double Price(EuropeanCall option, ...) { 
we can use polymorphism and write it w.r.t. the base class Option, using a reference or a pointer
1  double Price(Option& option, ...) { 
For variables we declare constancy by
1  const int val = 10; 
For constant objects, e.g.
1  class Student { 
when we call
1  const Student a('Allen', 'allen@gmail.com'); 
we get a compile error, because the compiler does not know that the function GetName is constant. To declare that, we need
1  class Student { 
When we have pure virtual constant member functions, we write: virtual type f(...) const = 0.
A const member function cannot modify data members. The only exception to this rule is mutable data members.
1  class Student { 
The override keyword serves two purposes:
1  class base { 
In the implementation of the pure virtual function foo in the derived class derived1, we do just as the base class tells us. In derived2, with the override keyword we'll get an error for overriding the original virtual function while changing its type; without this keyword we'd get at most a warning.
For a non-static member, changing one instance's data affects that instance alone; nothing happens to other instances of the same class. For a static member function or data member, the association is shared: change it once and it changes for all instances.
1  class Counter { 
A regular function is generally
1  int AddOne(int x) {return x + 1;} 
while a function-object implementation is
1  class AddOne { 
and for the latter we can use its instances as objects, which still work as functions.
1  vector<int> values{1, 2, 4, 5}; 
where AddOne() is an unnamed instance of the class AddOne.
In C++ we can define a function inline as a lambda:

auto f = [](int x, int y) { return x + y; };
The [] is called the capture clause, and its rules are as follows.
- [=] captures everything by value (read but no write access)
- [&] captures everything by reference (read and write access)
- [=, &x] captures everything by value except x, which is captured by reference
- [&, x] captures everything by reference except x, which is captured by value

Below we introduce some features of the STL.
Two of the most commonly seen container methods are begin() and end().
We have binary_search, for_each, find_if and sort.
In the STL, algorithms are implemented as functions, and data structures as containers.
1  int main() { 
1  int main() { 
1  bool PersonSortCriterion(const Person& p1, const Person& p2) { 
Combining STL algorithms with lambdas in C++ can be very efficient: we can use a lambda in a loop without defining a function beforehand.
1  vector<int> v{1, 3, 2, 4, 6}; 
We can also use it as a sorting criterion
1  std::vector<Person> ppl; 
We can have templates of a function:
1  template <class T> T sum(T a, T b) {return a + b; } 
We can also have templates of a class:
1  template <class T> 
1  int x, y; 
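Both template forms in one minimal sketch (the class and function names are mine):

```cpp
// Function template: works for any type supporting operator+.
template <class T>
T Add(T a, T b) { return a + b; }

// Class template: a trivial single-value container.
template <class T>
class Box {
public:
    explicit Box(T v) : value_(v) {}
    T Get() const { return value_; }
private:
    T value_;
};
```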
1 

Specifically, for the open modes we have
Mode  Description 

ios::app  Append to the end 
ios::ate  Go to the end of file on opening 
ios::binary  Open in binary mode 
ios::in  Open file for reading only 
ios::out  Open file for writing only 
ios::nocreate  Fail if you have to create it 
ios::noreplace  Fail if you have to replace 
ios::trunc  Remove all content if the file exists 
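A short round-trip sketch using some of the modes above (the filename is illustrative):

```cpp
#include <fstream>
#include <string>

// Write a word, append another, then read both back.
std::string RoundTrip(const std::string& path) {
    std::ofstream out(path, std::ios::out | std::ios::trunc);  // start fresh
    out << "hello\n";
    out.close();

    std::ofstream app(path, std::ios::app);  // append to the end
    app << "world\n";
    app.close();

    std::ifstream in(path, std::ios::in);
    std::string a, b;
    in >> a >> b;
    return a + " " + b;
}
```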
It's been months since my last update on cryptocurrency arbitrage strategies. The original version has been completely driven off the market, so I decided to develop a new one. The market is primitive and savage in many senses, by which I mean there are supposed to be plenty of inefficiencies and corresponding arbitrage opportunities.
At the top of the page is the backtest PnL of the new strategy 4 from 01/01 up to yesterday, 07/26. I used 1-minute historical order-book data and five spreads for slippage (not sure if that's still too conservative; it needs testing), and benchmarked against the simplest buy-and-hold strategy. It's well known that the whole crypto market has been in a huge slump since late last year, so I guess my trick works quite well. The strategy is now running on my AWS instance with real money, and I'll update this post whenever anything interesting (or frustrating) happens.
Cheers.
Update Aug 3:
I changed the screening-window length and the backtest performance increased more than tenfold. The image at the top has been updated, with Sharpe ratios labelled.
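For reference, a Sharpe ratio like the ones labelled in the chart can be computed from a per-period PnL series roughly like this (my own sketch with a zero risk-free rate; the post's actual labelling code is not shown):

```python
import math

def sharpe(returns, periods_per_year=365 * 24 * 60):
    """Annualized Sharpe ratio of per-period returns (1-minute bars by default)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # sample variance
    if var == 0.0:
        return float("inf")
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)
```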
It's not hard to write a swap function. The most orthodox way, as used in C++ or Java, is via a temporary variable. For example, say we have a = 0 and b = 1, and we'd like to swap the values of these two variables. The pseudocode would be something like:
temp = a
a = b
b = temp
However, a more "Pythonic" way to do so is by literally "swapping" the values in place. Specifically, we don't even need to define a function for it, so the title picture is actually nonsense.
a, b = b, a
How is that handled inside Python? Before answering that question, how is "Pythonic" defined? Well, Pythonic means code that doesn't just get the syntax right but that follows the conventions of the Python community and uses the language in the way it is intended to be used (Abien Fred Agarap^{[1]}). Talking about the conventions of the Python community, we won't be able to miss the famous Zen of Python:
1  import this 
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
So our one-line swap exactly follows these supreme principles: it's beautiful, explicit, simple and perfectly readable. The only remaining question is: what happens when we call a, b = b, a, and what are the technical differences between this lazy trick and the orthodox one?
Well, here is the thing. Like most other programming languages, Python handles assignment statements right to left. So before it actually assigns the value of a to b and vice versa, Python packages the RHS as a tuple temporarily stored in memory. Then it assigns the values of this tuple to the LHS in order. That's it. As a result, unlike the orthodox swap function, which creates a temporary variable temp that stays in memory until collected manually (if we're working in the global environment) or until the function is destroyed, the Pythonic swap briefly occupies extra memory yet frees it automatically thanks to Python's garbage collection. That's a trade-off, and in cases where available memory is critically short, the more orthodox swap function might be preferable.
Just as a supplement, there is in fact a way to swap in place while avoiding the extra memory. The trick is illustrated as follows.
1  a = a + b
2  b = a - b
3  a = a - b
In the case of large integers, we may also use the XOR trick:
1  a = a ^ b
2  b = a ^ b
3  a = a ^ b
My trading bot ceased its loyal 24/7 service this morning. It runs on an Amazon EC2 server with Ubuntu 16.04, and I'm sure this time it's not an unpaid-bill issue. After some digging, I think I've finally figured out the cause of this unexpected strike: asynchronism.
Asynchronism, or in simple terms timing discrepancy, usually means a tiny difference between the local time on your computer/server and the global NTP time. It can be as small as several milliseconds, but in some applications like trading such discrepancies are reckoned intolerable, and any request sent from such computers/servers is ruthlessly rejected. Computers are just machines, and their clocks cannot stay accurate forever. That's why we need (time) synchronization. In fact, EC2 does have such regular synchronization built in, but it seems to happen only once in a rather long period, like days. To shorten the synchronization period and avoid similar issues in the future, I'll use the Amazon Time Sync Service.
First we install the chrony package for synchronization, and open its configuration.
1  sudo apt install chrony 
Append to the opened chrony.conf file the following line.
1  server 169.254.169.123 prefer iburst 
Restart the chrony service.
1  sudo /etc/init.d/chrony restart 
[ ok ] Restarting chrony (via systemctl): chrony.service.
Make sure that chrony is successfully synchronizing time from 169.254.169.123.
1  chronyc sources -v
210 Number of sources = 7

  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 169.254.169.123               3   6    17    12    +15us[  +57us] +/-  320us
^- tbag.heanet.ie                1   6    17    13  -3488us[-3446us] +/- 1779us
^- ec2123423112.eu-west          2   6    17    13   +893us[ +935us] +/- 7710us
^? 2a05:d018:c43:e312:ce77:6     0   6     0   10y    +0ns[   +0ns] +/-    0ns
^? 2a05:d018:d34:9000:d8c6:5     0   6     0   10y    +0ns[   +0ns] +/-    0ns
^? tshirt.heanet.ie              0   6     0   10y    +0ns[   +0ns] +/-    0ns
^? bray.walcz.net                0   6     0   10y    +0ns[   +0ns] +/-    0ns
where ^* denotes the currently synced (preferred) time source.
Finally, check the synchronization report.
1  chronyc tracking 
Reference ID    : 169.254.169.123 (169.254.169.123)
Stratum         : 4
Ref time (UTC)  : Thu Jul 12 16:41:57 2018
System time     : 0.000000011 seconds slow of NTP time
Last offset     : +0.000041659 seconds
RMS offset      : 0.000041659 seconds
Frequency       : 10.141 ppm slow
Residual freq   : +7.557 ppm
Skew            : 2.329 ppm
Root delay      : 0.000544 seconds
Root dispersion : 0.000631 seconds
Update interval : 2.0 seconds
Leap status     : Normal
In conclusion, the server is now synchronizing with the assigned source every 2 seconds, and we should not encounter similar issues again.
This was the last photo taken before we left Giethoorn, a small yet heavenly village. Hundreds of tiny islands are surrounded by narrow canals and connected by wooden bridges barely longer than a car. Speaking of cars, the village is car-free and people commute by boat or bike. We also loved the thatched-roof houses, which I suppose had been standing there for centuries, along with the wheat fields and the huge reed marshes.
The photo is probably my favorite shot of the past two years, if it beats the foggy-morning one taken in Hallstatt, Austria at the foot of the Alps.
This is a note on Linear Discriminant Analysis (LDA) and an original Regularized Matrix Discriminant Analysis (RMDA) method proposed by Jie Su et al., 2018. Both methods are suitable for efficient multiclass classification, while the latter is a state-of-the-art version of the classical LDA method s.t. data in matrix form can be classified without destroying the original structure.
The plain idea behind Discriminant Analysis is to find the optimal partition (or projection, for higher-dimensional problems) s.t. entities within the same class are distributed as compactly as possible and entities between classes are distributed as sparsely as possible. To derive closed-form solutions we impose various conditions on the covariance matrices of the input data. When we assume the covariances \(\boldsymbol{\Sigma}_k\) are equal for all classes \(k\in\{1,2,\ldots,K\}\), we're following the framework of Linear Discriminant Analysis (LDA).
As shown above, when we consider a 2-dimensional binary classification problem, LDA equivalently finds the optimal direction vector \(\boldsymbol{w}\) s.t. the ratio of \(\boldsymbol{w}^T\boldsymbol{S}_b\boldsymbol{w}\) (the between-class covariance of the projections) to \(\boldsymbol{w}^T\boldsymbol{S}_w\boldsymbol{w}\) (the sum of within-class covariances of the projections) is maximized. Specifically, we define
\[\boldsymbol{S}_b = (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)^T(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)\]
and
\[\boldsymbol{S}_w = \sum_{\boldsymbol{x}\in X_0}(\boldsymbol{x} - \boldsymbol{\mu}_0)^T(\boldsymbol{x} - \boldsymbol{\mu}_0) + \sum_{\boldsymbol{x}\in X_1}(\boldsymbol{x} - \boldsymbol{\mu}_1)^T(\boldsymbol{x} - \boldsymbol{\mu}_1).\]
Therefore, the objective of this maximization problem is
\[J = \frac{\boldsymbol{w}^T\boldsymbol{S}_b\boldsymbol{w}}{\boldsymbol{w}^T\boldsymbol{S}_w\boldsymbol{w}}\]
which is also called the generalized Rayleigh quotient.
The homogeneous objective can be equivalently written as
\[\begin{align}\min_{\boldsymbol{w}}\quad &-\boldsymbol{w}^T\boldsymbol{S}_b\boldsymbol{w}\\\\\text{s.t.}\quad &\boldsymbol{w}^T\boldsymbol{S}_w\boldsymbol{w} = 1\end{align}\]
which, by the method of Lagrange multipliers, gives the solution
\[\boldsymbol{w} = \boldsymbol{S}_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)\]
and the final prediction for new data \(\boldsymbol{x}\) is based on the value of \(\boldsymbol{w}^T\boldsymbol{x}\).
For multiclass classification the solution is similar. Here we present the score function below without derivation:
\[\delta_k = \boldsymbol{x}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \log\pi_k\]
where \(\boldsymbol{\mu}_k\) is the sample mean of all data within class \(k\), and \(\pi_k\) is the fraction of all data in this class. We predict the class with the highest of these \(K\) scores.
We first load necessary packages.
1  %config InlineBackend.figure_format = 'retina' 
Now we define a new class called LDA with a predict (in fact also a predict_prob) method.
1  class LDA: 
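The class listing above is truncated in this export; below is a minimal sketch of what such an LDA classifier can look like, implementing the score \(\delta_k\) described earlier. This is my own reconstruction under the shared-covariance assumption, not necessarily the post's exact code:

```python
import numpy as np

class LDA:
    """Minimal multiclass LDA sketch based on the score
    delta_k(x) = x^T Sigma^-1 mu_k - 1/2 mu_k^T Sigma^-1 mu_k + log pi_k."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        # class means mu_k and log priors log(pi_k)
        self.mus = np.array([X[y == k].mean(axis=0) for k in self.classes])
        self.log_pi = np.log(np.array([np.mean(y == k) for k in self.classes]))
        # pooled within-class covariance (assumed shared across classes)
        idx = np.searchsorted(self.classes, y)
        centered = X - self.mus[idx]
        self.Sigma_inv = np.linalg.inv(
            centered.T @ centered / (len(X) - len(self.classes)))
        return self

    def predict(self, X):
        # quadratic term 1/2 mu_k^T Sigma^-1 mu_k, one value per class
        quad = 0.5 * np.einsum('ki,ij,kj->k', self.mus, self.Sigma_inv, self.mus)
        scores = X @ self.Sigma_inv @ self.mus.T - quad + self.log_pi
        return self.classes[np.argmax(scores, axis=1)]
```

Fitting reduces to computing class means, priors, and one pooled covariance, which is why LDA trains so quickly.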
Then we define three classes of 2D input \(\boldsymbol{X}\) and pass it to the classifier. The original as well as the predicted distributions are plotted, with accuracy printed below.
1  np.random.seed(2) 
Training accuracy: 95.67%
For data with inherent matrix forms, like the electroencephalogram (EEG) data introduced in Jie Su (2018), the classical LDA is not the most appropriate solution since it forcibly requires vector input. To use LDA for classification on such datasets we have to vectorize the matrices, potentially losing some critical structural information. The authors of this paper invented a new method called Regularized Matrix Discriminant Analysis (RMDA) that naturally takes matrix input. Furthermore, noticing that inverting the large matrix \(\boldsymbol{S}_w\) in high dimensions can be computationally burdensome, they adopted the Alternating Direction Method of Multipliers (ADMM) to iteratively optimize the objective instead of the widely-used Singular Value Decomposition (SVD). A graphical representation of the RMDA compared with LDA is as follows.
The algorithm is implemented below. Notice that here I skipped the Gradient Descent (GD) approach for the minimization during iterations and opted for the minimize function in scipy.optimize. I did so to make the structure simpler without hurting the understanding of the whole algorithm. For a more detailed illustration please refer to the original paper.
Again we first define the class RMDA. The predict method now takes a matrix.
1  class RMDA: 
Then we train the model and print the final accuracy.
1  np.random.seed(2) 
Optimization converged successfully.Training accuracy: 87.00%
Further analysis and debugging should be expected. Any corrections in the comments are also welcome. 😇
This is the fifth post on optimal order execution. Based on Almgren and Chriss (2000), today we attempt to estimate the market impact coefficient \(\eta\). Specifically, for highfrequency transaction data, we have the approximation \(dS = \eta\cdot dQ\) and thus can easily estimate it by the method of Ordinary Least Squares (OLS), using the message book data provided by LOBSTER.
We first explore the message book of Apple Inc. (symbol: AAPL) from 09:30 to 16:00 on June 21, 2012.
1  import pandas as pd 
According to the instructions by LOBSTER, the columns of the message book are defined as follows. For the type column: 1 means submission of a new limit order; 2 means cancellation (partial deletion of a limit order); 3 means deletion (total deletion of a limit order); 4 means execution of a visible limit order; 5 means execution of a hidden limit order; 7 means a trading halt indicator (detailed information below). For the direction column: -1 means a sell limit order; 1 means a buy limit order.

1  message = pd.read_csv('data/AAPL_20120621_34200000_57600000_message_1.csv', header=None)
           time  type        id  size   price  direction
0  34200.004241     1  16113575    18  585.33          1
1  34200.025552     1  16120456    18  585.91          1
2  34200.201743     3  16120456    18  585.91          1
3  34200.201781     3  16120480    18  585.92          1
4  34200.205573     1  16167159    18  585.36          1
1  message_plce = message[message.type==1] 
Index(['time_x', 'type_x', 'id', 'size_x', 'price_x', 'direction_x', 'time_y', 'type_y', 'size_y', 'price_y', 'direction_y'], dtype='object')
1  df = message_temp[['id', 'time_x', 'time_y', 'size_y', 'price_x', 'direction_x']] 
(15099, 7)
Here I defined a function impact to calculate the market impact (reflected in the price deviation), such that for each successful execution, we calculate the price change after the same duration as the order's lifetime.
1  def impact(idx): 
1  df['impact'] = [impact(i) for i in df.index] 
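Since the listing above is truncated, here is a hypothetical reconstruction of what `impact` may look like. The column meanings (`time_x` = placement time, `time_y` = execution time, `price_x` = placement price, `direction_x` = side) are my assumptions from the earlier merge, and I pass `df` and `message` explicitly to keep the sketch self-contained (the post's version takes only `idx`):

```python
import numpy as np
import pandas as pd

def impact(idx, df, message):
    """Price deviation observed one order-lifetime after the execution
    (a sketch under assumed column names, not the post's exact code)."""
    row = df.loc[idx]
    duration = row['time_y'] - row['time_x']   # how long the order lived
    horizon = row['time_y'] + duration         # look the same span ahead
    later = message[message.time >= horizon]
    if later.empty:
        return np.nan                          # no data that far ahead
    # signed price change relative to the placement price
    return (later.iloc[0]['price'] - row['price_x']) * row['direction_x']
```

The sign convention multiplies by the order direction so that adverse moves for buys and sells are comparable.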
       0     1     2      3      4       5      6      7      8     9   ...    2452  2453  2454   2455   2456   2457   2458  2459    2460    2461
dQ   1.0  10.0  9.00  40.00  18.00  100.00  18.00  18.00  66.00  18.0  ...  100.00  19.0  10.0  90.00  10.00  40.00  50.00  1.00  100.00  100.00
dS   0.2   0.2  0.03   0.19   0.07    0.09   0.21   0.03   0.05   0.0  ...    0.01   0.0   0.0   0.05   0.05   0.05   0.05  0.08    0.08    0.03

[2 rows x 2462 columns]
1  fig = plt.figure(figsize=(14, 6)) 
1  res = sm.ols(formula='dS ~ dQ + 0', data=df_reg).fit() 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.148
Model:                            OLS   Adj. R-squared:                  0.148
Method:                 Least Squares   F-statistic:                     427.7
Date:                Sat, 12 May 2018   Prob (F-statistic):           1.01e-87
Time:                        14:02:16   Log-Likelihood:                 1535.7
No. Observations:                2459   AIC:                            -3069.
Df Residuals:                    2458   BIC:                            -3064.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ             0.0005   2.37e-05     20.680      0.000       0.000       0.001
==============================================================================
Omnibus:                     2646.045   Durbin-Watson:                   1.287
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           323199.873
Skew:                           5.154   Prob(JB):                         0.00
Kurtosis:                      58.210   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Apparently there are several outliers that result in a low \(R^2\). Here we remove the outliers lying outside three standard deviations.
1  df_reg_no = df_reg[((df_reg.dQ  df_reg.dQ.mean()).abs() < df_reg.dQ.std() * 3) & 
1  fig = plt.figure(figsize=(14, 6)) 
1  res = sm.ols(formula='dS ~ dQ + 0', data=df_reg_no).fit() 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.296
Model:                            OLS   Adj. R-squared:                  0.295
Method:                 Least Squares   F-statistic:                     1005.
Date:                Sat, 12 May 2018   Prob (F-statistic):          1.45e-184
Time:                        14:02:20   Log-Likelihood:                 2470.2
No. Observations:                2397   AIC:                            -4938.
Df Residuals:                    2396   BIC:                            -4933.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ             0.0006   1.97e-05     31.710      0.000       0.001       0.001
==============================================================================
Omnibus:                      356.596   Durbin-Watson:                   1.108
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              567.767
Skew:                           1.012   Prob(JB):                    5.14e-124
Kurtosis:                       4.259   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
So we conclude \(\hat{\eta}_{\text{AAPL}}=0.0006\) for the underlying time span. However, what about other companies? The coefficients are expected to vary widely, which is exactly the worst case we'd like to avoid.
We first define a function estimate to automate what we've done above.
1  def estimate(symbol): 
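Since the listing of `estimate` is truncated here, the sketch below shows one way such a wrapper might look. It is a hypothetical reconstruction: the file-name pattern follows the AAPL example, the impact step is simplified to price changes between consecutive visible executions, and the closed-form through-origin OLS used at the end is numerically equivalent to the `sm.ols('dS ~ dQ + 0', ...)` fit used above:

```python
import pandas as pd

def estimate(symbol, date='20120621'):
    """Estimate the market impact coefficient eta for one symbol
    (simplified sketch, not the post's exact pipeline)."""
    path = f'data/{symbol}_{date}_34200000_57600000_message_1.csv'
    cols = ['time', 'type', 'id', 'size', 'price', 'direction']
    message = pd.read_csv(path, header=None, names=cols)
    executed = message[message.type == 4]            # visible executions only
    # crude impact proxy: signed size vs. price change between executions
    df_reg = pd.DataFrame({
        'dQ': executed['size'] * executed['direction'],
        'dS': executed['price'].diff() * executed['direction'],
    }).dropna()
    # through-the-origin OLS: eta = sum(dQ*dS) / sum(dQ^2)
    return (df_reg.dQ * df_reg.dS).sum() / (df_reg.dQ ** 2).sum()
```

A through-origin fit is appropriate here because the model \(dS = \eta\cdot dQ\) has no intercept by construction.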
The estimation for Microsoft Corp. (symbol: MSFT) is as follows.
1  estimate('MSFT') 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.229
Model:                            OLS   Adj. R-squared:                  0.228
Method:                 Least Squares   F-statistic:                     550.7
Date:                Sat, 12 May 2018   Prob (F-statistic):          7.20e-107
Time:                        14:04:51   Log-Likelihood:                 5732.8
No. Observations:                1859   AIC:                        -1.146e+04
Df Residuals:                    1858   BIC:                        -1.146e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ          1.859e-05   7.92e-07     23.467      0.000     1.7e-05    2.01e-05
==============================================================================
Omnibus:                      201.842   Durbin-Watson:                   0.778
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              381.770
Skew:                           0.703   Prob(JB):                     1.26e-83
Kurtosis:                       4.719   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimation for Amazon.com, Inc. (symbol: AMZN) is as follows.
1  estimate('AMZN') 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.294
Model:                            OLS   Adj. R-squared:                  0.293
Method:                 Least Squares   F-statistic:                     328.9
Date:                Sat, 12 May 2018   Prob (F-statistic):           1.02e-61
Time:                        14:06:56   Log-Likelihood:                 809.19
No. Observations:                 791   AIC:                            -1616.
Df Residuals:                     790   BIC:                            -1612.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ             0.0007   3.74e-05     18.136      0.000       0.001       0.001
==============================================================================
Omnibus:                      141.501   Durbin-Watson:                   1.022
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              250.801
Skew:                           1.083   Prob(JB):                     3.46e-55
Kurtosis:                       4.709   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimation for Alphabet Inc. (symbol: GOOG) is as follows.
1  estimate('GOOG') 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.419
Model:                            OLS   Adj. R-squared:                  0.418
Method:                 Least Squares   F-statistic:                     324.2
Date:                Sat, 12 May 2018   Prob (F-statistic):           5.96e-55
Time:                        14:07:20   Log-Likelihood:                 169.55
No. Observations:                 450   AIC:                            -337.1
Df Residuals:                     449   BIC:                            -333.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ             0.0017   9.57e-05     18.005      0.000       0.002       0.002
==============================================================================
Omnibus:                       48.913   Durbin-Watson:                   1.331
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               61.896
Skew:                           0.864   Prob(JB):                     3.63e-14
Kurtosis:                       3.563   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimation for Intel Corp. (symbol: INTC) is as follows.
1  estimate('INTC') 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.237
Model:                            OLS   Adj. R-squared:                  0.237
Method:                 Least Squares   F-statistic:                     444.2
Date:                Sat, 12 May 2018   Prob (F-statistic):           4.52e-86
Time:                        14:08:47   Log-Likelihood:                 4480.8
No. Observations:                1429   AIC:                            -8960.
Df Residuals:                    1428   BIC:                            -8954.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ          2.275e-05   1.08e-06     21.076      0.000    2.06e-05    2.49e-05
==============================================================================
Omnibus:                      164.136   Durbin-Watson:                   0.716
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              284.351
Skew:                           0.762   Prob(JB):                     1.79e-62
Kurtosis:                       4.566   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In sum, the market impact coefficients are generally significant but do not lead to high \(R^2\) values, which suggests the linear assumption might be too strong. Also, it is noteworthy that \(\hat{\eta}\) does vary widely between companies (let alone industries or equity types), which means we cannot use one estimate as a benchmark for general production usage.
Today we implement the order placement strategy in Almgren and Chriss (2000) s.t. for a certain order size \(Q\), we can estimate the probability of performing the optimal strategy in the paper within a time horizon of \(T\).
In HFT it is tolerable^{[1]} to assume that the stock price evolves according to the discrete-time arithmetic Brownian motion:
\[\begin{cases}dS(t) = \mu dt + \sigma dW(t),\\\\dQ(t) = \dot{Q}(t)dt\end{cases}\]where \(Q(t)\) is the quantity of stock we still need to order at time \(t\). Now let \(\eta\) denote the linear coefficient for temporary market impact, and let \(\lambda\) denote the penalty coefficient for risks. To minimize the cost function
\[C = \eta \int_0^T \dot{Q}^2(t) dt + \lambda\sigma\int_0^T Q(t) dt\]
we have the unique solution given by
\[Q^*(t) = Q\cdot \left(1 - \frac{t}{T^*}\right)^2\]
where \(Q\equiv Q(0)\) is the total and initial quantity to execute, and the optimal liquidation horizon \(T^*\) is given by
\[T^* = \sqrt{\frac{4Q\eta}{\lambda\sigma}}.\]
Here, \(\eta\) and \(\lambda\) are exogenous parameters and \(\sigma\) is estimated from the price time series (see the previous post) within \(K\) time units, given by
\[\hat{\sigma}^2 = \frac{\sum_{i=1}^n (\Delta_i - \hat{\mu}_{\Delta})^2}{(n-1)\tau}\]
where \(\{\Delta_i\}\) are the first-order differences of the stock price using \(\tau\) as the sample period, \(n\equiv\lfloor K / \tau\rfloor\) is the length of the array, and
\[\hat{\mu}_{\Delta} = \frac{\sum_{i=1}^n \Delta_i}{n}.\]
Notice that \(\hat{\sigma}^2\) is proved asymptotically normal with variance
\[Var(\hat{\sigma}^2) = \frac{2\sigma^4}{n}.\]
Now that we know
\[\hat{\sigma}^2 \equiv \frac{16Q^2\eta^2}{\lambda^2 \hat{T}^4} \overset{d}{\to}\mathcal{N}\left(\sigma^2, \frac{2\sigma^4}{n}\right)\]
which yields
\[\frac{16Q^2\eta^2}{\lambda^2\hat{\sigma}^2\hat{T}^4}\overset{d}{\to}\mathcal{N}\left(1, \frac{2}{n}\right),\]
to keep consistency of parameters, with \(n\equiv \lfloor K/\tau\rfloor \to\infty\) we can also write
\[\frac{16Q^2\eta^2}{\lambda^2\hat{\sigma}^2\hat{T}^4}\overset{d}{\to}\mathcal{N}\left(1, \frac{2\tau}{K}\right).\]
with which we can estimate the probability of successful strategy performance. Specifically, the execution strategy is given above, and the expected cost of trading is
\[C^* =\eta \int_0^{T^*} \left(\frac{2Q}{T^*}\left(1 - \frac{t}{T^*}\right)\right)^2 dt + \lambda\sigma\int_0^{T^*} Q\cdot \left(1 - \frac{t}{T^*}\right)^2 dt =\frac{4\eta Q^2}{3T^*} + \frac{\lambda \sigma QT^*}{3} = \frac{4}{3}\sqrt{\eta\lambda\sigma Q^3}.\]
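As a quick numerical sanity check of the closed forms above, the sketch below evaluates \(T^*\) and \(C^*\) and verifies that the two cost terms at \(T^*\) reproduce \(C^*\). The parameter values for \(\eta\), \(\lambda\) and \(\sigma\) are purely illustrative assumptions, not the post's actual inputs:

```python
import numpy as np

Q = 10.0        # total quantity to execute
eta = 0.05      # temporary market-impact coefficient (assumed)
lam = 0.1       # risk-penalty coefficient (assumed)
sigma = 1.5     # price volatility (assumed)

# optimal liquidation horizon T* = sqrt(4 Q eta / (lambda sigma))
T_star = np.sqrt(4 * Q * eta / (lam * sigma))
# minimized expected cost C* = (4/3) sqrt(eta lambda sigma Q^3)
C_star = (4 / 3) * np.sqrt(eta * lam * sigma * Q ** 3)

# sanity check: the two cost terms evaluated at T* sum to C*
C_check = 4 * eta * Q ** 2 / (3 * T_star) + lam * sigma * Q * T_star / 3
assert np.isclose(C_star, C_check)
```

At \(T^*\) the impact term and the risk term contribute equally, which is a convenient way to spot algebra mistakes in the derivation.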
1  import numpy as np 
(1.465147881156472, 0.8431842483948604)
which means there's a probability of 84.3% that we can perform our order placement strategy of size 10 within 3.6405 time units, achieving the minimized trading cost of 1.47 at the optimum.
How do we estimate the parameters of a geometric Brownian motion (GBM)? It seems rather simple, but it actually took me quite some time to solve. The most intuitive way is the method of moments.
First let us consider a simpler case, an arithmetic Brownian motion (ABM). The evolution is given by
\[dS = \mu dt + \sigma dW.\]
By integrating both sides over \((t,t+T]\) we have
\[\Delta \equiv S(t+T) - S(t) = \left(\mu - \frac{\sigma^2}{2}\right) T + \sigma W(T)\]
which follows a normal distribution with mean \((\mu - \sigma^2/2)T\) and variance \(\sigma^2 T\). That is to say, given \(T\) and i.i.d. observations \(\{\Delta_1,\Delta_2,\ldots,\Delta_n\}\) for different \(t\) values^{[1]}, with sample mean
\[\hat{\mu}_{\Delta} = \frac{\sum_{i=1}^n\Delta_i}{n}\overset{p}{\to}\left(\mu - \frac{\sigma^2}{2}\right)T\]
and modified sample variance
\[\hat{\sigma}_{\Delta}^2 = \frac{\sum_{i=1}^n (\Delta_i - \hat{\mu}_{\Delta})^2}{n-1} \overset{p}{\to} \sigma^2 T,\]
we have unbiased estimator for \(\mu\)
\[\hat{\mu} = \frac{2\hat{\mu}_{\Delta} + \hat{\sigma}_{\Delta}^2}{2T}\]
and for \(\sigma^2\) we have
\[\hat{\sigma}^2 = \frac{\hat{\sigma}_{\Delta}^2}{T}.\]
Now we prove the consistency. First we consider the variance of \(\hat{\mu}_{\Delta}\)
\[Var(\hat{\mu}_{\Delta}) = \frac{Var(\Delta_1)}{n} = \frac{\sigma^2 T}{n}\]
and the variance of \(\hat{\sigma}_{\Delta}^2\)
\[Var(\hat{\sigma}_{\Delta}^2) =E(\hat{\sigma}_{\Delta}^4) - E(\hat{\sigma}_{\Delta}^2)^2 =\frac{n E[(\Delta_1-\hat{\mu}_{\Delta})^4] + n(n-1) E[(\Delta_1-\hat{\mu}_{\Delta})^2]^2}{(n-1)^2} - \sigma^4T^2 =\frac{2\sigma^4T^2}{n}.\]
The variance of \(\hat{\mu}\) is therefore given by
\[Var(\hat{\mu}) =\frac{4Var(\hat{\mu}_{\Delta}) + Var(\hat{\sigma}_{\Delta}^2)}{4T^2} =\frac{\sigma^2 (2 + \sigma^2T)}{2nT}\]
and the variance of \(\hat{\sigma}^2\) is given by
\[Var(\hat{\sigma}^2) =\frac{Var(\hat{\sigma}_{\Delta}^2)}{T^2} =\frac{2\sigma^4}{n}.\]
So the two estimators are also both consistent. It should be noticed that there exists a certain "trade-off" between the efficiencies of \(\hat{\mu}_{\Delta}\) and \(\hat{\sigma}_{\Delta}^2\) as we vary the value of \(T\).
For a general GBM with drift \(\mu\) and diffusion \(\sigma\), we have the SDE
\[\frac{dS}{S} = \mu dt + \sigma dW,\]
so we can integrate^{[2]} both sides over \((t,t+T]\) for any \(t\) and get
\[\Delta \equiv \ln S(t+T) - \ln S(t) = \left(\mu - \frac{\sigma^2}{2}\right) T + \sigma W(T).\]
The rest of the derivation is exactly the same.
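As a small code sketch of these method-of-moments estimators (my own illustration; `estimate_gbm` is a hypothetical helper name, not from the post):

```python
import numpy as np

def estimate_gbm(S, T):
    """Method-of-moments estimators for GBM (mu, sigma^2) from log-price
    increments sampled every T time units, following the formulas above."""
    delta = np.diff(np.log(S))        # Delta_i = ln S(t+T) - ln S(t)
    mu_d = delta.mean()               # converges to (mu - sigma^2/2) T
    var_d = delta.var(ddof=1)         # converges to sigma^2 T
    sigma2_hat = var_d / T
    mu_hat = (2 * mu_d + var_d) / (2 * T)
    return mu_hat, sigma2_hat
```

Note that the drift estimate adds back the \(\sigma^2/2\) correction through \(\hat{\sigma}_{\Delta}^2\), exactly as in the unbiased estimator derived above.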
Now we numerically validate this against Monte Carlo simulation.
1  import numpy as np 
Statistics       Monte Carlo   Method of moments  P value
E(mu_hat)        1.994533e-03  2.000000e-03       0.222191
Var(mu_hat)      4.010866e-07  3.924000e-07
E(sigma2_hat)    3.596733e-03  3.600000e-03       0.201573
Var(sigma2_hat)  1.308537e-07  1.296000e-07
Now we may safely apply these estimators in practice.
Here I'm trying to write something partly based on Cont's first model from the previous post. I plan to skip the Laplace transform and go for Monte Carlo simulation. Also, I'm trying to abandon the assumption of uniform order sizes. To implement that, I need to shift from a Markov chain, which is supported on discrete spaces, to some other stochastic process that can be estimated. Moreover, although I actually considered supervised learning for this problem, I gave it up in the end. This is because my model is inherently designed for high-frequency trading, and thus training for several minutes each time would be intolerable.
1  import smm 
I need smm for multivariate stochastic processes, and scipy.optimize for maximum likelihood estimation.
1  def retrieve_data(date): 
   time                              ask_price_1  ask_price_10  ask_price_100  ...  bid_vol_97  bid_vol_98  bid_vol_99
1  2018-01-29 00:00:06.951631+08:00     12688.00      12663.58       12391.48  ...         1.0         5.0       120.0
2  2018-01-29 00:00:07.792882+08:00     12676.93      12657.04       12391.48  ...       460.0         4.0       121.0
3  2018-01-29 00:00:08.702945+08:00     12643.27      12617.26       12361.27  ...         1.0         5.0       120.0
4  2018-01-29 00:00:10.998615+08:00     12666.00      12642.73       12380.00  ...       150.0        12.0        97.0
5  2018-01-29 00:00:11.742304+08:00     12674.00      12643.27       12384.22  ...       150.0        12.0        97.0
A larger index means smaller values for both bid and ask prices. That's uncommon, so here I reindexed the variables s.t. bid_1 and ask_1 correspond to the best prices on each side of the book.
1  def rename_index(s): 
1  variables = list(data.columns[1:]) 
I dropped the time variable simply because I don't know how to use it. Normally there are two ways to handle uneven time grids: resampling and ignoring, and I chose the latter.
1  def plot_lob(n, t, theme='w'): 
Now we make a plot of the order book within the past 10 steps, including 20 bid levels and 20 ask levels.
1  n, t = 20, 10 
Not sure if it tells any critical information. Let's make another plot. This time \(t=500\) and we only consider the best bid and ask orders.
1  fig = plt.figure(figsize=(12, 6)) 
1  price = data[[f'bid_price_{i}' for i in range(n,0,1)] + [f'ask_price_{i}' for i in range(1,n+1)]] 
A simple idea would be to input the prices and volumes of the current order book and predict future mid prices. Furthermore, it would be ideal to have a rough expectation of the minimum time for the mid price to cross a certain level, or the expected time before my order gets executed successfully.
1  change = [] 
The calculation of change took over 10 minutes. I don't think that's going to be usable in real work. However, it's not so bad an idea to save it somewhere locally in case I need it later.
1  change = pd.DataFrame(np.array(change).astype(int), columns=vol.columns) 
1  change = pd.read_csv(f'data/change_{date}.csv', index_col=0) 
After some research, I decided to fit the data in change to Student's t-distribution, the Skellam distribution, and a two-sided Weibull distribution. I'll now explain below why I chose each distribution and how to estimate it.
First is the t-distribution. It is well known for its leptokurtosis, which suits many financial time series as a better alternative to the Normal distribution. The PDF and CDF of the t-distribution involve the Gamma function and would thus be computationally troublesome when we want to calculate the MLE of the parameters. However, noticing that for any r.v. \(X\sim t(\nu,\mu,\sigma)\) we have the relationships
\[\text{Var}(X) = \begin{cases}\frac{\nu}{\nu  2} & \text{for }\nu > 2,\\\infty & \text{for }1 < \nu \le 2,\\\text{undefined} & \text{otherwise}\end{cases}\]
and
\[\text{Kur}_+(X) = \begin{cases}\frac{6}{\nu  4} & \text{for }\nu > 4,\\\infty & \text{for }2 < \nu \le 4,\\\text{undefined} & \text{otherwise}\end{cases}\]
where \(\text{Kur}_+\equiv \text{Kur} - 3\) is the excess kurtosis, we can simply go for moment estimation of the t-distribution using the empirical variance or kurtosis.
Second, the Skellam distribution. This is mainly due to the original model used in Cont's paper, where he assumes Poisson order arrivals uniform over time. Here I slightly improve the model s.t. bid and ask orders are modelled at the same time and represented by the r.v. \(S\equiv P_a - P_b\) where \(P_a\sim Pois(\lambda_a)\) and \(P_b\sim Pois(\lambda_b)\). This is therefore a discrete distribution with two parameters. scipy.stats has its PMF implemented, and all I need to do is numerically maximize the likelihood.
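A minimal sketch of that MLE, using `scipy.stats.skellam` and a generic numerical optimizer (`fit_skellam` is a hypothetical helper name):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import skellam

def fit_skellam(x):
    """MLE for the Skellam parameters (lambda_a, lambda_b), obtained by
    numerically minimizing the negative log-likelihood (a sketch)."""
    nll = lambda p: -skellam.logpmf(x, p[0], p[1]).sum()
    # keep both intensities strictly positive via box bounds
    res = minimize(nll, x0=[1.0, 1.0], bounds=[(1e-6, None)] * 2)
    return res.x
```

Because the Skellam mean is \(\lambda_a-\lambda_b\) and the variance is \(\lambda_a+\lambda_b\), the two parameters are well identified and the optimizer converges easily from a generic starting point.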
For the two-sided Weibull distribution, it is given by
\[Y \sim \begin{cases}-\text{Weibull}(\lambda_1, k_1) & \text{if } Y < 0,\\\text{Weibull}(\lambda_2, k_2) & \text{otherwise}\end{cases}\]
where the shape parameters \(k_{1,2} > 0\) and the scale parameters \(\lambda_{1,2} > 0\).
Therefore, the pdf is
\[f(y \mid \lambda_1, k_1, \lambda_2, k_2) = \begin{cases}\left(\frac{-y}{\lambda_1}\right)^{k_1 - 1}\exp\left(-\left(\frac{-y}{\lambda_1}\right)^{k_1}\right) & \text{if } y < 0,\\\left(\frac{y}{\lambda_2}\right)^{k_2 - 1}\exp\left(-\left(\frac{y}{\lambda_2}\right)^{k_2}\right) & \text{otherwise}\end{cases}\]
and to normalize the integration to \(1\), we also have
\[\frac{\lambda_1}{k_1} + \frac{\lambda_2}{k_2} = 1 \Rightarrow \lambda_2 = k_2 \left(1 - \frac{\lambda_1}{k_1}\right)\]
which means there're in fact only three parameters to estimate.
Now we rewrite the log-likelihood as
\[\begin{align}LL = \sum_{i=1}^n \log f(y_i) = \sum_{i=1}^n &\left((k_1-1)(\log^*(-y_i) - \log^*(\lambda_1)) - (-y_i / \lambda_1)^{k_1}\right)\mathbb{I}_{y_i < 0} + \\\\ &\left((k_2-1)(\log^*(y_i) - \log^*(\lambda_2)) - (y_i / \lambda_2)^{k_2}\right)\mathbb{I}_{y_i \ge 0}\end{align}\]
where we define the special \(\log^*(y)\equiv 0\) if \(y\le0\).
1  i = 15 # take ask_15 for example 
As coded above, in the end I didn't include the two-sided Weibull distribution because the optimization did not converge. In conclusion, for changes of order sizes (denoted by \(x\)), we use a modified t-distribution with
\[\hat{\mu} = \bar{x},\quad \hat{\sigma} = 0.3 \cdot \sqrt{\widehat{\text{Var}}(x)} + 0.7 \cdot \sqrt{2 - \frac{6}{6 + 2\,\widehat{\text{Kur}}_+(x)}}\]
and
\[\hat{\nu} = \frac{6}{\widehat{\text{Kur}}_+(x)} + 4\]
where
\[\widehat{\text{Kur}}_+(x) = \widehat{\text{Kur}}(x) - 3\]
while
\[\widehat{\text{Kur}}(x) = \hat{m}_4(x) / \hat{m}_2^2(x)\]
and
\[\hat{m}_4 = \sum_{i=1}^n (x_i - \bar{x})^4 / n,\quad \hat{m}_2 = \sum_{i=1}^n (x_i - \bar{x})^2 / n.\]
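These moment estimators can be sketched in a few lines of code (`fit_t_moments` is a hypothetical helper name; the blended scale estimate follows the formula above):

```python
import numpy as np

def fit_t_moments(x):
    """Estimate (mu, sigma, nu) for the modified t-distribution by the
    method of moments described above (a sketch)."""
    xbar = x.mean()
    m2 = ((x - xbar) ** 2).mean()
    m4 = ((x - xbar) ** 4).mean()
    kur_plus = m4 / m2 ** 2 - 3               # excess kurtosis
    nu_hat = 6.0 / kur_plus + 4.0             # requires kur_plus > 0
    # the post's blended scale estimate
    sigma_hat = 0.3 * np.sqrt(m2) + 0.7 * np.sqrt(2 - 6 / (6 + 2 * kur_plus))
    return xbar, sigma_hat, nu_hat
```

Note the degrees-of-freedom estimate is only meaningful when the sample shows positive excess kurtosis, i.e. heavier tails than the Normal.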
Now, when we assume independence across different buckets of the order book, we can estimate the parameters of the t-distributions as below.
1  params = np.zeros([2 * n, 3]) 
array([[ 5.1589201 ,  0.52536232, 11.05729   ],
       [ 5.86412495,  0.61454545, 12.08484143],
       [ 5.82376701,  4.61231884, 11.67543236],
       [ 6.28819815,  0.7173913 , 10.85941723],
       [ 6.89178374,  1.59927798, 11.25140225],
       [ 6.14231284,  2.29856115, 12.46686452],
       [ 6.4347771 ,  2.22302158, 13.73785226],
       [ 6.17737187,  0.67753623, 12.19098061],
       [ 5.9250571 ,  1.68231047, 12.54472066],
       [ 5.16886809,  0.69090909, 11.94199489],
       ...
       [ 5.94772822,  3.18181818, 12.4415555 ],
       [ 6.5157695 ,  4.62181818, 13.67098387],
       [ 6.69385395,  0.66304348, 13.63770319],
       [ 4.99329442,  1.11510791, 11.63780506],
       [ 5.04144977,  1.91756272, 11.20026029],
       [ 5.47054269,  4.34163701, 10.66971035],
       [ 5.11684414,  2.35460993,  9.98656422],
       [ 4.89130697,  1.07092199, 11.5511127 ],
       [ 5.31202782,  0.58865248, 11.01769165],
       [ 5.17908162,  2.16961131, 10.81368767]])
When we do not ignore the correlation across buckets, a multivariate t-distribution must be considered. Similar to multivariate Normal distributions, here we need to estimate a covariance matrix, a vector of expectations, and a vector of degrees of freedom. Noticing that the degrees of freedom do not vary significantly across the rows in params, I set a unified degree of freedom for all buckets to accelerate computation, namely \(df = 7\). Using the Expectation Maximization (EM) algorithm introduced by D. Peel and G. J. McLachlan (2000), I wrote the model below to estimate this distribution.
```python
class MVT:
```
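For reference, here is a stripped-down sketch of the EM iteration for a multivariate t with the degrees of freedom held fixed, in the spirit of Peel and McLachlan (2000); the function name and the moment-based initialization are my own choices:

```python
import numpy as np

def fit_mvt(X, df=7.0, n_iter=50):
    """EM for a multivariate t-distribution with fixed degrees of freedom.

    E-step: latent precision weights u_i = (df + d) / (df + mahalanobis_i).
    M-step: weighted mean and weighted scatter matrix."""
    n, d = X.shape
    mu, sigma = X.mean(axis=0), np.cov(X.T)          # moment-based initialization
    for _ in range(n_iter):
        diff = X - mu
        maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
        u = (df + d) / (df + maha)                   # E-step
        mu = (u[:, None] * X).sum(axis=0) / u.sum()  # M-step: location
        diff = X - mu
        sigma = (u[:, None, None] * np.einsum("ij,ik->ijk", diff, diff)).sum(axis=0) / n
    return mu, sigma
```

Because heavier-tailed observations receive smaller weights \(u_i\), the location estimate is robust to outliers, unlike the plain sample mean.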
Now that the distribution of order-size movements is estimated, we can simulate trajectories and rebuild the order book several steps into the future. Note that a predicted movement may well change the shape of the order book, whereas in practice the book retains its "V"-shape most of the time. Therefore, I separately re-sort both halves of the order book every time they are updated by a predicted order-size movement (or "co-movement", since it is a vector).
```python
n_steps = 20
```
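The re-sorting trick can be illustrated on a toy half-book. The array name, sizes and sampled co-movement below are all hypothetical, and I assume sizes grow with distance from the touch, so the vertex of the "V" sits at the mid price:

```python
import numpy as np

rng = np.random.default_rng(0)
n_buckets = 10
# index 0 = bucket at the touch; sizes grow away from the mid (the "V"-shape)
bids = np.sort(rng.integers(50, 200, n_buckets)).astype(float)

def step(half, move):
    """Apply one predicted co-movement to one half of the book,
    floor the sizes at zero, then re-sort to restore the V-shape."""
    half = np.maximum(half + move, 0.0)
    return np.sort(half)  # smallest sizes back at the touch

bids = step(bids, 20.0 * rng.standard_t(7, n_buckets))
```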
Below is a simple sketch of this order book trajectory where I assign stronger color to the traces that are closer to the best (bid/ask) prices.
```python
fig = plt.figure(figsize=(12, 6))
```
It can be seen from the figure that stronger traces are located more to the bottom, which validates our intuition since trades around the current price are more active than those to the left or the right of the order book.
With this prediction procedure implemented, we can estimate the probability of our order (placed at the price bucket `order_idx` with size `order_size`) being executed within `n_steps`.
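The mechanics behind such an estimate can be sketched as a small Monte Carlo loop. The fill rule below (cumulative traded volume at our bucket exceeding the queue ahead of us plus our own size) and the per-step volume distribution are deliberate simplifications of the actual simulator, with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)

def exec_prob(queue_ahead, order_size, n_steps=10, n_sim=10_000):
    """Estimate P(order filled within n_steps).

    Each step draws a traded volume at our price bucket from a
    heavy-tailed distribution (a stand-in for the fitted t model);
    the order fills once cumulative volume covers the queue ahead
    of us plus our own size."""
    fills = 0
    for _ in range(n_sim):
        traded = np.abs(10.0 * rng.standard_t(7, n_steps)).cumsum()
        if traded[-1] >= queue_ahead + order_size:
            fills += 1
    return fills / n_sim
```

As one would expect, the estimated probability decreases in the queue ahead of us and increases in the number of steps we are willing to wait.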
```python
n_steps = 10
```
```
0.861
```
So a limit buy order at `bid_8` (\(20 - 12 = 8\)) with size 100 can be executed within 10 steps with a probability of 86.1%. Moreover, we can even make a 3D surface plot to get a comprehensive view of the whole distribution.
```python
def evolve(order_idx, order_size, n_steps=10, n_sim=1000):
```
Today, I'll continue introducing papers about optimal order execution. In this post I'll mainly walk through several papers by Rama Cont, published between 2010 and 2018. Professor Cont is renowned for his in-depth research in stochastic analysis, stochastic processes and mathematical modeling in quantitative finance. He has written dozens of papers on order book dynamics, building rigorous mathematical models.
In this classic paper, the authors model a real-world order book as a discrete-time Markov chain. The order book is evenly divided into several price buckets, where order sizes are recalculated s.t. positive sizes represent ask orders and negative sizes represent bid orders. Let's denote this order book by \(\boldsymbol{x}\in\mathbb{Z}^n\). Also, let \(\boldsymbol{x}_{p\pm 1} \equiv \boldsymbol{x} \pm \boldsymbol{e}^p\), where \(\boldsymbol{e}^p\in\mathbb{Z}^n\) is the \(p\)-th base vector. Denote the best ask and bid prices by \(p^a\) and \(p^b\). By assuming unit-sized orders^{[1]} and conditioning on the inflow of new orders, the Markov state transitions can be described as below:
Furthermore, the authors assumed stationary Poisson arrivals for these inflows in each bucket. The arrival rate for limit orders, \(\lambda(p)\), is an increasing function of \(p\) when \(p\) is below the current price and a decreasing function when \(p\) is above it. The arrival rate for market orders is assumed to be a constant \(\mu\), and the arrival rate for order cancellations is, by assumption, proportional to the current order size in the underlying bucket of the book.
Therefore, we have
The empirical performance of onestep ahead prediction is illustrated below.
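One step of this order-flow model is easy to simulate. The power-law shape chosen for \(\lambda(p)\) and every parameter value below are illustrative assumptions, not the estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n_buckets, dt = 10, 1.0
k, alpha, mu_m, theta = 1.85, 0.7, 0.9, 0.2          # hypothetical parameters
book = rng.integers(1, 30, n_buckets)                # ask-side queue sizes

dist = np.arange(1, n_buckets + 1)                   # distance from the current price
lam = k / dist ** alpha                              # limit orders arrive faster near the price
book = book + rng.poisson(lam * dt)                  # new limit orders per bucket
book = np.maximum(book - rng.poisson(theta * book * dt), 0)  # cancellations proportional to queue size
book[0] = max(book[0] - rng.poisson(mu_m * dt), 0)   # market orders deplete the best queue
```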
It is easy to recognize that the underlying random walk is a birth-death process. Hence, we may opt for Laplace transforms to calculate the first-passage times of our model, i.e. the time before our order is successfully executed given that the mid price hasn't moved.
In this paper, Cont tried to model ultra-high-frequency (UHF) order books using fluid and diffusion models.
| Regime | Time scale | Issues |
| --- | --- | --- |
| Ultra-high frequency | \(\sim 10^{-3} - 1\) s | Microstructure, latency |
| High frequency | \(\sim 10 - 10^2\) s | Optimal execution |
| Daily | \(\sim 10^3 - 10^4\) s | Trading strategies, option hedging |
By going from UHF data to even more idealized data, where we assume the tick size \(\to 0\), the order arrival frequency \(\to\infty\) and the order size \(\to 0\), we may apply various asymptotic theorems to analyze the order book dynamics in this extreme case. Different combinations of scaling assumptions are possible for the same process and may lead to very different limits. Specifically, when we assume that the variance vanishes asymptotically, the limit process is deterministic and often given by a PDE or an ODE. This functional analogue of the Law of Large Numbers is called the "fluid" or "hydrodynamic" limit, e.g.
\[\lambda_n^i\sim n\lambda^i,\quad \left(\frac{N_1^n - N_2^n}{n}, t\ge 0\right) \overset{n\to\infty}{\to} ((\lambda^1 - \lambda^2)t,\ t\ge 0).\]
Other scaling assumptions can lead to a totally different limit, e.g. the "random" or "diffusion" limit:
\[\lambda_n^i\sim n\lambda, \quad \lambda_n^1 - \lambda_n^2 = \sigma^2\sqrt{n},\quad \left(\frac{N_1^n - N_2^n}{\sqrt{n}}\right)\overset{n\to\infty}{\to}\sigma W.\]
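The fluid scaling is easy to check numerically: multiplying the arrival rates by \(n\) and dividing the count difference by \(n\) concentrates it around the deterministic limit \((\lambda^1 - \lambda^2)t\). A quick sketch with arbitrary rates:

```python
import numpy as np

rng = np.random.default_rng(3)
lam1, lam2, t = 2.0, 1.5, 1.0

for n in (10, 1_000, 100_000):
    n1 = rng.poisson(n * lam1 * t)     # N_1^n(t)
    n2 = rng.poisson(n * lam2 * t)     # N_2^n(t)
    print(n, (n1 - n2) / n)            # approaches (lam1 - lam2) * t = 0.5
```

The fluctuations around the limit shrink like \(1/\sqrt{n}\), which is exactly what the diffusion scaling magnifies back into a Brownian motion.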
Similar to the first paper, here Cont and de Larrard model the order book as a Markov chain where limit orders, market orders and order cancellations arrive following stationary Poisson processes. Specifically, in this paper the arrival rate of limit orders is constant, an assumption made in the hope of deriving closed-form results analytically. Between consecutive price changes, the queue lengths \(q_t^a\) and \(q_t^b\) are independent birth-death processes with birth rate \(\lambda\) and death rate \(\mu+\theta\). Define \(\sigma^a\) as the first-passage time of the process \(q_t^a\), and define \(\sigma^b\) similarly. Then the time until the next price move is given by \(\tau = \min\\{\sigma^a, \sigma^b\\}\).
The conditional Laplace transform of \(\sigma^a\) solves
\[\mathcal{L}(s, x) = \text{E}(\exp(-s\sigma^a)\mid q_0^a = x) = \frac{\lambda \mathcal{L}(s, x+1) + (\mu+\theta)\mathcal{L}(s,x-1)}{\lambda+\mu+\theta+s}\]
which eventually gives
\[\mathcal{L}(s, x) = \left(\frac{(\lambda + \mu + \theta + s) - \sqrt{(\lambda + \mu + \theta + s)^2 - 4 \lambda (\mu + \theta)}}{2\lambda}\right)^x.\]
The distribution of \(\tau\) conditional on the current queue length is
\[\text{P}(\tau > t\mid q_0^a = x, q_0^b = y)= \text{P}(\sigma^a > t\mid q_0^a = x) \text{P}(\sigma^b > t\mid q_0^b = y)= \int_t^{\infty} \hat{\mathcal{L}}(u, x) du\int_t^{\infty} \hat{\mathcal{L}}(u, y) du\]
where the inverse Laplace transform \(\hat{\mathcal{L}}\) is given by
\[\hat{\mathcal{L}}(t, x) = \frac{x}{t} \sqrt{\left(\frac{\mu + \theta}{\lambda}\right)^x}\, I_x\left(2t\sqrt{\lambda(\mu + \theta)}\right)\exp(-t(\lambda + \mu + \theta)).\]
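The inverse transform can be evaluated numerically with SciPy's exponentially scaled Bessel function `ive`, which avoids overflow at large \(t\). As a sanity check, under my reading of the formula its total mass should equal the probability \(\min(1, ((\mu+\theta)/\lambda)^x)\) that the queue ever empties; the rates below are arbitrary:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import ive

lam, mu, theta = 1.5, 0.7, 0.3      # birth rate and total death rate (arbitrary)

def first_passage_density(t, x):
    """hat{L}(t, x): first-passage-time density to 0 from queue length x.
    Uses ive (scaled Bessel I) for numerical stability at large t."""
    z = 2.0 * t * np.sqrt(lam * (mu + theta))
    return (x / t) * ((mu + theta) / lam) ** (x / 2.0) \
        * ive(x, z) * np.exp(z - t * (lam + mu + theta))

x = 3
mass, _ = quad(first_passage_density, 0, np.inf, args=(x,))
# mass should be close to ((mu + theta) / lam) ** x, since lam > mu + theta here
```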
In the case of heavy-traffic queueing systems, order books tend to exhibit "diffusive" dynamics. In the most extreme scenario, \(\lambda = \mu + \theta\), we have
\[\left(\frac{s_{tn\log n}}{\sqrt{n}}\right)_{t\ge 0} \overset{d}{\to} \sqrt{\frac{\pi \lambda \delta^2}{D(f)}}\,B\]
where \(B\) is a Brownian motion, and
\[D(f)\equiv \left(\int_{\mathbb{R}_+^2} xy\ dF(x, y)\right)^{1/2}\]
is the geometric mean of the bid and ask queue lengths. This directly gives a diffusion process with variance
\[\sigma^2 = \delta^2 \frac{\pi\lambda}{D(f)}.\]
Interestingly, the formula does not require observing the stock price to estimate its volatility. Instead, all the information it needs comes from order flow statistics: the arrival rate \(\lambda\) and the queue-length statistic \(D(f)\).
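Plugging in numbers is then a one-liner; every value below is made up purely for illustration:

```python
import math

delta = 0.01   # tick size (hypothetical)
lam = 50.0     # order arrival rate per unit time (hypothetical)
D_f = 400.0    # queue-length statistic D(f) (hypothetical)

sigma2 = delta ** 2 * math.pi * lam / D_f   # variance of the diffusion limit
sigma = math.sqrt(sigma2)
```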
Deep learning is leading the fashion in academia. At the just-finished 6th Imperial-ETH Workshop in Mathematical Finance, Cont introduced the Long Short-Term Memory (LSTM) network he built with Sirignano, which is claimed to outperform a range of other well-studied mathematical models (see below for the network structure). The model takes historical order book states as input and predicts the next price moves. Specifically, they used historical data from approximately 1,000 stocks traded on NASDAQ and trained the network asynchronously on over 500 GPUs. Results show a significant improvement in prediction accuracy from introducing long-term memory into the model and, moreover, a tendency toward universal effectiveness even for stocks out of the sample.
```shell
# create virtualenv myenv
```
It is supported by the package `hexo-filter-flowchart`:
```shell
npm install --save hexo-filter-flowchart
```
You can configure your flowchart layout in the site's `_config.yml`:
```yaml
flowchart:
```
Raw code like

```
s=>start: log in; counter := 0
```

in a `flow` code block produces a nice flowchart as below: