A couple of months ago I was asked the following question during an interview (for proprietary concerns I'm not gonna disclose the industry or name of the company): \(\newcommand{R}{\mathbb{R}} \newcommand{E}{\text{E}} \newcommand{bs}{\boldsymbol} \newcommand{N}{\mathbb{N}}\)
Assume \(k\), \(n\in\N\) and \(k < n\). For a uniformly chosen subspace \(\R^k\subsetneq\R^n\) we define the orthogonal projection as \(P:\R^n\mapsto\R^n\). Find \(\E[P(\bs{v})]\) where \(\bs{v}\in\R^n\) is given.
It's an interesting question, and a totally novel one to me at the time. How do we define a "uniformly" chosen subspace and its corresponding projection? What intuitions hide in this simple-looking question? Despite busy schoolwork and student projects, these thoughts persisted in my mind and drove me to dig into the question from time to time. Curiosity had been aroused, and an appetite is meant to be satisfied.
In order to solve this problem, we need to fully understand what's being asked. We have two nonnegative integers \(k<n\) and two spaces, namely \(\R^n\) and \(\R^k\). We know \(\R^k\) is somehow randomly selected as a subspace of \(\R^n\), and this randomness is uniform. For each such selection, we can orthogonally project the given point^{[1]} \(\bs{v}\) onto \(\R^k\). Note that the projection is defined from \(\R^n\) to \(\R^n\), which means we're not interested in the coordinates of the projection within \(\R^k\) but in the projection itself. In other words, we're focusing on the projected vector's behavior in the same space as \(\bs{v}\).
The simplest example (well, it's in fact not THE simplest as we could always project \(\bs{v}\) onto \(\R^0\) and the resulting expectation would be a zero vector) would be \(n=2\) and \(k=1\). For any given \(\bs{v}\), we can always draw a graph as below.
The random subspace in this case is illustrated by the straight gray line, which determines the projection \(P(\bs{v})=\bs{h}\) as in the graph. We select this subspace uniformly when we rotate the line about the origin with a uniformly distributed angle. This means the angle between \(\bs{v}\) and \(\bs{h}\), denoted by \(\theta\), is a uniform random variable on \([0, 2\pi)\). Further, simple geometry tells us the angle between \(\bs{v}\) and \(\bs{h} - \bs{v}/2\) is merely \(2\theta\), which is therefore also uniformly distributed (modulo \(2\pi\)) on \([0, 4\pi)\). Now that we know \(\bs{h}\) lies uniformly on the red circle centered at \(\bs{v}/2\), we conclude the expected projection, in this particular case, is \(\bs{v}/2\).
While it gets geometrically difficult to imagine, not to mention draw, the cases of larger \(n\) and \(k\), this example gives us a pretty nice guess:
\[\E[P(\bs{v})] = \frac{k}{n}\bs{v}.\]
Can we prove it in higher dimensions and general cases?
Proof. Now we try to prove that our previous statement is true. For any orthonormal basis^{[2]} \(\bs{e}=(\bs{e}_1,\bs{e}_2,\dots,\bs{e}_n)\in\R^{n\times n}\), we uniformly choose a subset \((\bs{e}_{n_1}, \bs{e}_{n_2},\dots,\bs{e}_{n_k})\) of size \(k\) and define \(\R^k\) as its span. The projected coefficient on any basis vector \(\bs{e}_j\) is \(\bs{e}_j'\bs{v}\) and the corresponding vector component is \(\bs{e}_j\bs{e}_j'\bs{v}\). Therefore, the orthogonal projection of \(\bs{v}\) is given by
\[P(\bs{v}) = \sum_{j=1}^{k}\bs{e}_{n_j}\bs{e}_{n_j}'\bs{v} = \bs{eDe'v} \in \R^n\]
where we define the random matrix \(\bs{D}\) to be a diagonal matrix with \(k\) ones and \((n-k)\) zeros on its diagonal. The diagonal entries are not independent, but each has the same expectation, namely \(k/n\). The expectation of the projection, therefore, is
\[\E[P(\bs{v})] = \E[\bs{eDe'v}] = \E\{\E[\bs{eDe'v}\mid \bs{e}]\} = \E\{\bs{e}\E[\bs{D}]\bs{e'v}\} = \frac{k}{n}\E[\bs{ee'}]\bs{v}\]
where we used the tower rule^{[3]}. Now, noticing that for any \(\bs{e}\) it always holds that \(\bs{e'e}=\bs{I}\), we have
\[(\bs{ee'})^2 = \bs{ee'ee'} = \bs{e(e'e)e'} = \bs{eIe'} = \bs{ee'}\] and, since \(\bs{e}\) is square with orthonormal columns and hence invertible, \(\bs{ee'}\) is an invertible idempotent matrix, so \(\bs{ee'} = \bs{I}\),
and thus we may finally conclude
\[\E[P(\bs{v})] = \frac{k}{n}\E[\bs{ee'}]\bs{v} = \frac{k}{n}\bs{v}\]
which exactly coincides with our previous guess. Q.E.D.
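The result is also easy to check numerically. Below is a pure-Python Monte Carlo sketch (all function and variable names are mine): a uniformly distributed \(k\)-dimensional subspace is generated by orthonormalizing i.i.d. Gaussian vectors, a construction that is rotation-invariant and hence uniform.

```python
import math
import random

random.seed(7)

def random_projection(v, k):
    """Project v onto a uniformly chosen k-dimensional subspace of R^n.

    The subspace is the span of k orthonormalized standard Gaussian
    vectors; by rotation invariance of the Gaussian, it is uniform.
    """
    n = len(v)
    basis = []
    while len(basis) < k:
        g = [random.gauss(0.0, 1.0) for _ in range(n)]
        # Gram-Schmidt step against the basis built so far
        for e in basis:
            c = sum(gi * ei for gi, ei in zip(g, e))
            g = [gi - c * ei for gi, ei in zip(g, e)]
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm > 1e-12:  # retry in the (practically impossible) degenerate case
            basis.append([gi / norm for gi in g])
    # P(v) = sum_j e_j (e_j' v)
    proj = [0.0] * n
    for e in basis:
        c = sum(ei * vi for ei, vi in zip(e, v))
        proj = [pi + c * ei for pi, ei in zip(proj, e)]
    return proj

n, k = 5, 2
v = [1.0, -2.0, 0.5, 3.0, 0.0]
trials = 20000
avg = [0.0] * n
for _ in range(trials):
    for i, pi in enumerate(random_projection(v, k)):
        avg[i] += pi / trials
# avg should be close to (k/n) * v = 0.4 * v
```

With \(n=5\) and \(k=2\), the averaged projection lands within Monte Carlo error of \(0.4\,\bs{v}\).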
To be continued. Possible extensions concern dimensionality reduction, etc.
These are the lecture notes on foreign exchange market and theories. \(\newcommand{\E}{\text{E}} \newcommand{\P}{\text{P}} \newcommand{\Q}{\text{Q}} \newcommand{\F}{\mathcal{F}} \newcommand{\d}{\text{d}} \newcommand{\N}{\mathcal{N}} \newcommand{\eeq}{\ \!=\mathrel{\mkern3mu}=\ \!} \newcommand{\eeeq}{\ \!=\mathrel{\mkern3mu}=\mathrel{\mkern3mu}=\ \!} \newcommand{\MGF}{\text{MGF}}\)
The spot price of a foreign currency is \(1 = S_t\) (LHS in units of foreign currency, RHS in units of domestic currency), which is equivalent to \(1/S_t = 1\) with the sides swapped. We say \(S_t\) is the price in domestic terms.
Selling domestic currency to buy foreign currencies.
Value for the buyer is (in domestic currency) \(PV=(S_t - R)N\). This is because of the two cash flows:
Executing a spot exchange at a future time \(T\) with a contract rate \(R\) agreed today.
Value for the buyer is (in domestic currency) \(PV=(S_t\cdot P^f - R\cdot P^d)N\). This is because of the two cash flows at time \(T\):
which have present values at time \(t\)
We set \(PV=0\) for the forward contract and get \(F\equiv R=S_t\cdot P^f /\ P^d=S_t\exp[(r^d - r^f)\cdot(T-t)]\). Therefore, we also have \(F-S_t\approx S_t(r^d - r^f)\cdot(T-t)\).
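As a quick numeric sanity check of the CIP forward and its first-order approximation (function and variable names are my own):

```python
import math

def cip_forward(spot, r_d, r_f, tau):
    """Covered-interest-parity forward: F = S * exp((r_d - r_f) * tau)."""
    return spot * math.exp((r_d - r_f) * tau)

# EURUSD spot 1.1860, domestic (USD) rate 2%, foreign (EUR) rate 0%, 1 year
F = cip_forward(1.1860, 0.02, 0.00, 1.0)
# first-order approximation: F - S ~ S * (r_d - r_f) * tau
approx = 1.1860 * (0.02 - 0.00) * 1.0
```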
In order to replicate a forward contract, we can execute a spot contract, borrow domestic and lend foreign. Namely, we have cash flows at time \(t\):
and at time \(T\):
This yields \(S_t/P^d = F\cdot 1 / P^f\), or \(F=S_t\cdot P^f / P^d\), which is what we call the CIP (covered interest parity). This means higher-interest-rate currencies will be "weaker" on a forward basis.
From the CIP we have \(P^f = P^d \cdot F/S_t\), which gives \(r^f = r^d - \log(F/S_t)\ /\ (T-t)\).
Swapping a forward contract (\(T_1\), \(R_1\)) for another (\(T_2\), \(R_2\)).
Value for the buyer is (in domestic currency) \[\begin{align*}PV&=(S_t\cdot P^{f1} - R_1\cdot P^{d1} - S_t\cdot P^{f2} + R_2\cdot P^{d2})\cdot N\\&=\left\{S_t\left[\exp(-r^{f1}(T_1-t)) - \exp(-r^{f2}(T_2-t))\right] - R_1\exp(-r^{d1}(T_1-t)) + R_2\exp(-r^{d2}(T_2-t))\right\}\cdot N\end{align*}\] which is rather insensitive w.r.t. the spot rate: \[PV_S = \frac{\partial PV}{\partial S} = (P^{f1} - P^{f2})N = \left[\exp(-r^{f1}(T_1-t)) - \exp(-r^{f2}(T_2-t))\right]\cdot N\approx N r^f(T_2 - T_1)\] compared with that of a forward contract: \[PV_S = P^f\cdot N = \exp[-r^f(T-t)]\cdot N \approx N.\]
The right (but not the obligation) to exchange \(N\) units of foreign currency for \(N\cdot K\) units of domestic currency at time \(T\). That is to say, the right to buy foreign currency is a foreign call and, at the same time, a domestic put.
We have the put-call parity \(C-P=P^d(F-K)\) and the payoff of a foreign call option, \(\max(0, S_T-K)\). We assume \(\{S_t\}_{0\le t\le T}\) follows the GBM \(\d S = \mu S \d t + \sigma S \d W\) which, according to Itô's lemma, gives \[\d V = \left(\frac{1}{2}\sigma^2S^2V_{SS} + V_t\right) \d t + V_S\d S\] where \(V\) is any derivative on \(S\) (remark: subscripts on \(V\) here denote partial derivatives, not time indices). Now, noticing the hedged portfolio \(\Pi = \{+1 \text{ unit of }V;\ -V_S \text{ units of } S\cdot D^f\}\) has dynamics \[\begin{align*}\d\Pi &= \d V - V_S\d (S\cdot D^f) \\&= \left(\frac{1}{2}\sigma^2S^2V_{SS} + V_t\right) \d t + V_S\d S - V_S(D^f \d S + S\cdot r^f \d t) \\&= \left(\frac{1}{2}\sigma^2S^2V_{SS} + V_t - r^fV_S S\right) \d t\end{align*}\] where we used the fact that \(D^f(t)=1\). Now under the risk-neutral measure, we know \[\left(\frac{1}{2}\sigma^2S^2V_{SS} + V_t - r^fV_S S\right)\d t = r^d(V - V_S S)\d t\] which gives the so-called Garman-Kohlhagen PDE: \[\frac{1}{2}\sigma^2S^2V_{SS} + (r^d - r^f)V_S S - r^d V + V_t = 0\] with boundary conditions \(V(S_T,T)=(S_T-K)^+\) and \(V(0,t)=0\).
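The PDE's closed-form solution is the standard Garman-Kohlhagen formula; the sketch below (names are mine) implements it and verifies the put-call parity \(C-P=P^d(F-K)\) stated above:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gk_price(S, K, r_d, r_f, sigma, tau, call=True):
    """Garman-Kohlhagen price of a European FX option (foreign call = domestic put)."""
    d1 = (math.log(S / K) + (r_d - r_f + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    if call:
        return S * math.exp(-r_f * tau) * norm_cdf(d1) - K * math.exp(-r_d * tau) * norm_cdf(d2)
    return K * math.exp(-r_d * tau) * norm_cdf(-d2) - S * math.exp(-r_f * tau) * norm_cdf(-d1)

S, K, r_d, r_f, sigma, tau = 1.1860, 1.20, 0.02, 0.00, 0.10, 0.5
C = gk_price(S, K, r_d, r_f, sigma, tau, call=True)
P = gk_price(S, K, r_d, r_f, sigma, tau, call=False)
F = S * math.exp((r_d - r_f) * tau)            # CIP forward
parity_rhs = math.exp(-r_d * tau) * (F - K)    # P^d * (F - K)
```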
Trade date is when the terms of the transaction are agreed. Currency trading is a global, 24-hour market; the "trading day" ends at 5pm New York time. Value date is when cash flows occur, i.e., when currencies are delivered. The value date for spot transactions is "T+2" for most currency pairs; however, the spot value date is "T+1" for USD versus CAD, RUB, TRY, PHP.
| Trade Date (T+0) | T+1 | Value Date (T+2) |
| --- | --- | --- |
| Trade terms are agreed | | Two currency payments are delivered |
| | Good day for CCY1 and CCY2 if non-USD | Good day for CCY1 and CCY2 |
| | Can be a USD holiday | Cannot be a USD holiday |
We usually refer to the two currencies in a pair (written CCY1/CCY2, with the "/" usually omitted) as any of the following:
| CCY1 | CCY2 |
| --- | --- |
| Base Currency | Terms Currency |
| Fixed Currency | Variable Currency |
| Home Currency | Overseas Currency |
When we say EURUSD \(= 1.1860\), we mean \(1\) EUR \(=\) \(1.1860\) USD.
In the context of bid-offer spreads, we denote the bid and offer prices as EURUSD \(=1.1859/1.1860\) (or \(1.1859/60\) as shorthand). These spreads may vary; the possible reasons involve liquidity, volatility and the cost of risk.
In terms of USD, the direct quotes are CCYUSD and the indirect quotes are USDCCY.
Spot rates can be calculated from an indirect market: e.g. when EURUSD \(=1.1882\) and USDJPY \(=109.14\), we have the cross rate EURJPY \(=129.68\), which does not necessarily coincide with the actual rate in the market.
It's neither of the two currencies' official interest rates. Instead, people use the deposit rate; specifically, in terms of USD, it's the Eurodollar deposit rate, or LIBOR.
Contracts with any delivery date (a.k.a. value date) other than spot are considered forward. Standard delivery dates may be in weeks or months; otherwise the date is called "broken". Specifically, we call the contract "cash" if its delivery date is today, and "tom" if it's tomorrow. FX forwards are OTC (over-the-counter).
We define: \(\text{forward point} = \text{forward rate (outright)}  \text{spot rate}\). The number is usually scaled by \(10^4\).
We have \[\text{forward} =\text{spot} \times \frac{1 + R_{\text{variable CCY}}\times\text{days}/ 360}{1 + R_{\text{fixed CCY}}\times\text{days}/360}\] where we use \(R\) instead of \(P\) here as it's more commonly given.
Using the CIP above, we have \[R^f = \frac{(S/F)\times(1 + R^d\times\text{days}/360) - 1}{\text{days}/360}\] where we assume the rates are not compounded.
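These money-market conventions are easy to get wrong, so here is a minimal sketch (names are mine) of the forward outright and the implied foreign rate, with a round-trip consistency check:

```python
def fx_forward_simple(spot, r_variable, r_fixed, days):
    """Forward outright from simple ACT/360 money-market rates."""
    return spot * (1 + r_variable * days / 360) / (1 + r_fixed * days / 360)

def implied_foreign_rate(spot, forward, r_d, days):
    """Invert CIP for the foreign (fixed-currency) simple rate."""
    t = days / 360
    return ((spot / forward) * (1 + r_d * t) - 1) / t

# domestic (variable) rate 2%, foreign (fixed) rate 0.5%, 90 days
F = fx_forward_simple(1.1860, 0.02, 0.005, 90)
r_f = implied_foreign_rate(1.1860, F, 0.02, 90)
# round-trip: the implied rate recovers the fixed-CCY rate used above
```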
Contracts that alter the value date on an existing trade by simultaneously executing two forward transactions.
| | # of legs | FX Risk | IR Spread Risk |
| --- | --- | --- | --- |
| Spot | 1 | Yes | No |
| Forward | 1 | Yes | Yes |
| Swap | 2 | No | Yes |
We define: \(\text{swap point} = \text{far rate} - \text{near rate}\).
To be continued.
This is a brief selection of my notes on the stochastic calculus course. Content may be updated at times. \(\newcommand{\E}{\text{E}} \newcommand{\P}{\text{P}} \newcommand{\Q}{\text{Q}} \newcommand{\F}{\mathcal{F}} \newcommand{\d}{\text{d}} \newcommand{\N}{\mathcal{N}} \newcommand{\sgn}{\text{sgn}} \newcommand{\tr}{\text{tr}} \newcommand{\bs}{\boldsymbol} \newcommand{\eeq}{\ \!=\mathrel{\mkern3mu}=\ \!} \newcommand{\eeeq}{\ \!=\mathrel{\mkern3mu}=\mathrel{\mkern3mu}=\ \!} \newcommand{\R}{\mathbb{R}} \newcommand{\MGF}{\text{MGF}}\)
For \(X\sim\N(\mu,\sigma^2)\), we have \(\MGF(\theta)=\exp(\theta\mu + \theta^2\sigma^2/2)\). We have \(\E(X^k) = \MGF^{\ (k)}(0)\).
Consider a two-sided truncation \((a,b)\) of \(\N(\mu,\sigma^2)\); then \[\E[X\mid a < X < b] = \mu - \sigma\frac{\phi(\alpha) - \phi(\beta)}{\Phi(\alpha) - \Phi(\beta)}\] where \(\alpha:=(a-\mu)/\sigma\) and \(\beta:=(b-\mu)/\sigma\).
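A small pure-Python check of the truncated-mean formula against direct numerical integration (names are mine):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncated_mean(mu, sigma, a, b):
    """E[X | a < X < b] for X ~ N(mu, sigma^2)."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    return mu - sigma * (phi(alpha) - phi(beta)) / (Phi(alpha) - Phi(beta))

# compare with a midpoint-rule integral of x * density over (a, b)
mu, sigma, a, b = 1.0, 2.0, 0.0, 3.0
num = den = 0.0
steps = 200000
for i in range(steps):
    x = a + (b - a) * (i + 0.5) / steps
    w = phi((x - mu) / sigma) / sigma
    num += x * w
    den += w
numeric = num / den
```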
Let \(X\) be a MG and \(T\) a stopping time, then \(\E X_{T\wedge n} = \E X_0\) for any \(n\).
Define \((Z\cdot X)_n:=\sum_{i=1}^n Z_i(X_i - X_{i-1})\) where \(X\) is a MG with \(X_0=0\) and \(Z_n\) is predictable and bounded; then \((Z\cdot X)\) is a MG. If \(X\) is a sub-MG, then so is \((Z\cdot X)\). Furthermore, if \(Z_i\in[0,1]\), then \(\E(Z\cdot X)_n\le \E X_n\).
If \(X\) is a MG and \(\phi(\cdot)\) is a convex function, then \(\phi(X)\) is a sub-MG.
Given a measure \(\P\), we define the likelihood ratio \(Z:=\d\Q / \d\P\) for another measure \(\Q\). For example, for the change of numéraire from CASH (the \(\P\)-measure) to STOCK (the \(\Q\)-measure): \(Z(\omega) = (\d\Q/\d\P)(\omega) = S_N(\omega) / S_0\).
If \(B\) is a BM and \(T=\tau(\cdot)\) is a stopping time, then \(\{B_{t+T} - B_T\}_{t\ge 0}\) is a BM indep. of \(\{B_t\}_{t\le T}\).
If \(B\) is a standard \(k\)BM and \(U\in\mathbb{R}^{k\times k}\) is orthogonal, then \(UB\) is also a standard \(k\)BM.
For any sub-MG \(X\), we have the unique decomposition \(X=M+A\) where \(M_n:=X_0 + \sum_{i=1}^n [X_i - \E(X_i\mid \F_{i-1})]\) is a martingale and \(A_n:=\sum_{i=1}^n[\E(X_i\mid \F_{i-1}) - X_{i-1}]\) is a nondecreasing predictable sequence.
For BM \(B\) and stopping time \(T=\tau(a)\), define \(B^*\) s.t. \(B_t^*=B_t\) for all \(t\le T\) and \(B_t^* = 2a - B_t\) for all \(t>T\); then \(B^*\) is also a BM.
\(\P(\max_{s\le t}B_s > x\text{ and }B_t < y) = \Phi\!\left(\frac{y-2x}{\sqrt{t}}\right)\) for \(y\le x\).
Let \(X\) and \(Y\) be indep. BM. Note that for all \(t\ge 0\), from the exponential MG we know \(\E[\exp(i\theta X_t)]=\exp(-\theta^2 t/2)\). Now define \(T=\tau(a)\) for \(Y\); then \(\E[\exp(i\theta X_T)] = \E[\exp(-\theta^2 T /2)]=\exp(-|\theta| a)\), which is the Fourier transform of the Cauchy density \(f_a(x)=\frac{1}{\pi}\frac{a}{a^2+x^2}\).
We define Itô integral \(I_t(X) := \int_0^t\! X_s\d W_s\) where \(W_t\) is a standard Brownian process and \(X_t\) is adapted.
This is a direct result of the second martingale property above. Let \(X_t\) be nonrandom and continuously differentiable; then \[\E\!\left[\!\left(\int_0^t X_s\d W_s\right)^{\!\!2}\right] = \E\!\left[\int_0^t X_s^2\d s\right].\]
Let \(W_t\) be a standard Brownian motion and let \(f:\R\mapsto\R\) be a twice continuously differentiable function s.t. \(f\), \(f'\) and \(f''\) are all bounded; then for all \(t>0\) we have \[\d f(W_t) = f'(W_t)\d W_t + \frac{1}{2}f''(W_t) \d t.\]
Let \(W_t\) be a standard Brownian motion and let \(f:[0,\infty)\times\R\mapsto\R\) be a twice continuously differentiable function s.t. its partial derivatives are all bounded; then for all \(t>0\) we have \[\d f(t, W_t) = f_x\d W_t + \left(f_t + \frac{1}{2}f_{xx}\right) \d t.\]
The Wiener integral is a special case of Itô integral where \(f(t)\) is here a nonrandom function of \(t\). Variance of a Wiener integral can be derived using Itô isometry.
We say \(X_t\) is an Itô process if it satisfies \[\d X_t = Y_t\d W_t + Z_t\d t\] where \(Y_t\) and \(Z_t\) are adapted and \(\forall t\) \[\int_0^t\! \E Y_s^2\d s < \infty\quad\text{and}\quad\int_0^t\! \E|Z_s|\d s < \infty.\] The quadratic variation of \(X_t\) is \[[X,X]_t = \int_0^t\! Y_s^2\d s.\]
Assume \(X_t\) and \(Y_t\) are two Itô processes; then \[\frac{\d (XY)}{XY} = \frac{\d X}{X} + \frac{\d Y}{Y} + \frac{\d X\d Y}{XY}\] and \[\frac{\d (X/Y)}{X/Y} = \frac{\d X}{X} - \frac{\d Y}{Y} + \left(\frac{\d Y}{Y}\right)^{\!2} - \frac{\d X\d Y}{XY}.\]
A Brownian bridge on \([0,T]\) is a continuous-time stochastic process \(X_t\) with both ends pinned: \(X_0=X_T=0\). For \(T=1\), the SDE is \[\d X_t = -\frac{X_t}{1-t}\d t + \d W_t,\] and the bridge admits the representation \[X_t = W_t - \frac{t}{T}W_T.\]
Let \(X_t\) be an Itô process. Let \(u(t,x)\) be a twicecontinuously differentiable function with \(u\) and its partial derivatives bounded, then \[\d u(t, X_t) =\frac{\partial u}{\partial t}(t, X_t)\d t +\frac{\partial u}{\partial x}(t, X_t)\d X_t +\frac{1}{2}\frac{\partial^2 u}{\partial x^2}(t, X_t)\d [X,X]_t.\]
The OU process describes a stochastic process that has a tendency to return to an "equilibrium" position \(0\), with reverting velocity proportional to its distance from the origin. It's given by the SDE \[\d X_t = -\alpha X_t \d t + \d W_t \quad\Rightarrow\quad \d [\exp(\alpha t)X_t] = \exp(\alpha t)\d W_t \] which solves to \[X_t = \exp(-\alpha t)\left[X_0 + \int_0^t\! \exp(\alpha s)\d W_s\right].\]
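The integrating-factor step is a one-line check with Itô's product rule: \[\d[\exp(\alpha t)X_t] = \alpha\exp(\alpha t)X_t\d t + \exp(\alpha t)\d X_t = \alpha\exp(\alpha t)X_t\d t + \exp(\alpha t)(-\alpha X_t\d t + \d W_t) = \exp(\alpha t)\d W_t.\]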
Remark: In finance, the OU process is often called the Vasicek model.
The SDE for general diffusion process is \(\d X_t = \mu(X_t)\d t + \sigma(X_t)\d W_t\).
In order to find \(\P(X_T=B)\) where we define \(T=\inf\{t\ge 0: X_t=A\text{ or }B\}\), we consider a harmonic function \(f(x)\) s.t. \(f(X_t)\) is a MG. This gives the ODE \[f'(x)\mu(x) + f''(x)\sigma^2(x)/2 = 0\quad\Rightarrow\quad f(x) = \int_A^x C_1\exp\left\{\!-\int_A^z\frac{2\mu(y)}{\sigma^2(y)}\d y\right\}\d z + C_2\] where \(C_{1,2}\) are constants. Then since \(f(X_{T\wedge t})\) is a bounded MG, by Doob's identity we have \[\P(X_T=B) = \frac{f(X_0) - f(A)}{f(B) - f(A)}.\]
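As a quick sanity check, for a driftless diffusion (\(\mu\equiv 0\), \(\sigma\equiv 1\)) the exponential in the integrand is identically \(1\), so \(f(x)=x\) up to an affine map, and we recover the classical gambler's-ruin probability \[\P(X_T=B) = \frac{X_0 - A}{B - A}.\]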
Let \(\bs{W}_t\) be a \(K\)-dimensional standard Brownian motion. Let \(u:\R^K\mapsto \R\) be a \(C^2\) function with bounded first and second partial derivatives. Then \[\d u(\bs{W}_t) = \nabla u(\bs{W}_t)\cdot \d \bs{W}_t + \frac{1}{2}\tr[\Delta u(\bs{W}_t)] \d t\] where the gradient operator \(\nabla\) gives the vector of all first-order partial derivatives, and \(\Delta u\) here denotes the matrix of all second-order partial derivatives (the Hessian), whose trace is the Laplacian.
If \(T\) is a stopping time for \(\bs{W_t}\), then for any fixed \(t\) we have \[\E[u(\bs{W}_{T\wedge t})] = u(\bs{0}) + \frac{1}{2}\E\!\left[\int_0^{T\wedge t}\!\!\Delta u(\bs{W}_s)\d s\right].\]
A \(C^2\) function \(u:\R^k\mapsto\R\) is said to be harmonic in a region \(\mathcal{U}\) if \(\Delta u(x) = 0\) for all \(x\in \mathcal{U}\). Examples are \(u(x,y)=2\log(r)\) and \(u(x,y,z)=1/r\), where \(r\) is the Euclidean norm of the argument.
Remark: \(f\) being a harmonic function is equivalent to \(f(X_t)\) being a MG, i.e. \(f'(x)\mu(x) + f''(x)\sigma^2(x)/2 = 0\) for a diffusion process \(X_t\).
Let \(u\) be harmonic in an open region \(\mathcal{U}\) with compact closure, and assume that \(u\) and its partials extend continuously to the boundary \(\partial \mathcal{U}\). Define \(T\) to be the first exit time of Brownian motion from \(\mathcal{U}\). For any \(\bs{x}\in\mathcal{U}\), let \(\E^{\bs{x}}\) be the expectation under the measure \(\P^{\bs{x}}\) s.t. \(\bs{W}_t - \bs{x}\) is a \(K\)-dimensional standard BM. Then \[u(\bs{x}) = \E^{\bs{x}}[u(\bs{W}_T)].\]
A multivariate Itô process is a continuous-time stochastic process \(X_t\in\R\) of the form \[X_t = X_0 + \int_0^t\! M_s \d s + \int_0^t\! \bs{N}_s\cdot \d \bs{W}_s\] where \(\bs{N}_t\) is an adapted \(\R^K\)-valued process and \(\bs{W}_t\) is a \(K\)-dimensional standard BM.
Let \(\bs{W}_t\in\R^K\) be a standard \(K\)-dimensional BM, and let \(\bs{X}_t\in\R^m\) be a vector of \(m\) multivariate Itô processes satisfying \[\d X_t^i = M_t^i\d t + \bs{N}_t^i\cdot \d \bs{W}_t.\] Then for any \(C^2\) function \(u:\R^m\mapsto\R\) with bounded first and second partial derivatives \[\d u(\bs{X}_t) = \nabla u(\bs{X}_t)\cdot \d \bs{X}_t + \frac{1}{2}\tr[\Delta u(\bs{X}_t)\cdot \d [\bs{X},\bs{X}]_t].\]
Let \(\bs{W}_t\) be a standard \(K\)-dimensional BM, and let \(\bs{U}_t\) be an adapted \(K\)-dimensional process satisfying \[\|\bs{U}_t\| = 1\quad\forall t\ge 0.\] Then we know the following \(1\)-dimensional Itô process is a standard BM: \[X_t := \int_0^t\!\! \bs{U}_s\cdot \d \bs{W}_s.\]
Let \(\bs{W}_t\) be a standard \(K\)-dimensional BM, and let \(R_t=\|\bs{W}_t\|\) be the corresponding radial process; then \(R_t\) is a Bessel process with parameter \((K-1)/2\), given by \[\d R_t = \frac{K-1}{2R_t}\d t + \d W_t^{\sgn}\] where we define \(\d W_t^{\sgn} := \sgn(\bs{W}_t)\cdot \d \bs{W}_t\) with \(\sgn(\bs{w}) := \bs{w}/\|\bs{w}\|\).
A Bessel process with parameter \(a\) is a stochastic process \(X_t\) given by \[\d X_t = \frac{a}{X_t}\d t+ \d W_t.\] Since this is just a special case of a diffusion process, we know the corresponding harmonic function is \(f(x)=C_1x^{1-2a} + C_2\), and the hitting probability (taking \(A\downarrow 0\) and writing \(x:=X_0\)) is \[\P(X_T=B) = \frac{f(X_0) - f(A)}{f(B) - f(A)} =\begin{cases}1 & \text{if }a > 1/2,\\(x/B)^{1-2a} & \text{otherwise}.\end{cases}\]
Let \(W_t\) be a standard \(1\)-dimensional Brownian motion and let \(\F_t\) be the \(\sigma\)-algebra of all events determined by the path \(\{W_s\}_{s\le t}\). If \(Y\) is any r.v. with mean \(0\) and finite variance that is measurable with respect to \(\F_t\) for some \(t > 0\), then \[Y = \int_0^t\! A_s\d W_s\] for some adapted process \(A_t\) that satisfies \[\E(Y^2) = \int_0^t\! \E(A_s^2)\d s.\] This theorem is of importance in finance because it implies that in the Black-Scholes setting, every contingent claim can be hedged.
Special case: let \(Y_t=f(W_t)\) be any mean \(0\) r.v. with \(f\in C^2\). Let \(u(s,x):=\E[f(W_t)\mid W_s = x]\), then \[Y_t = f(W_t) = \int_0^t\! u_x(s,W_s)\d W_s.\]
Consider a market with two assets: CASH, a money-market account \(M_t\) with nonrandom rate of return \(r_t\), and STOCK, with share price \(S_t\) such that \(\d S_t = S_t(\mu_t \d t + \sigma \d W_t)\). Under a risk-neutral measure \(\P\), the discounted share price \(S_t / M_t\) is a martingale and thus \[\frac{S_t}{M_t} = \frac{S_0}{M_0}\exp\left\{\sigma W_t - \frac{\sigma^2t}{2}\right\}\] where we used the fact that \(\mu_t = r_t\) by the Fundamental Theorem.
A European contingent claim with expiration date \(T > 0\) and payoff function \(f:\R\mapsto\R\) is a tradeable asset that pays \(f(S_T)\) at time \(T\). By the Fundamental Theorem, the discounted price of this claim at any \(t\le T\) is \(\E[f(S_T)/M_T\mid \F_t]\). In order to calculate this conditional expectation, let \(g(W_t):= f(S_t)/M_t\); then by the Markov property of BM we know \(\E[g(W_T)\mid \F_t] = \E[g(W_t + W_{T-t}^*)\mid \F_t]\) where \(W_t\) is \(\F_t\)-measurable and \(W^*\) is an independent BM.
The discounted time-\(t\) price of a European contingent claim with expiration date \(T\) and payoff function \(f\) is \[\E[f(S_T)/M_T\mid \F_t] = \frac{1}{M_T}\E\!\left[f\!\left(S_t\exp\!\left\{\sigma W_{T-t}^* - \frac{\sigma^2(T-t)}{2} + R_T - R_t\right\}\right)\middle|\,\F_t\right]\] where \(S_t\) is \(\F_t\)-measurable and independent of \(W^*\). The expectation is calculated using the normal distribution. Note here \(R_t = \int_0^t r_s\d s\) is the cumulative (log-compounded) interest rate, so that \(M_t = \exp(R_t)\).
Under the risk-neutral probability measure, the discounted price of the claim is a martingale, i.e. it has no drift term. So we can differentiate \(M_t^{-1}u(t,S_t)\) by Itô and derive the following PDE \[u_t(t,S_t) + r_t S_tu_x(t,S_t) + \frac{\sigma^2S_t^2}{2}u_{xx}(t,S_t) = r_t u(t,S_t)\] with terminal condition \(u(T,S_T)=f(S_T)\). Note here everything is under the BS model.
A replicating portfolio for a contingent claim in STOCK and CASH is given by \[V_t = \alpha_t M_t + \beta_t S_t\] where \(\alpha_t = [u(t,S_t) - S_t u_x(t,S_t)]/M_t\) and \(\beta_t = u_x(t,S_t)\).
A barrier option pays \(\$1\) at time \(T\) if \(\max_{t\le T} S_t \ge AS_0\) and \(\$0\) otherwise. This is a simple example of a path-dependent option. Other commonly used examples are knock-ins, knock-outs, lookbacks and Asian options.
The time-\(0\) price of such a barrier option is calculated from \[\begin{align*}V_0 &= \exp(-rT)\P\!\left(\max_{t\le T} S_t \ge AS_0\right)= \exp(-rT)\P\!\left(\max_{t\le T}\, W_t + \mu t \ge a\right)\\&= \exp(-rT)\P_{\mu}\!\left(\max_{t\le T} W_t \ge a\right)\end{align*}\] where \(\mu=r\sigma^{-1} - \sigma/2\) and \(a = \sigma^{-1}\log A\). Now, by Cameron-Martin we know \[\begin{align*}\P_{\mu}\!\left(\max_{t\le T} W_t \ge a\right) &=\E_0[Z_T\cdot \mathbf{1}_{\{\max_{t\le T} W_t\ge a\}}] =\E_0[\exp(\mu W_T - \mu^2 T / 2)\cdot \mathbf{1}_{\{\max_{t\le T} W_t\ge a\}}] \\ &=\exp(-\mu^2 T / 2)\cdot \E_0[\exp(\mu W_T)\cdot \mathbf{1}_{\{\max_{t\le T} W_t\ge a\}}]\end{align*}\] and by the reflection principle we have \[\E_0[\exp(\mu W_T)\cdot \mathbf{1}_{\{\max_{t\le T} W_t\ge a\}}] =e^{\mu a}\int_0^{\infty} (e^{\mu y} + e^{-\mu y}) \P(W_T - a \in \d y),\] so altogether \[\P_{\mu}\!\left(\max_{t\le T} W_t \ge a\right)=\Phi(\mu\sqrt{T} - a/\sqrt{T}) + e^{2\mu a}\Phi(-\mu\sqrt{T}-a/\sqrt{T}).\]
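The closed form is short enough to implement directly; a sketch (names are mine), with the sanity check that \(A=1\) makes the barrier certain to be hit, so the price collapses to the pure discount factor:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def barrier_digital_price(r, sigma, A, T):
    """Time-0 price of $1 paid at T iff max_{t<=T} S_t >= A * S_0 (BS dynamics)."""
    mu = r / sigma - sigma / 2.0
    a = math.log(A) / sigma
    hit_prob = Phi(mu * math.sqrt(T) - a / math.sqrt(T)) \
             + math.exp(2.0 * mu * a) * Phi(-mu * math.sqrt(T) - a / math.sqrt(T))
    return math.exp(-r * T) * hit_prob

price = barrier_digital_price(r=0.02, sigma=0.2, A=1.2, T=1.0)
```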
The exponential process \[Z_t = \exp\!\left\{\int_0^t\! Y_s\d W_s - \frac{1}{2}\int_0^t\! Y_s^2\d s\right\}\] is a positive MG given \[\E\!\left[\int_0^t\! Z_s^2Y_s^2\d s\right] < \infty.\] Specifically, with constant \(Y\equiv\theta\), the exponential martingale is given by the SDE \(\d X_t = \theta X_t \d W_t\).
Assume that under the probability measure \(\P\) the exponential process \(Z_t(Y)\) is a MG and \(W_t\) is a standard BM. Define the absolutely continuous probability measure \(\Q\) on \(\F_t\) with likelihood ratio \(Z_t\), i.e. \((\d\Q/\d\P)_{\F_t} = Z_t\); then under \(\Q\) the process \[W_t^* := W_t - \int_0^t\! Y_s\d s\] is a standard BM. Girsanov's theorem shows that drift can be added or removed by a change of measure.
The exponential process \[Z_t = \exp\!\left\{\int_0^t\! Y_s \d W_s - \frac{1}{2}\!\int_0^t\! Y_s^2 \d s\right\}\] is a MG given \[\E\left[\exp\!\left\{\frac{1}{2}\!\int_0^t\! Y_s^2\d s\right\}\right] < \infty.\] This theorem gives another way to show whether an exponential process is a MG.
Assume \(W_t\) is a standard BM under \(\P\), and define the likelihood ratio \(Z_t = (\d\Q/\d\P)_{\F_t}\) as above with \(Y_t = -\alpha W_t\); then by Girsanov, \(W_t\) under \(\Q\) is an OU process.
If a system can be in one of a collection of states \(\{\omega_i\}_{i\in\mathcal{I}}\), the probability of finding it in a particular state \(\omega_i\) is proportional to \(\exp\{-H(\omega_i)/kT\}\) where \(k\) is Boltzmann's constant, \(T\) is temperature and \(H(\cdot)\) is energy.
If \(W_t\) is a standard BM with \(W_0 = x \in (0, A)\), how does \(W_t\) behave conditional on the event that it hits \(A\) before \(0\)? Define \(T:=\inf\{t\ge 0: W_t\in\{0,A\}\}\) and the conditioned measure \(\Q^x(\cdot) := \P^x(\,\cdot\mid W_T=A)\).
Then the likelihood ratios are \[\left(\frac{\d\Q^x}{\d\P^x}\right)_{\!\F_T} \!= \frac{\mathbf{1}_{\{W_T=A\}}}{\P^x\{W_T=A\}} \Rightarrow\left(\frac{\d\Q^x}{\d\P^x}\right)_{\!\F_{T\wedge t}} \!= \E\!\left[\left(\frac{\d\Q^x}{\d\P^x}\right)_{\!\F_T}\middle|\,\F_{T\wedge t}\right] = \frac{W_{T\wedge t}}{x}.\] Notice \[\begin{align*}\frac{W_{T\wedge t}}{x} &=\exp\left\{\log W_{T\wedge t}\right\} / x \overset{\text{Itô}}{\eeq}\exp\left\{\log W_0 + \int_0^{T\wedge t}W_s^{-1}\d W_s - \frac{1}{2}\int_0^{T\wedge t} W_s^{-2}\d s\right\} / x \\&=\exp\left\{\int_0^{T\wedge t}W_s^{-1}\d W_s - \frac{1}{2}\int_0^{T\wedge t} W_s^{-2}\d s\right\}\end{align*}\] which is a Girsanov likelihood ratio, so we conclude \(W_t\) under \(\Q^x\) is a BM with drift \(W_t^{-1}\), or equivalently \[W_t^* = W_t - \int_0^{T\wedge t}W_s^{-1}\d s\] is a standard BM with initial point \(W_0^* = x\).
A one-dimensional Lévy process is a continuous-time random process \(\{X_t\}_{t\ge 0}\) with \(X_0=0\) and stationary, independent increments. Lévy processes are defined to be a.s. right continuous with left limits.
Remark: Brownian motion is the only Lévy process with continuous paths.
Let \(B_t\) be a standard BM. Define the FPT (first-passage-time) process as \(\tau_x = \inf\{t\ge 0: B_t \ge x\}\). Then \(\{\tau_{x}\}_{x\ge 0}\) is a Lévy process called the one-sided stable-\(1/2\) process. Specifically, each sample path \(x\mapsto \tau_x\) is nondecreasing. Such Lévy processes with nondecreasing paths are called subordinators.
A Poisson process with rate (or intensity) \(\lambda > 0\) is a Lévy process \(N_t\) such that for any \(t\ge 0\) the distribution of the random variable \(N_t\) is the Poisson distribution with mean \(\lambda t\). Thus, for any \(k=0,1,2,\cdots\) we have \(\P(N_t=k) = (\lambda t)^k\exp(-\lambda t)\ /\ k!\) for all \(t > 0\).
Remark 1: (Superposition Theorem) If \(N_t\) and \(M_t\) are independent Poisson processes of rates \(\lambda\) and \(\mu\) respectively, then the superposition \(N_t + M_t\) is a Poisson process of rate \(\lambda+\mu\).
Remark 2: (Exponential Intervals) Successive inter-arrival times are i.i.d. exponential r.v.s. with common mean \(1/\lambda\).
Remark 3: (Thinning Property) Marking the points of a Poisson-\(\lambda\) process with independent Bernoulli-\(p\) indicators and keeping the marked ones yields again a Poisson process, with rate \(\lambda p\).
Remark 4: (Compounding) Every compound Poisson process is a Lévy process. We call \(\lambda F\) the Lévy measure, where \(F\) is the compounding distribution.
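The exponential-interval and thinning remarks are easy to verify by simulation; a seeded pure-Python sketch (names are mine):

```python
import random

random.seed(42)

def poisson_count(lam, t):
    """N_t of a rate-lam Poisson process, built from i.i.d. Exp(lam) inter-arrival times."""
    n, clock = 0, random.expovariate(lam)
    while clock <= t:
        n += 1
        clock += random.expovariate(lam)
    return n

lam, t, sims = 3.0, 2.0, 20000
counts = [poisson_count(lam, t) for _ in range(sims)]
mean_count = sum(counts) / sims        # should be close to lam * t = 6

# thinning: keep each arrival independently with probability p
p = 0.4
thinned = [sum(1 for _ in range(c) if random.random() < p) for c in counts]
mean_thinned = sum(thinned) / sims     # should be close to lam * p * t = 2.4
```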
For \(N\sim\text{Pois}(\lambda)\), we have \(\MGF(\theta)=\exp[\lambda (e^{\theta}-1)]\).
For \(X_t=\sum_{i=1}^{N_t}\!Y_i\) where \(N_t\sim\text{Pois}(\lambda t)\) and \(\MGF_Y(\theta) = \psi(\theta) < \infty\), we have \(\MGF_{X_t}(\theta)=\exp[\lambda t (\psi(\theta) - 1)]\).
Binomial\((n,p_n)\) distribution, where \(n\to\infty\) and \(p_n\to 0\) s.t. \(np_n\to\lambda > 0\), converges to Poisson\(\lambda\) distribution.
If \(N_t\) is a Poisson process with rate \(\lambda\), then \(Z_t=\exp[\theta N_t - (e^{\theta} - 1) \lambda t]\) is a martingale for any \(\theta\in\R\).
Remark: Similar to Cameron-Martin, let \(N_t\) be a Poisson process with rate \(\lambda\) under \(\P\), and let \(\Q\) be the measure s.t. the likelihood ratio \((\d\Q/\d\P)_{\F_t}=Z_t\) is defined as above; then \(N_t\) under \(\Q\) is a Poisson process with rate \(\lambda e^{\theta}\).
Let \(X_t\) be a compound Poisson process with Lévy measure \(\lambda F\), and let the MGF of the compounding distribution \(F\) be \(\psi(\theta)\); then \(Z_t=\exp[\theta X_t - (\psi(\theta) - 1)\lambda t]\) is a martingale for any \(\theta\in\R\).
A \(K\)-dimensional Lévy process is a continuous-time random process \(\{\bs{X}_t\}_{t\ge 0}\) with \(\bs{X}_0=\bs{0}\) and stationary, independent increments. Like the one-dimensional version, vector Lévy processes are defined to be a.s. right continuous with left limits.
Remark: Given a nonrandom linear transform \(F:\R^K\mapsto \R^M\) and a \(K\)-dimensional Lévy process \(\{\bs{X}_t\}_{t\ge 0}\), the process \(\{F(\bs{X}_t)\}_{t\ge 0}\) is a Lévy process on \(\R^M\).
I have recently been playing a billiard game where you can play a series of exciting tournaments. In each tournament, you pay an entrance fee of, for example, \(\$500\), to potentially win a prize of, say, \(\$2500\). There are various kinds of tournaments with entrance fees ranging from \(\$100\) up to over \(\$10000\). After hundreds of games, my winning rate stabilized around \(58\%\), which is actually pretty good as it significantly beats random draws. A natural question therefore came into my mind: is there an optimal strategy?
Well, I think so. I'll list two strategies below and try to explore any potential optimality. We can reasonably model these tournaments as repeated betting with a certain fixed physical probability \(p\) of winning and odds^{[1]} of \((d-1)\):\(1\) against ourselves. Given that tournament entrance fees are available at sufficiently many levels, we may treat the amount staked as a continuous variable in order to maximize our long-run profitability. Without loss of generality, let's assume an initial balance of \(M_0>0\) and that money in this world is infinitely divisible. The problem then becomes determining the optimal fraction \(x\in[0,1]\) of our balance to bet s.t. the expected return is maximized. Nonetheless, different interpretations of this problem admit different solutions: some intriguing, others frustrating.
Let's first take a look at potential values of \(x\) and the corresponding balance trajectories \(M_t\). For any \(0 \le x \le 1\), we have probability \(p\) of getting an \(x\)-fraction of our whole balance \(d\)-folded and \(1-p\) of losing it, that is \[\text{E}(M_{t+1}\mid\mathcal{F}_t) = (1-x)M_t + p\cdot xdM_t + (1-p)\cdot 0 =[1 + (pd-1)x] M_t\] which indicates \(M_t\) is a submartingale^{[2]}, as in this particular case \(p=0.58\), \(d=5\) and thus \(pd=2.9 > 1\). So the optimal fraction is \(x^* = 1\), which is rather aggressive and yields a ruin probability of \(1-p^n\) within the first \(n\) bets. Simulation supports our worries: not once did we survive \(10\) bets in this tournament, and the maximum balance we ever attained was less than a million.
If we consider \(\log M_t\) instead, then \[\begin{align*}\text{E}(\log M_{t+1}\mid \mathcal{F}_t) &=p\cdot \log[(1-x)M_t + xdM_t] +(1-p)\cdot \log[(1-x)M_t + 0]\\ &=p\cdot \log[(1-(1-d)x)M_t] +(1-p)\cdot \log[(1-x)M_t].\end{align*}\] The first-order condition is \[\frac{\partial}{\partial x}\text{E}(\log M_{t+1}\mid \mathcal{F}_t) =\frac{p(d-1)}{1+(d-1)x}-\frac{1-p}{1-x} = 0 \quad\Rightarrow\quad x^* = \frac{pd-1}{d-1}=0.475\] which is the Kelly fraction: more conservative, and therefore likely to survive longer than the previous strategy. Simulation gives the following trajectories: even the worst sim beat the best we got when \(x=1\).
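Since the per-bet log growth \(g(x) = p\log[1+(d-1)x] + (1-p)\log(1-x)\) is deterministic, the Kelly fraction can be confirmed by a direct grid search (names are mine):

```python
import math

def log_growth(x, p=0.58, d=5):
    """Expected per-bet log growth when betting fraction x at odds (d-1):1."""
    return p * math.log(1.0 + (d - 1) * x) + (1.0 - p) * math.log(1.0 - x)

kelly = (0.58 * 5 - 1) / (5 - 1)   # (pd - 1)/(d - 1) = 0.475
# grid search over (0, 1) confirms the closed-form optimum
grid = [i / 1000.0 for i in range(1, 1000)]
best = max(grid, key=log_growth)
```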
According to Doob's martingale inequality^{[3]}, the probability of our balance ever attaining a value no less than \(C = 1\times10^{60}\) within \(T=500\) steps satisfies \[\text{P}\left(\sup_{t \le T}M_t\ge C\right) \le \frac{\text{E}(M_T)}{C} = \frac{M_0}{C} \prod_{t=0}^{T-1}\frac{\text{E}(M_{t+1}\mid\mathcal{F}_t)}{M_t} =\frac{M_0[1+(pd-1)x]^T}{C} \approx 4.6\times10^{79}\] (taking \(M_0=1\)), a bound far greater than one and therefore vacuous. In other words, Doob's inequality tells us nothing here: it does not contradict the simulated probability of around \(0.31\), but neither does it rule out the existence of a strategy significantly better than the one given by the Kelly criterion.
What is it, then? Or, does it actually exist? I don't have an idea yet, but perhaps exploratory algorithms like machine learning will give us some hints, and perhaps the strategy is not static but rather dynamic.
I've recently sold my Nvidia GTX 1080 eGPU^{[1]} after two months of waiting in vain for a compatible Nvidia video driver for macOS 10.14 (Mojave). Whether it's Apple's or Nvidia's fault, I don't care any more. Right away, I ordered an AMD Radeon RX Vega 64 on Newegg. The card arrived two days later and looked sexy at first sight. It's plug-and-play as expected and performed just as well as its predecessor, whether for serious gaming, video editing or whatever. I would have given it a 9.5/10 had I not found another issue a couple of days later — wow, there is no CUDA on this card!
Of course there isn't, because CUDA was developed by Nvidia, which has been putting great effort into building a more user-friendly deep-learning environment. By comparison, AMD (yes!) used to intentionally avoid head-to-head competition against the world's largest GPU maker and instead kept making gaming cards with better cost-to-performance ratios. ROCm, an open-source HPC/hyperscale-class platform for GPU computing that supports cards other than Nvidia's, has narrowed this gap considerably. However, ROCm still does not officially support macOS, and you have to run a Linux bootcamp to utilize the computational benefits of your AMD card, even though you can already game smoothly on your Mac. Sad it is, AMD 😰.
There are, however, several solutions if, like me, you really have to run your code on a Mac and would like to accelerate those Renaissance-long training times with a GPU. The method I adopted uses a framework called PlaidML, and I'd like to walk you through how I installed and configured my GPU with it.
pip3 install plaidml-keras plaidbench
After installation, we can set up the intended device for computing by running:
plaidml-setup
PlaidML Setup (0.3.5)

Thanks for using PlaidML!

Some Notes:
  * Bugs and other issues: https://github.com/plaidml/plaidml
  * Questions: https://stackoverflow.com/questions/tagged/plaidml
  * Say hello: https://groups.google.com/forum/#!forum/plaidml-dev
  * PlaidML is licensed under the GNU AGPLv3

Default Config Devices:
   No devices.

Experimental Config Devices:
   llvm_cpu.0 : CPU (LLVM)
   opencl_intel_intel(r)_iris(tm)_plus_graphics_655.0 : Intel Inc. Intel(R) Iris(TM) Plus Graphics 655 (OpenCL)
   opencl_cpu.0 : Intel CPU (OpenCL)
   opencl_amd_amd_radeon_rx_vega_64_compute_engine.0 : AMD AMD Radeon RX Vega 64 Compute Engine (OpenCL)
   metal_intel(r)_iris(tm)_plus_graphics_655.0 : Intel(R) Iris(TM) Plus Graphics 655 (Metal)
   metal_amd_radeon_rx_vega_64.0 : AMD Radeon RX Vega 64 (Metal)

Using experimental devices can cause poor performance, crashes, and other nastiness.

Enable experimental device support? (y,n)[n]:
Of course we enter y. Before choosing device 4 (OpenCL with AMD) or 6 (Metal with AMD), I'd like to benchmark the default device, CPU (LLVM). The test script (taking MobileNet as an example) is
plaidbench keras mobilenet
and the result shows^{[2]}
Running 1024 examples with mobilenet, batch size 1
INFO:plaidml:Opening device "llvm_cpu.0"
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.6/mobilenet_1_0_224_tf.h5
17227776/17225924 [==============================] - 2s 0us/step
Model loaded.
Compiling network...
Warming up ...
Main timing
Example finished, elapsed: 3.0688607692718506 (compile), 61.17863607406616 (execution), 0.059744761791080236 (execution per example)
Correctness: PASS, max_error: 1.7511049009044655e-05, max_abs_error: 6.556510925292969e-07, fail_ratio: 0.0
Now we run the setup again and choose 4 (OpenCL with AMD). The result is
Running 1024 examples with mobilenet, batch size 1
INFO:plaidml:Opening device "opencl_amd_amd_radeon_rx_vega_64_compute_engine.0"
Model loaded.
Compiling network...
Warming up ...
Main timing
Example finished, elapsed: 2.6935510635375977 (compile), 13.741217851638794 (execution), 0.01341915805824101 (execution per example)
Correctness: PASS, max_error: 1.7511049009044655e-05, max_abs_error: 1.1995434761047363e-06, fail_ratio: 0.0
Finally, we run the test on the device we expect to be most powerful, device 6 (Metal with AMD).
Running 1024 examples with mobilenet, batch size 1
INFO:plaidml:Opening device "metal_amd_radeon_rx_vega_64.0"
Model loaded.
Compiling network...
Warming up ...
Main timing
Example finished, elapsed: 2.243159055709839 (compile), 7.515545129776001 (execution), 0.007339399540796876 (execution per example)
Correctness: PASS, max_error: 1.7974503862205893e-05, max_abs_error: 1.0952353477478027e-06, fail_ratio: 0.0
In conclusion, by utilizing the Metal core on my Mac together with the external AMD GPU, the training runtime went down by roughly 87.7%, and I'm personally quite satisfied with that.
It's been more than two years since my last trip to the Arctic Circle, when I was still studying in the Netherlands. Our adventurous hike in Abisko, among the endless Northern European mountains, still features frequently in my dreams. This time we went to Fairbanks, Alaska, for the aurora and also for another Arctic experience.
We spent five days and six nights in Fairbanks. Apart from the two simple dinners we had on the hike and a beef-noodle one at the Arctic Circle camp, the Pump House ended up as the pick of our best-recommended feasts. It is a fine-dining restaurant and probably the best in town, as nearly every local you meet gives this little house a thumbs-up. Dining there was definitely one of the most enjoyable parts of our stay in Fairbanks: warming, relaxing and taste-bud-exciting.
Actually, at first we also planned to try Turtle Club, a bit farther out, since it too is highly rated on Yelp. Our guide had also recommended a local (Chinese-run) buffet called AK Buffet, and as it happened one group lunch took us there anyway. It was a big disappointment, so we quietly congratulated ourselves for not having wasted one of our own meals on it. The end result was as said above: we ate at the Pump House for three straight days and worked through almost every recommended dish on the menu. Over those three days, the oysters and steak fell short of expectations, but the seafood lived up to its name. The dish I'd recommend most is the Seafood Chowder, either à la carte or served in a whole bread bowl. The broth is rich, and unlike typical seafood chowders it carries, beyond the creaminess, a savory note reminiscent of chicken soup. A full spoonful hits the taste buds with that savor and the textures of all kinds of seafood at once, melting a day's fatigue into warmth, comfort and happiness. That, I suppose, is the whole point of a bowl of seafood chowder. The Seafood Risotto gives a similar feeling: springy shrimp, firm scallops, plump crab sticks. More importantly, the rice is cooked just right, neither overly creamy nor undercooked; I'd say it rivals or even beats the best risotto I had in Europe. Beyond those two, their much-touted King Salmon was merely decent; the Alaskan Halibut was arguably pan-seared to greater succulence (at least better than the Alaskan Cod on the same page). Finally, I can't avoid mentioning their Steamed King Crab: freshly caught Alaskan king crab^{[1]}, steamed, the method that best preserves the flavor of seafood, with no seasoning at all, served to diners before the residual heat has even dissipated. Crack a leg open with the dedicated crackers and the meat inside is warm, soft and springy, breathing the distinctive aroma of steamed seafood. A whole crab leg in hand, more than a mouthful of meat at once: that is a satisfaction no place outside Alaska can give.
By the entrance stands a taxidermied brown bear^{[2]}, taller than a person and amusingly endearing; sadly I forgot to photograph it on all three visits. Above the table we sat at twice hangs a huge mounted moose, rather imposing at first sight, though by the second visit it felt quite atmospheric. The Pump House sits on the bank of the Chena River, the largest river in Fairbanks; the view is said to be beautiful by day, and the outdoor seating open in summer is said to be another scene entirely, none of which we got to see. Since all our visits were at night and we took no photos of the entrance, here is one found online, for your enjoyment.
I'm trying to write a chatroom in this post, using the socket package^{[1]} in Python.
The general structure of this problem can be divided into three parts. In the simplest case we have two clients, namely client0 and client1, and a server. Except that the server provides the interface, everything else remains the same among these three classes: they inherit from the class socket.socket and have two methods, sending and recving. The two methods loop infinitely so that all requests are accepted unattended. Meanwhile, to avoid interruption between these two functions, we have to run them simultaneously using the threading package. The two clients report to different ports of the same host, and the server listens to both, also in an infinite loop.
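As a minimal sketch of the sending/recving loop pattern just described (the helper names are mine, and socket.socketpair stands in for a real client/server connection rather than reproducing the post's code):

```python
import socket
import threading

def recving(conn, out):
    # Receive until the peer closes its write side; collect decoded chunks.
    while True:
        data = conn.recv(1024)
        if not data:
            break
        out.append(data.decode())

def sending(conn, messages):
    # Send each queued message, then signal end-of-stream.
    for msg in messages:
        conn.sendall(msg.encode())
    conn.shutdown(socket.SHUT_WR)

# socket.socketpair() gives two connected sockets in one process,
# enough to demonstrate the two threaded loops working together.
a, b = socket.socketpair()
received = []
t = threading.Thread(target=recving, args=(b, received))
t.start()
sending(a, ["hello"])
t.join()
a.close()
b.close()
print("".join(received))  # hello
```

In the real chatroom the same two loops run against sockets connected to different ports, one thread per direction.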
The code for server.py
:
1  import socket 
The code for client0.py
:
1  import socket 
The code for client1.py
:
1  import socket 
Start server.py
first and then the two clients. The terminal screenshot is as below.
Again, this is just a very simple, toy-like chatroom, and there's a lot left to implement if you want to, like quitting schemes, front-end delivery and broadcasting in multi-client cases. However, I'm sure taking this as a starting point won't hurt. Enjoy coding!
socketserver or something else. They may be convenient, but also sometimes redundant. ↩︎

It's Black Friday today, and we went shopping. Compared with the frenzied (and chaotic) Singles' Day in China a few days ago, the holiday atmosphere in Chicago feels somehow more intimate. Christmas splendor rose over the street corners who knows when; strings of white lights sway along both sides of the road, arc after arc, all the way beyond sight. Indeed, online promotions haven't dampened Americans' enthusiasm for going out, and Michigan Ave was packed. Most of the passers-by brushing shoulders were couples arm in arm, drifting and nestling through every corner of the Windy City like reindeer on the snowfield of some dream. Now and then a family passed, a father or mother gently holding a child's hand, the child's innocent gaze meeting mine by chance. We also came across a Black gentleman who had been playing jazz at the subway entrance all day. He sat on his stool playing tune after tune without pause, not really interacting with passers-by. People came and went, and the timbre of his clarinet dissolved into the night.
The greatest fun of shopping is the browsing itself; the buying comes second, and the browsing in turn is made up of eating, drinking and playing. Eating, drinking, playing, having fun: the order of the words hints at their rank in our hearts. Same for me: eating comes first. Many happy things happened today, but if I had to rank my top three sources of happiness, the two great meals would surely be among them. Considering the distance to the AMC, and with Gyu-Kaku fully booked yet again, I finally settled on Niu Japanese Fusion Lounge, a place with seven-hundred-odd reviews (right downstairs from the AMC! A location this cheat-code-like means I'll surely be back). Their specialty is all kinds of fresh maki and nigiri, well reviewed on Yelp and OpenTable, but in the end I resolutely went for sashimi and ramen instead: if a Japanese restaurant's sushi is excellent, then, price permitting, its sashimi will bring me an even more explosive happiness; and among all sashimi, someone who hadn't tasted fish in months would obviously choose salmon, and salmon only. In fact this 4.5-star place, with no queue at all, did not let me down. The thick-cut salmon was plump and fresh, the flesh firm, the grain clear, the knife-work clean. Up close there was hardly any fishiness, rather a faint freshness. In the mouth, the thick slices slid between the teeth like orange jelly, then melted away like butter. A $22 order came with a full nine thick-cut slices, yet they vanished in a blink, leaving me sitting there, chopsticks raised, gazing around at a loss. In retrospect, that means it really was delicious.
Almost exactly a minute before the sashimi was finished, the steaming ramen arrived... It wasn't good. Don't order it, period.
Earlier, at noon, we finally tried the roast duck at Lao Sze Chuan that I'd long had my eye on. Whether because the place is just that popular or because of the Thanksgiving holiday (or both), the queue was extremely long. But the front desk wrote down everyone's name and number on paper, effectively letting you take a number and leave, which was quite nice. As in China, the duck can be ordered whole or by the half. Half a duck is plenty for one or two people; for a party of three or four friends, I'd recommend the whole bird. The ducks are roasted and carved to order. The carver stands in the middle of the front hall, methodically slicing away; when a portion is ready, a helper brings a plate to a table, together with four condiments and a freshly steamed serving of pancakes. The real surprise came a few minutes later, when they threw in a complimentary broth stewed from the duck bones. Immersed in roast-duck happiness, I nearly teared up facing that savory duck soup. The duck was neither fat nor lean, the crispy skin neither oily nor dry, the soup neither salty nor cloying. Lao Sze Chuan was worth the trip.
So many delicious things eaten today. Very happy.
While shopping we stumbled upon a huge, huge Christmas tree, twinkling all over. Very happy.
After dinner we watched A Cool Fish, laughed until my jaw nearly dislocated, then almost cried at the end. All things considered, still very happy.
And that is good enough.
This is a simple print function, overwritten so that you can specify different colors in Terminal output.
To use this feature, you'll need to import the customized print function from the ColorPrint package; the GitHub repo is here.
1  from ColorPrint import print 
and the output is as below (in Terminal):
I find this especially useful when you're trying to focus on a command-line-only workflow and don't want to reinvent the wheel over and over again.
This is the first post of my ambitious plan to enumerate as many key points about the C++ language as I can. These notes are for personal reviewing purposes only and shall definitely not be used commercially by anyone. Please just comment below for any missing C++ syntax or features. 👍🏻
Basically there's only one thing that needs attention: for standard libraries we use angle brackets, < and >, and for local headers we use quotes.
1 
These are files where we declare the functions and classes we want to use or implement in the main files.
In C++, by loading the iostream
library we can read and write by
1 

Normally libraries come with namespaces, e.g. for iostream we need the std prefix every time we print something. With namespaces we can reduce redundancy.
1 

We may also write using namespace std;, which sometimes causes problems such as name collisions.
There are a variety of data types in C++. For real numbers we have
Type  Bytes  Range

float  \(4\)  \(\pm3.4\times10^{\pm38}\)
double  \(8\)  \(\pm1.7\times10^{\pm308}\)
where we should pay enough attention to the \(\pm\) in the exponent. For general integers we have
Type  Bytes  Range

short  \(2\)  \(-2^{15}\) to \(2^{15}-1\)
int  \(4\)  \(-2^{31}\) to \(2^{31}-1\)
long  \(8\)  \(-2^{63}\) to \(2^{63}-1\)
and for each type we also have an unsigned version that starts from 0 and covers a range of the same length.
We may notice that long has a smaller range than float despite the fact that it actually costs more bytes. This is because the 4 bytes (or 32 bits) of a float \(V\) are not stored uniformly in RAM, but rather as
\[V = (-1)^S \cdot M \cdot 2^E\]
where \(S\) is the first bit, \(E\) the second through ninth bits, and \(M\) the tenth bit onward. So in a sense, because float values are more "sparse", the long type covers a smaller range.
Apart from other fundamental types like char
and bool
, we can also define our own data types or use types defined in libraries, e.g. std::string
. We may also use type aliases like
1  typedef double OptionPrice; 
We have operators for fundamental types:
Function  Operator

assignment  =
arithmetic  + - * /
comparison  > < <= >=
equality/inequality  == !=
logical  && ||
modulo  %
In C++ there're a set of shortcuts as follows:
Full Operator  Shortcut

i = i + 1;  i++; i += 1;
i = i - 1;  i--; i -= 1;
i = i * 1;  i *= 1;
i = i / 1;  i /= 1;
We may also use prefix and postfix in assignment, which are totally different. After
1  int x = 3; 
we have \(x = 4\) and \(y = 3\). After
1  int x = 3; 
we have \(x = y = 4\), since the prefix form increments before its value is used.
A general template for a C++ function:
1  resType f(argType1 arg1, argType2 arg2, ...) { 
Notice we may write multiple functions with the same name but different arguments, which we call function overloading. Meanwhile, even without overloading we can still use a function taking double on an int, because int takes up fewer bytes and the implicit conversion is safe; we call this widening or promotion. In contrast, narrowing can be dangerous and causes a build warning.
In C++ we have two kinds of comments.
1  // This is inline comment 
In C++, variables are usually passed into functions in one of two ways: by value or by reference. The first way creates a copy of the variable, and nothing happens to the original one. With the second, anything we do in the function takes effect on the original variable itself. The original variable must already exist when we create a reference, so
1  int x = 1; 
will compile, while below will not:
1  int x = 1; 
References can be extremely useful, especially when the original variable is a large object and making a copy costs considerable time and memory. However, this is potentially risky when we don't want to mess with the original object when calling a function. So we need const references.
There are two situations we should take care of when using the const keyword with references. First, we can make a reference to a const variable, and we cannot change its value:
1  const int x = 1; 
We may also bind a const reference to a variable when the original itself is not const:
1  int y = 1; 
In this case we avoid making a copy while also keeping the original variable safe from unexpected edits.
There is a third way of passing a variable: pointers. Pointers are variables that hold the address of another variable in memory. We declare a pointer by
1  int* pi; // legal but bad without initialization 
which comes with two unique operators: &
for the address of a variable, and \(*\) for the dereference of a pointer.
1  int i = 123; 
123
You can create a pointer pointing to a piece of dynamic memory for later deletion, in case memory is being an issue in your program.
1  int *p = new int; 
You can have pointers to a const variable, i.e. you cannot change its value through pointers.
1  const int x = 1; 
You can also have const pointers to variables, then you can change the value of the variable but never again the pointer (address) itself.
1  int x = 1; 
You can also have const pointers to const variables.
1  const int x = 1; 
Below is a general template for if/else
structures in C++.
1  if (condition1) { 
When there're multiple conditions, we can also use the switch
keyword.
1  switch (expression) { 
One of the most popular loops is while
loop.
1  while (condition) { 
It also has a variant called the do/while
loop.
1  do { 
which is slightly different from the while
loop in sequence.
The for loop is another form of loop, one that keeps precise track of the iterator.
1  for (initializer; condition; statement1) { 
There is an unwritten rule that we usually write ++i in statement1 because, compared with i++, which needs to make a copy, ++i is more efficient. However, this is arguably moot because modern compilers can surely optimize the difference away.
A simple but intuitive example of classes is describing a person in C++ (here we assume type string under the namespace std is used):
1  class Person { 
We can also implement member functions in the class, just to make it more convenient:
1  class Person { 
We have three levels of data protection in a class:
This means we can protect data in the class by declaring them as private while get access to them via public member functions:
1  class Person { 
An instance created based on a class is called an object. To create an object, we may need a constructor, a copy constructor and a destructor.
1  class Person { 
Class names are capitalized like Person, and private data members carry a trailing underscore like name_. Following this coding style, we have in Person.h
1 

In Person.cpp
we implement the member functions of the class:
1  string Person::GetEmail() { 
Just keep in mind that the constructors as well as the destructor should also be implemented:
1  Person::Person() { 
We can also use the colon syntax for constructors:
1  Person::Person() : name_(""), email_(""), stu_id_(0) {} 
A struct is a class with one difference: struct members are public by default, while class members are private by default.
1  struct Person { 
For a newly created class, person2 = person1 may not do what we want when assigning the whole object person1 to person2. What we can do is overload such operators (e.g. the assignment operator =) specifically for the class.
The overloadable operators include +

*
/
%
^
&

~
!
=
<
>
<=
>=
++

<<
>>
==
!=
&&

+=
=
*=
/=
&=
=
^=
%=
<<=
>>=
[]
()
>
>*
new
new[]
delete
delete[]
.
The nonoverloadable operators are ::
.*
.
?=
.
1  void Person::operator=(const Person& another_person) { 
However, such overloading does not support chained assignment like person3 = person2 = person1. We need to return a reference in order to support that.
1  Person& Person::operator=(const Person& another_person) { 
where this
is a pointer pointing to the object itself.
Another concern is self-assignment, which in some cases can be dangerous and in almost every case is inefficient. To avoid self-assignment we need to detect and skip it.
1  Person& Person::operator=(const Person& another_person) { 
In C++, a function can only be defined once; this is called the One Definition Rule (ODR). To avoid multiple inclusion of header files, we use include guards. This is done by defining a macro at the beginning of each header file.
1 

Here we introduce two of the most useful containers in the C++ Standard Library: std::vector
and std::map
. To initialize an empty vector, we use
1 

and to initialize with a specific size, we do
1 

On the other hand, map containers are like dict in Python, allowing you to use indices (keys) of any type, e.g. std::string.
1 

Data abstraction refers to the separation of interface (public functions of the class) and implementation:
Encapsulation refers to combining data and functions inside a class so that data is only accessed through the functions in the class.
We can declare a function or class as friend s.t. it can access the private and protected members of the class.
1  class MyClass { 
and you can implement and use this function change_data globally to change my_data.
Inheritance refers to building on existing classes in order to:
A simple example would be
1  class Student { 
with meanwhile
1  class Employee { 
Apparently a lot of functions and data are repeated. What we're gonna do is build a base class and reuse it in two derived classes. Note:
In actual coding, this is what we do:
1  class Person { 
with
1  class Student : public Person { 
and
1  class Employee : public Person { 
To initialize a base class, we define constructors just like what we did before:
1  Person::Person(string name, string email) : name_(name), email_(email) {} 
while for derived classes, we need to call the base class constructor
1  Student::Student(string name, string email, string major) : Person(name, email), major_(major) {} 
A derived class can access members in the base class, subject to protection-level restrictions. Protection levels public and private have their regular meanings in an inheritance class hierarchy:
- A derived class cannot access private members of a base class.
- A derived class can access public members of a base class.
A derived class can also access protected members of a base class. If a class has protected members:
- That class can access them.
- A derived class of that class can access them.
- Everyone else cannot access them.
A base class uses the virtual keyword to allow a derived class to override (provide a different implementation of) a member function. If a function is virtual in the base class:
- The base class provides an implementation for that function; we call it the default implementation.
- Derived classes inherit the function interface (definition) as well as the default implementation.
- A derived class can provide a different implementation for that function (but it does not have to).
1  class Base1 { 
and then functions like Fun1 will be overridable in derived classes. Note that the base class has to implement all functions, whether they're virtual or not.
If we don't give a default implementation of a virtual function, we call it pure virtual. This is done by appending = 0 to the declaration.
1  class Base2 { 
In this case the base class does not need to implement Fun1, and in contrast the derived class must do so. A class with pure virtual functions is called an abstract class. Note that we cannot instantiate (make an object of) an abstract class until every pure virtual function is implemented.
There's a slight difference between normal member functions, virtual functions and pure virtual functions during inheritance.
We can use a pointer or a reference to a base class to point to an object of a derived class; this substitutability is the Liskov Substitution Principle (LSP).
1  Option* option1 = nullptr; 
More direct example may be as follows. Instead of writing separately
1  double Price(EuropeanCall option, ...) { 
we can use polymorphism and write it w.r.t. the base class Option
using a reference or pointer
1  double Price(Option& option, ...) { 
For variables we declare constancy by
1  const int val = 10; 
For constant objects, e.g.
1  class Student { 
when we call
1  const Student a('Allen', 'allen@gmail.com'); 
we meet a compile error. This is because the compiler does not know that the function GetName is constant. To declare that, we need
1  class Student { 
When we have pure virtual constant member functions, we write like this: virtual type f(...) const = 0
.
A const member function cannot modify data members. The only exception is mutable data members.
1  class Student { 
The override
keyword serves two purposes:
1  class base { 
In implementing the pure virtual function foo in the derived class derived1, we do just as told by the base class. In derived2, with the override keyword we get an error for changing the function's signature while overriding; without the keyword we'd get at most a warning.
For a non-static member, changing one instance's data affects that instance only; nothing happens to other instances of the same class. A static member function or datum is associated with the class itself: change it once and the change holds for all instances.
1  class Counter { 
A regular function is generally
1  int AddOne(int x) {return x + 1;} 
while a function object implementation is
1  class AddOne { 
and for the latter we can use its instances as objects, which still work as functions.
1  vector<int> values{1, 2, 4, 5}; 
where AddOne()
is an unnamed instance of the class AddOne
.
In C++ we can define a function inline as a lambda:
auto f = [](int x, int y) { return x + y; };
The [] is called the capture clause, with rules as follows:
- [=] captures everything by value (read but no write access)
- [&] captures everything by reference (read and write access)
- [=, &x] captures everything by value except x, which is captured by reference
- [&, x] captures everything by reference except x, which is captured by value

Below we introduce some features in the STL.
Two of the most commonly seen methods are begin() and end().
We have binary_search, for_each, find_if and sort.
In the STL, algorithms are implemented as functions, and data types as containers.
1  int main() { 
1  int main() { 
1  bool PersonSortCriterion(const Person& p1, const Person& p2) { 
Combining STL algorithms with lambdas in C++ can be very efficient; we can use a lambda in a loop without defining a function beforehand.
1  vector<int> v{1, 3, 2, 4, 6}; 
We can also use it as a sorting criterion
1  std::vector<Person> ppl; 
We can have templates of a function:
1  template <class T> T sum(T a, T b) {return a + b; } 
We can also have templates of a class:
1  template <class T> 
1  int x, y; 
1 

Specifically, for the open modes we have
Mode  Description 

ios::app  Append to the end 
ios::ate  Go to the end of file on opening 
ios::binary  Open in binary mode 
ios::in  Open file for reading only 
ios::out  Open file for writing only 
ios::nocreate  Fail if you have to create it 
ios::noreplace  Fail if you have to replace 
ios::trunc  Remove all content if the file exists 
It's been months since my last update on cryptocurrency arbitrage strategies. The original version has been completely driven off the market, so I decided to develop a new one. The market is primitive and savage in many senses, by which I mean there are bound to be plenty of inefficiencies and corresponding arbitrage opportunities.
At the top of the page is the backtest PnL of the new strategy 4 from 01/01 up to yesterday, 07/26. I used 1-minute historical order-book data, 5 spreads for slippage (not sure whether that's still too conservative; it needs testing) and benchmarked against the simplest buy-and-hold strategy. It's known that the whole crypto market has experienced a huge slump since late last year, so I guess my trick works quite well. The strategy is now running on my AWS instance with real money, and I'll update this post whenever anything interesting (or frustrating) happens.
Cheers.
Update Aug 3:
I changed the screening window length and the backtest performance increased more than tenfold. The image at the top has been updated, with Sharpe ratios labelled.
It's not hard to write a swap function. The most orthodox way, as used in C++ or Java, is via a temporary variable. For example, say we have a = 0 and b = 1, and we'd like to swap the values of these two variables. The pseudocode would be something like below.
temp = a
a = b
b = temp
However, a more "Pythonic" way to do so is by literally "swapping" the values in place. Specifically, we don't even need to define a function for it, so the title picture is actually nonsense.
1  a, b = b, a 
How is that handled inside Python? Before answering that question, how is "Pythonic" defined? Well, Pythonic means code that doesn't just get the syntax right but that follows the conventions of the Python community and uses the language in the way it is intended to be used (Abien Fred Agarap^{[1]}). Talking about the conventions of the Python community, we won't be able to miss the famous Zen of Python:
1  import this 
Beautiful is better than ugly.Explicit is better than implicit.Simple is better than complex.Complex is better than complicated.Flat is better than nested.Sparse is better than dense.Readability counts.Special cases aren't special enough to break the rules.Although practicality beats purity.Errors should never pass silently.Unless explicitly silenced.In the face of ambiguity, refuse the temptation to guess.There should be one and preferably only one obvious way to do it.Although that way may not be obvious at first unless you're Dutch.Now is better than never.Although never is often better than *right* now.If the implementation is hard to explain, it's a bad idea.If the implementation is easy to explain, it may be a good idea.Namespaces are one honking great idea  let's do more of those!
So our one-line swap exactly follows these supreme principles: it's beautiful, explicit, simple and perfectly readable. The only remaining question is: what happens when we call a, b = b, a, and what are the technical differences between this lazy trick and the orthodox one?
Well, here is the thing. Like most other programming languages, Python handles assignment statements right to left. So before it actually assigns the value of a to b and vice versa, Python packages the RHS as a tuple temporarily stored in memory. It then assigns the values of this tuple to the LHS in order. That's it. As a result, unlike the orthodox swap function, which creates a temporary variable temp that stays in memory until collected manually (if we're in the global environment) or until the function's frame is destroyed, the Pythonic swap occupies doubled memory yet frees it automatically thanks to Python's garbage collection. That's a trade-off, and in cases where available memory is critically short, the more orthodox swap function might be preferable.
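A short illustration of the packing and unpacking just described:

```python
# The right-hand side is packed into a tuple first, then unpacked
# left to right, so no explicit temporary variable is needed.
a, b = 0, 1
a, b = b, a
print(a, b)  # 1 0

# The same mechanism generalizes to any number of names:
x, y, z = 1, 2, 3
x, y, z = z, x, y  # rotate the three values
print(x, y, z)  # 3 1 2
```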
Just as a supplement, there is in fact a way to swap in place while avoiding the doubled memory. The trick is illustrated as follows.
a = a + b
b = a - b
a = a - b
For large integers, we may also use XOR:
a = a ^ b
b = a ^ b
a = a ^ b
My trading bot ceased its loyal 24/7 service this morning. It's running on an Amazon EC2 server with Ubuntu 16.04, and I'm sure this time it's not an unpaid-bill issue. After some digging, I think I've finally figured out the cause of this unexpected strike: asynchronism.
Asynchronism, or in simple terms, timing discrepancy, usually means a tiny difference between the local time on your computer/server and global NTP time. It can be as small as several milliseconds, but in applications like trading such discrepancies are deemed intolerable, and any request sent from such machines is ruthlessly rejected. Computers are just machines, and they cannot keep accurate time forever. That's why we need (time) synchronization. In fact, EC2 does have such regular synchronization built in, but it seems to happen only at rather long intervals, like days. To shorten the synchronization period and avoid similar issues in the future, I'll use the Amazon Time Sync Service.
First we install the chrony
package for synchronization, and open its configuration.
1  sudo apt install chrony 
Append in the opened chrony.conf
file a line as follows.
1  server 169.254.169.123 prefer iburst 
Restart chrony
service.
1  sudo /etc/init.d/chrony restart 
[ ok ] Restarting chrony (via systemctl): chrony.service.
Make sure that chrony
is successfully synchronizing time from 169.254.169.123
chronyc sources -v
210 Number of sources = 7

Source mode:  '^' = server, '=' = peer, '#' = local clock.
Source state: '*' = current synced, '+' = combined, '-' = not combined,
              '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
Reach = reachability register (octal); Poll = Log2(polling interval).
Last sample: xxxx [ yyyy ] +/- zzzz, where xxxx = adjusted offset,
             yyyy = measured offset, zzzz = estimated error.

MS Name/IP address           Stratum Poll Reach LastRx Last sample
===============================================================================
^* 169.254.169.123                 3   6    17    12    +15us[  +57us] +/-  320us
^- tbag.heanet.ie                  1   6    17    13  -3488us[-3446us] +/- 1779us
^- ec2123423112.euwest             2   6    17    13   +893us[ +935us] +/- 7710us
^? 2a05:d018:c43:e312:ce77:6       0   6     0   10y     +0ns[   +0ns] +/-    0ns
^? 2a05:d018:d34:9000:d8c6:5       0   6     0   10y     +0ns[   +0ns] +/-    0ns
^? tshirt.heanet.ie                0   6     0   10y     +0ns[   +0ns] +/-    0ns
^? bray.walcz.net                  0   6     0   10y     +0ns[   +0ns] +/-    0ns
where ^* denotes the currently synced, preferred time source.
Finally, check the synchronization report.
chronyc tracking
Reference ID    : 169.254.169.123 (169.254.169.123)
Stratum         : 4
Ref time (UTC)  : Thu Jul 12 16:41:57 2018
System time     : 0.000000011 seconds slow of NTP time
Last offset     : +0.000041659 seconds
RMS offset      : 0.000041659 seconds
Frequency       : 10.141 ppm slow
Residual freq   : +7.557 ppm
Skew            : 2.329 ppm
Root delay      : 0.000544 seconds
Root dispersion : 0.000631 seconds
Update interval : 2.0 seconds
Leap status     : Normal
In conclusion, the server now synchronizes time with the assigned source every 2 seconds, and we should never encounter similar issues again.
This was the last photo taken before we left Giethoorn, a small yet heavenly village. Hundreds of land fragments are surrounded by tiny rivers and connected by wooden bridges barely longer than a car. Speaking of cars, the village is car-free; people commute by boat or bike. We also loved the thatched-roof houses, which I suppose had been standing there for centuries, along with the wheat fields and the huge reed marshes.
The photo is probably my favorite shot of the past two years, that is, if it beats the foggy-morning one taken in Hallstatt, Austria, at the foot of the Alps.
This is a note on Linear Discriminant Analysis (LDA) and the Regularized Matrix Discriminant Analysis (RMDA) method proposed by Jie Su et al., 2018. Both methods are suitable for efficient multiclass classification, while the latter is a state-of-the-art extension of the classical LDA s.t. data in matrix form can be classified without destroying the original structure.
The plain idea behind Discriminant Analysis is to find the optimal partition (or projection, for higher-dimensional problems) s.t. entities within the same class are distributed as compactly as possible and entities between classes are distributed as sparsely as possible. To derive closed-form solutions we impose various conditions on the covariance matrices of the input data. When we assume the covariances \(\boldsymbol{\Sigma}_k\) are equal for all classes \(k\in\{1,2,\ldots,K\}\), we're following the framework of Linear Discriminant Analysis (LDA).
As shown above, when we consider a 2-dimensional binary classification problem, LDA is equivalent to finding the optimal direction vector \(\boldsymbol{w}\) s.t. the ratio of \(\boldsymbol{w}^T\boldsymbol{S}_b\boldsymbol{w}\) (the between-class covariance of the projections) to \(\boldsymbol{w}^T\boldsymbol{S}_w\boldsymbol{w}\) (the sum of within-class covariances of the projections) is maximized. Specifically, we define
\[\boldsymbol{S}_b = (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)^T\]
and
\[\boldsymbol{S}_w = \sum_{\boldsymbol{x}\in X_0}(\boldsymbol{x} - \boldsymbol{\mu}_0)(\boldsymbol{x} - \boldsymbol{\mu}_0)^T + \sum_{\boldsymbol{x}\in X_1}(\boldsymbol{x} - \boldsymbol{\mu}_1)(\boldsymbol{x} - \boldsymbol{\mu}_1)^T.\]
Therefore, the objective of this maximization problem is
\[J = \frac{\boldsymbol{w}^T\boldsymbol{S}_b\boldsymbol{w}}{\boldsymbol{w}^T\boldsymbol{S}_w\boldsymbol{w}}\]
which is also called the generalized Rayleigh quotient.
The homogeneous objective can be equivalently rewritten as
\[\begin{align}\min_{\boldsymbol{w}}\quad &-\boldsymbol{w}^T\boldsymbol{S}_b\boldsymbol{w}\\\\\text{s.t.}\quad &\boldsymbol{w}^T\boldsymbol{S}_w\boldsymbol{w} = 1\end{align}\]
which, by the method of Lagrange multipliers, gives the solution
\[\boldsymbol{w} = \boldsymbol{S}_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)\]
and the final prediction for new data \(\boldsymbol{x}\) is based on the value of \(\boldsymbol{w}^T\boldsymbol{x}\).
For multiclass classification, the solution is similar. Here we present the score function without derivation:
\[\delta_k = \boldsymbol{x}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \log\pi_k\]
where \(\boldsymbol{\mu}_k\) is the sample mean of all data within class \(k\), and \(\pi_k\) is the proportion of all data belonging to that class. By comparing these \(K\) scores we choose the class with the highest value as the prediction.
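A minimal sketch of this scoring rule, assuming a shared covariance matrix and hypothetical toy inputs (all names below are my own, not from the notebook):

```python
import numpy as np

def lda_scores(X, means, cov_inv, priors):
    """delta_k = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)."""
    scores = []
    for mu, pi in zip(means, priors):
        lin = X @ cov_inv @ mu                       # x^T Sigma^{-1} mu_k per sample
        const = -0.5 * mu @ cov_inv @ mu + np.log(pi)
        scores.append(lin + const)
    return np.stack(scores, axis=1)                  # shape (n_samples, K)

# toy example: two well-separated classes sharing an identity covariance
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
priors = [0.5, 0.5]
X = np.array([[0.1, -0.2], [3.9, 4.2]])
pred = lda_scores(X, means, np.eye(2), priors).argmax(axis=1)
```

Picking the argmax over the \(K\) columns implements the "highest score wins" rule directly.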
We first load necessary packages.
1  %config InlineBackend.figure_format = 'retina' 
Now we define a new class called LDA
with a predict
(in fact also predict_prob
) method.
1  class LDA: 
Then we define three classes of 2D input \(\boldsymbol{X}\) and pass them to the classifier. The original as well as the predicted distributions are plotted, with the accuracy printed below.
1  np.random.seed(2) 
Training accuracy: 95.67%
For data with inherent matrix forms, like the electroencephalogram (EEG) data introduced in Jie Su (2018), the classical LDA is not the most appropriate solution since it forcibly requires vector input. To use LDA for classification on such datasets we have to vectorize the matrices and thus potentially lose some critical structural information. The authors of this paper invented a new method called Regularized Matrix Discriminant Analysis (RMDA) that naturally takes matrix input. Furthermore, noticing that inverting the large matrix \(\boldsymbol{S}_w\) in high dimensions can be computationally burdensome, they adopted the Alternating Direction Method of Multipliers (ADMM) to iteratively optimize the objective instead of the widely-used Singular Value Decomposition (SVD). A graphical representation of the RMDA compared with LDA is as follows.
The algorithm is implemented below. Notice here I skipped the Gradient Descent (GD) approach in the minimization during iterations and opted for the minimize
function in scipy.optimize
. I did so to make the structure simpler without hurting the understanding of the whole algorithm. For a more detailed illustration please refer to the original paper.
Again we first define the class RMDA
. The predict
method now takes a matrix.
1  class RMDA: 
Then we train the model and print the final accuracy.
1  np.random.seed(2) 
Optimization converged successfully.
Training accuracy: 87.00%
Further analysis and debugging should be expected. Any corrections in the comments are also welcome. 😇
This is the fifth post on optimal order execution. Based on Almgren and Chriss (2000), today we attempt to estimate the market impact coefficient \(\eta\). Specifically, for high-frequency transaction data we have the approximation \(dS = \eta\cdot dQ\), so \(\eta\) can easily be estimated by Ordinary Least Squares (OLS), using the message book data provided by LOBSTER.
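The estimation step itself is just a no-intercept regression of \(dS\) on \(dQ\). As a quick standalone sketch on synthetic numbers (the \(\eta\) value and noise level here are hypothetical, not from the data):

```python
import numpy as np

# synthetic executions: dS = eta * dQ + noise, with a hypothetical eta
rng = np.random.default_rng(42)
eta_true = 5e-4
dQ = rng.uniform(1, 100, size=5000)
dS = eta_true * dQ + rng.normal(0, 0.01, size=5000)

# no-intercept OLS: eta_hat = sum(dQ * dS) / sum(dQ^2)
eta_hat = float(dQ @ dS) / float(dQ @ dQ)
```

The same closed form is what a formula like `dS ~ dQ + 0` fits under the hood.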
We first explore the message book of Apple Inc. (symbol: AAPL
) from 09:30 to 16:00 on June 21, 2012.
1  import pandas as pd 
According to the instructions by LOBSTER, the columns of the message book are defined as follows:

Type:
- 1 means submission of a new limit order;
- 2 means cancellation (partial deletion of a limit order);
- 3 means deletion (total deletion of a limit order);
- 4 means execution of a visible limit order;
- 5 means execution of a hidden limit order;
- 7 means trading halt indicator (detailed information below).

Direction:
- -1 means sell limit order;
- 1 means buy limit order.

1  message = pd.read_csv('data/AAPL_20120621_34200000_57600000_message_1.csv', header=None) 
time  type  id  size  price  direction  

0  34200.004241  1  16113575  18  585.33  1 
1  34200.025552  1  16120456  18  585.91  1 
2  34200.201743  3  16120456  18  585.91  1 
3  34200.201781  3  16120480  18  585.92  1 
4  34200.205573  1  16167159  18  585.36  1 
1  message_plce = message[message.type==1] 
Index(['time_x', 'type_x', 'id', 'size_x', 'price_x', 'direction_x', 'time_y', 'type_y', 'size_y', 'price_y', 'direction_y'], dtype='object')
1  df = message_temp[['id', 'time_x', 'time_y', 'size_y', 'price_x', 'direction_x']] 
(15099, 7)
Here I defined a function impact
to calculate the market impact (reflected in the price deviation), such that for each successful execution, we calculate the price change after the same duration as the order's lifetime.
1  def impact(idx): 
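The notebook's own implementation is collapsed above; here is one possible reading of that definition as a self-contained sketch (the column names follow the merged frame built earlier, everything else is hypothetical):

```python
import numpy as np

def impact(order, times, prices):
    """For an order placed at time_x and executed at time_y, look up the
    quoted price after the same duration again, i.e. at
    time_y + (time_y - time_x), and return the deviation from the
    placement price."""
    horizon = order['time_y'] + (order['time_y'] - order['time_x'])
    j = np.searchsorted(times, horizon)     # first quote at or after the horizon
    if j >= len(prices):
        return np.nan                       # horizon falls beyond the sample
    return prices[j] - order['price_x']

# tiny synthetic tape: strictly increasing times with a drifting price
times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
prices = np.array([100.0, 100.1, 100.2, 100.3, 100.4])
dev = impact({'time_x': 0.0, 'time_y': 1.0, 'price_x': 100.0}, times, prices)
```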
1  df['impact'] = [impact(i) for i in df.index] 
0  1  2  3  4  5  6  7  8  9  ...  2452  2453  2454  2455  2456  2457  2458  2459  2460  2461  

dQ  1.0  10.0  9.00  40.00  18.00  100.00  18.00  18.00  66.00  18.0  ...  100.00  19.0  10.0  90.00  10.00  40.00  50.00  1.00  100.00  100.00 
dS  0.2  0.2  0.03  0.19  0.07  0.09  0.21  0.03  0.05  0.0  ...  0.01  0.0  0.0  0.05  0.05  0.05  0.05  0.08  0.08  0.03 
1  fig = plt.figure(figsize=(14, 6)) 
1  res = sm.ols(formula='dS ~ dQ + 0', data=df_reg).fit() 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.148
Model:                            OLS   Adj. R-squared:                  0.148
Method:                 Least Squares   F-statistic:                     427.7
Date:                Sat, 12 May 2018   Prob (F-statistic):           1.01e-87
Time:                        14:02:16   Log-Likelihood:                 1535.7
No. Observations:                2459   AIC:                            -3069.
Df Residuals:                    2458   BIC:                            -3064.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ             0.0005   2.37e-05     20.680      0.000       0.000       0.001
==============================================================================
Omnibus:                     2646.045   Durbin-Watson:                   1.287
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           323199.873
Skew:                           5.154   Prob(JB):                         0.00
Kurtosis:                      58.210   Cond. No.                         1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Apparently there're several outliers that result in a low \(R^2\). Here we remove outliers that are lying outside three standard deviations.
1  df_reg_no = df_reg[((df_reg.dQ  df_reg.dQ.mean()).abs() < df_reg.dQ.std() * 3) & 
1  fig = plt.figure(figsize=(14, 6)) 
1  res = sm.ols(formula='dS ~ dQ + 0', data=df_reg_no).fit() 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.296
Model:                            OLS   Adj. R-squared:                  0.295
Method:                 Least Squares   F-statistic:                     1005.
Date:                Sat, 12 May 2018   Prob (F-statistic):          1.45e-184
Time:                        14:02:20   Log-Likelihood:                 2470.2
No. Observations:                2397   AIC:                            -4938.
Df Residuals:                    2396   BIC:                            -4933.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ             0.0006   1.97e-05     31.710      0.000       0.001       0.001
==============================================================================
Omnibus:                      356.596   Durbin-Watson:                   1.108
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              567.767
Skew:                           1.012   Prob(JB):                    5.14e-124
Kurtosis:                       4.259   Cond. No.                         1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
So we conclude \(\hat{\eta}_{\text{AAPL}}=0.0006\) for the underlying time span. However, what about other companies? The coefficients are expected to vary widely, which is unfortunately the last thing we'd like to see.
We first define a function estimate
to automate what we've done above.
1  def estimate(symbol): 
The estimation for Microsoft Corp. (symbol: MSFT
) is as follows.
1  estimate('MSFT') 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.229
Model:                            OLS   Adj. R-squared:                  0.228
Method:                 Least Squares   F-statistic:                     550.7
Date:                Sat, 12 May 2018   Prob (F-statistic):          7.20e-107
Time:                        14:04:51   Log-Likelihood:                 5732.8
No. Observations:                1859   AIC:                        -1.146e+04
Df Residuals:                    1858   BIC:                        -1.146e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ          1.859e-05   7.92e-07     23.467      0.000     1.7e-05    2.01e-05
==============================================================================
Omnibus:                      201.842   Durbin-Watson:                   0.778
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              381.770
Skew:                           0.703   Prob(JB):                     1.26e-83
Kurtosis:                       4.719   Cond. No.                         1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimation for Amazon.com, Inc. (symbol: AMZN
) is as follows.
1  estimate('AMZN') 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.294
Model:                            OLS   Adj. R-squared:                  0.293
Method:                 Least Squares   F-statistic:                     328.9
Date:                Sat, 12 May 2018   Prob (F-statistic):           1.02e-61
Time:                        14:06:56   Log-Likelihood:                 809.19
No. Observations:                 791   AIC:                            -1616.
Df Residuals:                     790   BIC:                            -1612.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ             0.0007   3.74e-05     18.136      0.000       0.001       0.001
==============================================================================
Omnibus:                      141.501   Durbin-Watson:                   1.022
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              250.801
Skew:                           1.083   Prob(JB):                     3.46e-55
Kurtosis:                       4.709   Cond. No.                         1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimation for Alphabet Inc. (symbol: GOOG
) is as follows.
1  estimate('GOOG') 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.419
Model:                            OLS   Adj. R-squared:                  0.418
Method:                 Least Squares   F-statistic:                     324.2
Date:                Sat, 12 May 2018   Prob (F-statistic):           5.96e-55
Time:                        14:07:20   Log-Likelihood:                 169.55
No. Observations:                 450   AIC:                            -337.1
Df Residuals:                     449   BIC:                            -333.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ             0.0017   9.57e-05     18.005      0.000       0.002       0.002
==============================================================================
Omnibus:                       48.913   Durbin-Watson:                   1.331
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               61.896
Skew:                           0.864   Prob(JB):                     3.63e-14
Kurtosis:                       3.563   Cond. No.                         1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimation for Intel Corp. (symbol: INTC
) is as follows.
1  estimate('INTC') 
                            OLS Regression Results
==============================================================================
Dep. Variable:                     dS   R-squared:                       0.237
Model:                            OLS   Adj. R-squared:                  0.237
Method:                 Least Squares   F-statistic:                     444.2
Date:                Sat, 12 May 2018   Prob (F-statistic):           4.52e-86
Time:                        14:08:47   Log-Likelihood:                 4480.8
No. Observations:                1429   AIC:                            -8960.
Df Residuals:                    1428   BIC:                            -8954.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
dQ          2.275e-05   1.08e-06     21.076      0.000    2.06e-05    2.49e-05
==============================================================================
Omnibus:                      164.136   Durbin-Watson:                   0.716
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              284.351
Skew:                           0.762   Prob(JB):                     1.79e-62
Kurtosis:                       4.566   Cond. No.                         1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In sum, the market impact coefficients are generally significant but do not lead to high \(R^2\) values, which suggests the linear assumption might be too strong. Also, it is noteworthy that \(\hat{\eta}\) does vary widely between companies (let alone industries or equity types), which means we cannot use one estimate as a benchmark for general production usage.
Today we implement the order placement strategy in Almgren and Chriss (2000) s.t. for a certain order size \(Q\), we can estimate the probability of performing the optimal strategy in the paper within a time horizon of \(T\).
In HFT it is tolerable^{[1]} to assume that the stock price evolves according to the discrete-time arithmetic Brownian motion:
\[\begin{cases}dS(t) = \mu dt + \sigma dW(t),\\\\dQ(t) = \dot{Q}(t)dt\end{cases}\]where \(Q(t)\) is the quantity of stock we still need to order at time \(t\). Now let \(\eta\) denote the linear coefficient for temporary market impact, and let \(\lambda\) denote the penalty coefficient for risks. To minimize the cost function
\[C = \eta \int_0^T \dot{Q}^2(t) dt + \lambda\sigma\int_0^T Q(t) dt\]
we have the unique solution given by
\[Q^*(t) = Q\cdot \left(1 - \frac{t}{T^*}\right)^2\]
where \(Q\equiv Q(0)\) is the total and initial quantity to execute, and the optimal liquidation horizon \(T^*\) is given by
\[T^* = \sqrt{\frac{4Q\eta}{\lambda\sigma}}.\]
Here, \(\eta\) and \(\lambda\) are exogenous parameters and \(\sigma\) is estimated from the price time series (see the previous post) within \(K\) time units, given by
\[\hat{\sigma}^2 = \frac{\sum_{i=1}^n (\Delta_i - \hat{\mu}_{\Delta})^2}{(n-1)\tau}\]
where \(\{\Delta_i\}\) are the first-order differences of the stock price using \(\tau\) as the sample period, \(n\equiv\lfloor K / \tau\rfloor\) is the length of the array, and
\[\hat{\mu}_{\Delta} = \frac{\sum_{i=1}^n \Delta_i}{n}.\]
Notice that \(\hat{\sigma}^2\) is proved asymptotically normal with variance
\[Var(\hat{\sigma}^2) = \frac{2\sigma^4}{n}.\]
Now that we know
\[\hat{\sigma}^2 \equiv \frac{16Q^2\eta^2}{\lambda^2 \hat{T}^4} \overset{d}{\to}\mathcal{N}\left(\sigma^2, \frac{2\sigma^4}{n}\right)\]
which yields
\[\frac{16Q^2\eta^2}{\lambda^2\hat{\sigma}^2\hat{T}^4}\overset{d}{\to}\mathcal{N}\left(1, \frac{2}{n}\right),\]
to keep consistency of parameters, with \(n\equiv \lfloor K/\tau\rfloor \to\infty\) we can also write
\[\frac{16Q^2\eta^2}{\lambda^2\hat{\sigma}^2\hat{T}^4}\overset{d}{\to}\mathcal{N}\left(1, \frac{2\tau}{K}\right).\]
With this we can estimate the probability of successfully performing the strategy. Specifically, the execution strategy is given above, and the expected cost of trading is
\[C^* =\eta \int_0^{T^*} \left(\frac{2Q}{T^*}\left(1 - \frac{t}{T^*}\right)\right)^2 dt + \lambda\sigma\int_0^{T^*} Q\cdot \left(1 - \frac{t}{T^*}\right)^2 dt =\frac{4\eta Q^2}{3T^*} + \frac{\lambda \sigma QT^*}{3} = \frac{4}{3}\sqrt{\eta\lambda\sigma Q^3}.\]
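Putting the formulas of this post together in one place (all parameter values below are hypothetical; the success probability uses the normal approximation derived above, with success meaning the optimal horizon fits inside the available window \(T\)):

```python
import math

def ac_strategy(Q, eta, lam, sigma, T, tau, K):
    """Optimal horizon T*, expected cost C*, and the approximate probability
    that the optimal strategy fits inside the available horizon T."""
    T_star = math.sqrt(4 * Q * eta / (lam * sigma))
    C_star = 4 / 3 * math.sqrt(eta * lam * sigma * Q**3)
    # 16 Q^2 eta^2 / (lam^2 sigma^2 T_hat^4) -> N(1, 2*tau/K); success T_hat <= T
    # corresponds to the statistic evaluated at T lying below the N(1, .) draw
    stat = 16 * Q**2 * eta**2 / (lam**2 * sigma**2 * T**4)
    z = (1 - stat) / math.sqrt(2 * tau / K)
    prob = 0.5 * (1 + math.erf(z / math.sqrt(2)))    # standard normal CDF
    return T_star, C_star, prob

Q, eta, lam, sigma = 10, 0.01, 0.1, 0.3              # hypothetical parameters
T_star, C_star, prob = ac_strategy(Q, eta, lam, sigma, T=4.0, tau=1, K=100)
```

Note that `C_star` agrees with the two-term form \(4\eta Q^2/(3T^*) + \lambda\sigma QT^*/3\), which is a handy internal consistency check.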
1  import numpy as np 
(1.465147881156472, 0.8431842483948604)
which means there's an 84.3% probability that we can perform our order placement strategy of size 10 within 3.6405 time units, with a minimized expected trading cost of about 1.47.
How to estimate the parameters of a geometric Brownian motion (GBM)? It seems rather simple but actually took me quite some time to solve it. The most intuitive way is by using the method of moments.
First let us consider a simpler case, an arithmetic Brownian motion (ABM). The evolution is given by
\[dS = \mu dt + \sigma dW.\]
By integrating both sides over \((t,t+T]\) we have
\[\Delta \equiv S(t+T) - S(t) = \left(\mu - \frac{\sigma^2}{2}\right) T + \sigma W(T)\]
which follows a normal distribution with mean \((\mu - \sigma^2/2)T\) and variance \(\sigma^2 T\). That is to say, given \(T\) and i.i.d. observations \(\{\Delta_1,\Delta_2,\ldots,\Delta_n\}\) for different \(t\) values^{[1]}, with sample mean
\[\hat{\mu}_{\Delta} = \frac{\sum_{i=1}^n\Delta_i}{n}\overset{p}{\to}\left(\mu - \frac{\sigma^2}{2}\right)T\]
and modified sample variance
\[\hat{\sigma}_{\Delta}^2 = \frac{\sum_{i=1}^n (\Delta_i - \hat{\mu}_{\Delta})^2}{n-1} \overset{p}{\to} \sigma^2 T,\]
we have the unbiased estimator of \(\mu\)
\[\hat{\mu} = \frac{2\hat{\mu}_{\Delta} + \hat{\sigma}_{\Delta}^2}{2T}\]
and for \(\sigma^2\) we have
\[\hat{\sigma}^2 = \frac{\hat{\sigma}_{\Delta}^2}{T}.\]
Now we prove the consistency. First we consider the variance of \(\hat{\mu}_{\Delta}\)
\[Var(\hat{\mu}_{\Delta}) = \frac{Var(\Delta_1)}{n} = \frac{\sigma^2 T}{n}\]
and the variance of \(\hat{\sigma}_{\Delta}^2\)
\[Var(\hat{\sigma}_{\Delta}^2) =E(\hat{\sigma}_{\Delta}^4) - E(\hat{\sigma}_{\Delta}^2)^2 =\frac{n E[(\Delta_1-\hat{\mu}_{\Delta})^4] + n(n-1) E[(\Delta_1-\hat{\mu}_{\Delta})^2]^2}{(n-1)^2} - \sigma^4T^2 =\frac{2\sigma^4T^2}{n}.\]
The variance of \(\hat{\mu}\) is therefore given by
\[Var(\hat{\mu}) =\frac{4Var(\hat{\mu}_{\Delta}) + Var(\hat{\sigma}_{\Delta}^2)}{4T^2} =\frac{\sigma^2 (2 + \sigma^2T)}{2nT}\]
and the variance of \(\hat{\sigma}^2\) is given by
\[Var(\hat{\sigma}^2) =\frac{Var(\hat{\sigma}_{\Delta}^2)}{T^2} =\frac{2\sigma^4}{n}.\]
So the two estimators are both consistent. It should be noticed that there exists a certain "trade-off" between the efficiencies of \(\hat{\mu}_{\Delta}\) and \(\hat{\sigma}_{\Delta}^2\) when varying the value of \(T\).
For a general GBM with drift \(\mu\) and diffusion \(\sigma\), we have the SDE
\[\frac{dS}{S} = \mu dt + \sigma dW,\]
so we can integrate^{[2]} both sides over \((t,t+T]\) for any \(t\) and get
\[\Delta \equiv \ln S(t+T) - \ln S(t) = \left(\mu - \frac{\sigma^2}{2}\right) T + \sigma W(T).\]
The rest of the derivation is exactly the same.
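As a compact standalone sketch of these estimators (all parameter values hypothetical), applied to a simulated path:

```python
import numpy as np

def estimate_gbm(S, T):
    """Method-of-moments estimators for GBM drift and volatility,
    following the formulas above. S is a price path sampled every T."""
    delta = np.diff(np.log(S))            # log-increments over periods of length T
    mu_delta = delta.mean()               # -> (mu - sigma^2/2) T
    var_delta = delta.var(ddof=1)         # modified sample variance -> sigma^2 T
    sigma2_hat = var_delta / T
    mu_hat = (2 * mu_delta + var_delta) / (2 * T)
    return mu_hat, sigma2_hat

# sanity check on a simulated path (hypothetical parameters)
rng = np.random.default_rng(0)
mu, sigma, T, n = 0.05, 0.2, 1 / 252, 250_000
incr = (mu - sigma**2 / 2) * T + sigma * np.sqrt(T) * rng.standard_normal(n)
S = 100 * np.exp(np.concatenate([[0.0], np.cumsum(incr)]))
mu_hat, sigma2_hat = estimate_gbm(S, T)
```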
Now we numerically validate this against Monte Carlo simulation.
1  import numpy as np 
Statistics  Monte Carlo  Method of moments  P-Value 

E(mu_hat)  1.994533e-03  2.000000e-03  0.222191 
Var(mu_hat)  4.010866e-07  3.924000e-07   
E(sigma2_hat)  3.596733e-03  3.600000e-03  0.201573 
Var(sigma2_hat)  1.308537e-07  1.296000e-07   
Now we may safely apply this estimation in practice.
Here I'm trying to write something partly based on Cont's first model in the previous post. I plan to skip the Laplace transform and go for Monte Carlo simulation. Also, I'm trying to abandon the assumption of unified order sizes. To implement that, I need to shift from a Markov chain, which is supported by discrete spaces, to some other stochastic process that can be estimated. Moreover, although I actually considered supervised learning for this problem, I gave it up in the end: my model is inherently designed for high-frequency trading, so training for several minutes each time would be intolerable.
1  import smm 
I need smm
for multivariate stochastic processes, and scipy.optimize
for maximum likelihood estimation.
1  def retrieve_data(date): 
time  ask_price_1  ask_price_10  ask_price_100  ask_price_101  ask_price_102  ask_price_103  ask_price_104  ask_price_105  ask_price_106  ...  bid_vol_90  bid_vol_91  bid_vol_92  bid_vol_93  bid_vol_94  bid_vol_95  bid_vol_96  bid_vol_97  bid_vol_98  bid_vol_99  

1  20180129 00:00:06.951631+08:00  12688.00  12663.58  12391.48  12390.00  12389.96  12388.00  12384.22  12381.39  12380.00  ...  6.0  15.0  1.0  460.0  4.0  121.0  5.0  1.0  5.0  120.0 
2  20180129 00:00:07.792882+08:00  12676.93  12657.04  12391.48  12390.00  12389.96  12388.00  12384.22  12381.39  12380.00  ...  1.0  400.0  363.0  5.0  6.0  15.0  1.0  460.0  4.0  121.0 
3  20180129 00:00:08.702945+08:00  12643.27  12617.26  12361.27  12360.00  12359.38  12358.06  12356.22  12355.44  12354.17  ...  6.0  15.0  1.0  460.0  4.0  121.0  5.0  1.0  5.0  120.0 
4  20180129 00:00:10.998615+08:00  12666.00  12642.73  12380.00  12377.00  12374.99  12369.73  12366.43  12365.84  12361.45  ...  460.0  4.0  121.0  5.0  1.0  5.0  120.0  150.0  12.0  97.0 
5  20180129 00:00:11.742304+08:00  12674.00  12643.27  12384.22  12381.39  12380.00  12377.00  12374.99  12369.73  12366.43  ...  4.0  121.0  5.0  60.0  1.0  5.0  120.0  150.0  12.0  97.0 
Larger index means smaller values for both bid and ask prices. That's uncommon, so here I reindexed the variables s.t. bid_1
and ask_1
correspond to the best opposing quotes.
1  def rename_index(s): 
1  variables = list(data.columns[1:]) 
I dropped the time
variable simply because I don't know how to use it. Normally there're two ways to handle uneven time grids: resampling and ignoring, and I chose the latter.
1  def plot_lob(n, t, theme='w'): 
Now we make a plot of the order book within the past 10 steps, including 20 bid levels and 20 ask levels.
1  n, t = 20, 10 
Not sure if it tells any critical information. Let's make another plot. This time \(t=500\) and we only consider the best bid and ask orders.
1  fig = plt.figure(figsize=(12, 6)) 
1  price = data[[f'bid_price_{i}' for i in range(n,0,1)] + [f'ask_price_{i}' for i in range(1,n+1)]] 
A simple idea would be to input the prices and volumes of the current order book and predict future mid prices. Furthermore, it would be ideal to have a rough expectation of the minimum time until the mid price crosses a certain level, or of the expected time before my order gets executed successfully.
1  change = [] 
The calculation of change
took over 10 minutes. I don't think that's going to be workable in real applications. However, it's not a bad idea to save the result somewhere locally in case I need it later.
1  change = pd.DataFrame(np.array(change).astype(int), columns=vol.columns) 
1  change = pd.read_csv(f'data/change_{date}.csv', index_col=0) 
After some research, I decided to fit the data in change
to Student's t-distribution, the Skellam distribution, and a two-sided Weibull distribution. I'll explain below why I chose each distribution and how to estimate it.
First, the t-distribution. It is well-known for its leptokurtosis, which makes it a better alternative to the Normal distribution for many financial time series. The PDF and CDF of the t-distribution involve the Gamma function and would thus be computationally troublesome when we want to calculate the MLE of the parameters. However, notice that for any r.v. \(X\sim t(\nu,\mu,\sigma)\), we have the relationships
\[\text{Var}(X) = \begin{cases}\frac{\nu}{\nu - 2}\sigma^2 & \text{for }\nu > 2,\\\infty & \text{for }1 < \nu \le 2,\\\text{undefined} & \text{otherwise}\end{cases}\]
and
\[\text{Kur}_+(X) = \begin{cases}\frac{6}{\nu - 4} & \text{for }\nu > 4,\\\infty & \text{for }2 < \nu \le 4,\\\text{undefined} & \text{otherwise}\end{cases}\]
where \(\text{Kur}_+\equiv \text{Kur} - 3\) is the excess kurtosis. We can therefore simply use moment estimation for the t-distribution, based on the empirical variance or kurtosis.
Second, the Skellam distribution. This is mainly due to the original model used in Cont's paper, where he assumes Poisson order arrivals uniformly over time. Here I slightly extend the model s.t. bid and ask orders are modelled at the same time and represented by the r.v. \(S\equiv P_a - P_b\) where \(P_a\sim Pois(\lambda_a)\) and \(P_b\sim Pois(\lambda_b)\). This is therefore a discrete distribution with two parameters. scipy.stats
has its PMF implemented and all I need to do is numerically maximize the likelihood.
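As a lightweight sanity check on the numerical MLE route, the two rates can also be recovered by moments, since \(\text{E}[S] = \lambda_a - \lambda_b\) and \(\text{Var}(S) = \lambda_a + \lambda_b\) (the simulated rates below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
lam_a, lam_b = 3.0, 1.5                  # hypothetical arrival rates
s = rng.poisson(lam_a, 100_000) - rng.poisson(lam_b, 100_000)

# method of moments: mean = lam_a - lam_b, variance = lam_a + lam_b
m, v = s.mean(), s.var()
lam_a_hat = (v + m) / 2
lam_b_hat = (v - m) / 2
```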
For the twosided Weibull distribution, it is given by
\[Y = \begin{cases}\text{Weibull}(\lambda_1, k_1) & \text{if } Y < 0,\\\text{Weibull}(\lambda_2, k_2) & \text{otherwise}\end{cases}\]
where the shape parameters \(k_{1,2} > 0\) and scale parameters \(\lambda_{1,2} > 0\).
Therefore, the pdf is
\[f(y \mid \lambda_1, k_1, \lambda_2, k_2) = \begin{cases}\left(\frac{y}{-\lambda_1}\right)^{k_1 - 1}\exp\left(-\left(\frac{y}{-\lambda_1}\right)^{k_1}\right) & \text{if } y < 0,\\\left(\frac{y}{\lambda_2}\right)^{k_2 - 1}\exp\left(-\left(\frac{y}{\lambda_2}\right)^{k_2}\right)& \text{otherwise}\end{cases}\]
and to normalize the integration to \(1\), we also have
\[\frac{\lambda_1}{k_1} + \frac{\lambda_2}{k_2} = 1 \Rightarrow \lambda_2 = k_2 (1 - \lambda_1 / k_1)\]
which means there're in fact only three parameters to estimate.
Now we rewrite the log-likelihood as
\[\begin{align*}LL = \sum_{i=1}^n \log f(y_i) = \sum_{i=1}^n &\left((k_1-1)(\log^*(-y_i) - \log^*(\lambda_1)) - (-y_i / \lambda_1)^{k_1}\right)\mathbb{I}_{y_i < 0} + \\ &\left((k_2-1)(\log^*(y_i) - \log^*(\lambda_2)) - (y_i / \lambda_2)^{k_2}\right)\mathbb{I}_{y_i \ge 0}.\end{align*}\]
where we have the special \(\log^*(y)\equiv 0\) if \(y\le0\).
1  i = 15 # take ask_15 for example 
As coded above, in the end I didn't include the two-sided Weibull distribution because the optimization did not converge. In conclusion, for changes of order sizes (denoted by \(x\)), we use the modified t-distribution with
\[\hat{\mu} = \bar{x},\quad \hat{\sigma} = 0.3 \cdot \sqrt{\widehat{\text{Var}}(x)} + 0.7 \cdot \sqrt{\left(2 - \frac{6}{6 + 2\,\widehat{\text{Kur}}_+(x)}\right)}\]
and
\[\hat{\nu} = \frac{6}{\widehat{\text{Kur}}_+(x)} + 4\]
where
\[\widehat{\text{Kur}}_+(x) = \widehat{\text{Kur}}(x) - 3\]
while
\[\widehat{\text{Kur}}(x) = \hat{m}_4(x) / \hat{m}_2^2(x)\]
and
\[\hat{m}_4 = \sum_{i=1}^n (x_i - \bar{x})^4 / n,\quad \hat{m}_2 = \sum_{i=1}^n (x_i - \bar{x})^2 / n.\]
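The kurtosis-matching step, in code (a minimal sketch; the 0.3/0.7 blend for \(\hat\sigma\) above is the author's heuristic and is omitted here, and the test degrees of freedom are hypothetical):

```python
import numpy as np

def t_moment_fit(x):
    """Moment estimation for the t-distribution: match the excess kurtosis
    Kur_+ = 6 / (nu - 4), i.e. nu_hat = 6 / Kur_+ + 4."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    m2 = (xc**2).mean()                  # \hat m_2
    m4 = (xc**4).mean()                  # \hat m_4
    kur_plus = m4 / m2**2 - 3.0          # empirical excess kurtosis
    nu_hat = 6.0 / kur_plus + 4.0 if kur_plus > 0 else np.inf
    return x.mean(), nu_hat

# check on simulated data with a known (hypothetical) nu = 10
rng = np.random.default_rng(0)
mu_hat, nu_hat = t_moment_fit(rng.standard_t(df=10, size=500_000))
```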
Now, if we assume independence across different buckets of the order book, we can estimate the parameters of the t-distributions as below.
1  params = np.zeros([2 * n, 3]) 
array([[ 5.1589201 , 0.52536232, 11.05729 ], [ 5.86412495, 0.61454545, 12.08484143], [ 5.82376701, 4.61231884, 11.67543236], [ 6.28819815, 0.7173913 , 10.85941723], [ 6.89178374, 1.59927798, 11.25140225], [ 6.14231284, 2.29856115, 12.46686452], [ 6.4347771 , 2.22302158, 13.73785226], [ 6.17737187, 0.67753623, 12.19098061], [ 5.9250571 , 1.68231047, 12.54472066], [ 5.16886809, 0.69090909, 11.94199489], ... [ 5.94772822, 3.18181818, 12.4415555 ], [ 6.5157695 , 4.62181818, 13.67098387], [ 6.69385395, 0.66304348, 13.63770319], [ 4.99329442, 1.11510791, 11.63780506], [ 5.04144977, 1.91756272, 11.20026029], [ 5.47054269, 4.34163701, 10.66971035], [ 5.11684414, 2.35460993, 9.98656422], [ 4.89130697, 1.07092199, 11.5511127 ], [ 5.31202782, 0.58865248, 11.01769165], [ 5.17908162, 2.16961131, 10.81368767]])
When we do not ignore the correlation across all buckets, a multivariate tdistribution must be considered. Similar to multivariate Normal distributions, here we need to estimate a covariance matrix, a vector of expectations and a vector of degrees of freedom. Notice the degrees of freedom do not vary significantly across the rows in params
, so to accelerate computation I set a unified degree of freedom for all buckets, namely \(df = 7\). Using the Expectation Maximization (EM) algorithm introduced by D. Peel and G. J. McLachlan (2000), I wrote the model below to estimate this distribution.
1  class MVT: 
Now the distribution of order size movements is estimated. We can simulate the trajectory and rebuild the order book several steps into the future. Specifically, notice the predicted movement may well change the shape of the order book while, according to practical observation, the order book retains its "V" shape most of the time. Therefore, I separately re-sort both halves of the order book every time they're updated by a predicted order size movement (or "co-movement", since it should be a vector).
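The re-sorting step can be sketched as follows; the bucket layout (first half bids, second half asks, each ordered from best to deep) and the ascending direction are hypothetical simplifications of whatever the notebook actually uses:

```python
import numpy as np

def restore_v_shape(vol, n):
    """After a simulated co-movement, separately sort the two halves of the
    volume vector so each side is monotone again. Assumed layout:
    vol[:n] are bid buckets, vol[n:] are ask buckets, best quote first."""
    out = np.asarray(vol, dtype=float).copy()
    out[:n] = np.sort(out[:n])     # bid half: sizes grow with depth
    out[n:] = np.sort(out[n:])     # ask half: sizes grow with depth
    return out

vol = np.array([5., 2., 9., 1., 7., 3.])   # 3 bid buckets, 3 ask buckets
fixed = restore_v_shape(vol, 3)
```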
1  n_steps = 20 
Below is a simple sketch of this order book trajectory where I assign stronger color to the traces that are closer to the best (bid/ask) prices.
1  fig = plt.figure(figsize=(12, 6)) 
It can be seen from the figure that stronger traces are located more to the bottom, which validates our intuition since trades around the current price are more active than those to the left or the right of the order book.
With this prediction procedure implemented, we can estimate the probability of our order (placed at the price bucket order_idx
with size order_size
) being executed within n_steps
.
1  n_steps = 10 
0.861
So a limit buy order at bid_8
(\(20 - 12 = 8\)) with size 100 can be executed within 10 steps with probability 86.1%. Moreover, we can even make a 3D surface plot to get a comprehensive idea of the whole distribution.
1  def evolve(order_idx, order_size, n_steps=10, n_sim=1000): 