Notes on Data-Driven Science and Engineering
2024-07-14
Dimensionality Reduction and Transforms
In this part the authors cover techniques for reducing the dimensionality of a dataset.
Singular Value Decomposition (SVD)
- Matrix approximation and image compression
- Condition number: describes the sensitivity of matrix multiplication and inversion to errors in the input $$ \kappa(\boldsymbol{A}) = \frac{\sigma_{\max}(\boldsymbol{A})}{\sigma_{\min}(\boldsymbol{A})} $$
- Principal component analysis (PCA)
- Optimal hard threshold for singular-value truncation is theoretically solvable if we can assume additive white noise; the threshold depends on the noise magnitude $\gamma$ and the matrix aspect ratio (a truncation sketch follows at the end of this list)
- Randomized SVD (involving random projections, e.g. a Gaussian random projection matrix with all its elements i.i.d. Gaussian) is especially efficient for low-rank matrices
- Tensor decompositions for generalized matrices e.g. N-way arrays or tensors
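A minimal NumPy sketch of the matrix-approximation idea above; the test matrix, noise level and truncation rank are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))  # rank-5 signal
X = L + 0.1 * rng.standard_normal((200, 100))                      # additive white noise

U, S, Vt = np.linalg.svd(X, full_matrices=False)
r = 5                                    # truncation rank, assumed known here
X_r = U[:, :r] * S[:r] @ Vt[:r]          # best rank-r approximation (Eckart-Young)

print(np.linalg.norm(X - X_r) / np.linalg.norm(X))  # relative truncation error
```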
Fourier and Wavelet Transforms
- If a function $f(x)$ is periodic and piecewise smooth, then it can be written as a Fourier series (an infinite sum of cosines and sines of increasing frequency)
- Multiplying functions in the frequency domain is equivalent to convolving them in the spatial domain (see the sketch at the end of this list)
- The Discrete Fourier Transform (DFT) is important for real data, and the Fast Fourier Transform (FFT) is revolutionary for realistic applications: by padding $n$ to the nearest power of 2, the time complexity improves from $\mathcal{O}(n^2)$ to $\mathcal{O}(n\log(n))$, with $n$ being the number of data points
- We can use the Fourier transform (and the Gabor transform, a.k.a. short-time or sliding-window Fourier transform) to analyze the behavior of dynamic systems that involve partial differential equations (PDEs), e.g. transform them into ordinary differential equations (ODEs) that are easier to solve
- Uncertainty principles: if $f(x)$ is absolutely continuous and both $xf(x)$ and $f'(x)$ are square integrable, we have $$ \left(\int_{-\infty}^{\infty}x^2|f(x)|^2\mathrm{d}x\right) \left(\int_{-\infty}^{\infty}\omega^2|\hat{f}(\omega)|^2\mathrm{d}\omega\right)\ge \frac{1}{16\pi^2} $$ which basically says the product of the variances in the spatial and frequency domains has a lower bound
- The Laplace transform extends the Fourier transform to functions that are not Lebesgue integrable (not converging to zero at infinity)
- Wavelets further extend Fourier analysis to more general orthogonal bases and partially overcome the uncertainty principle by exploiting a multi-resolution decomposition (more efficient for image compression even with very aggressive truncation)
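A quick numerical check of the convolution property mentioned above; the signal, kernel and length are arbitrary assumptions:

```python
import numpy as np

n = 256
rng = np.random.default_rng(1)
x = rng.standard_normal(n)                                # arbitrary signal
k = np.exp(-0.5 * ((np.arange(n) - n // 2) / 4.0) ** 2)   # Gaussian kernel

# Multiply in the frequency domain (O(n log n)) ...
fast = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))
# ... and compare against direct circular convolution (O(n^2)).
slow = np.array([sum(x[j] * k[(i - j) % n] for j in range(n)) for i in range(n)])

print(np.allclose(fast, slow))  # True
```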
Sparsity and Compressed Sensing
- A vector $\boldsymbol{x}\in\mathbb{R}^n$ is called K-sparse in a basis $\boldsymbol{\Psi}$ if there are exactly K non-zero elements in $\boldsymbol{s}$, where $\boldsymbol{x}=\boldsymbol{\Psi s}$. $\boldsymbol{\Psi}$ can be generic, like a Fourier or wavelet basis
- Compressed sensing exploits the sparsity of a signal in a generic basis to achieve full signal reconstruction from surprisingly few measurements (the naive combinatorial search is NP-hard; convex relaxations have only recently made the problem tractable)
- The restricted isometry property (RIP) ensures that the matrix $\boldsymbol{C\Psi}$ acts as an approximate isometry on sparse vectors (with high probability, given enough measurements), so sparse signals can be recovered
- Sparse regression, like least absolute deviations (LAD) and LASSO, can alleviate the sensitivity to outliers
- Sparse representation for classification (SRC) can be used for image processing and classifying dynamic regimes in nonlinear differential equations
- Robust principal component analysis (RPCA) seeks to decompose a data matrix $\boldsymbol{X}$ into a structured low-rank matrix $\boldsymbol{L}$ and a sparse matrix $\boldsymbol{S}$ containing outliers and corrupt data: $$ \min_{\boldsymbol{L},\boldsymbol{S}} \text{rank}(\boldsymbol{L}) + \lVert\boldsymbol{S}\rVert_0 \quad \text{s.t.}\quad \boldsymbol{L} + \boldsymbol{S} = \boldsymbol{X} $$ Since neither $\text{rank}(\cdot)$ nor $\lVert\cdot\rVert_0$ is convex, the original problem isn't tractable, but we can solve for the optimal $\boldsymbol{L}$ and $\boldsymbol{S}$ with high probability using a convex relaxation: $$ \min_{\boldsymbol{L},\boldsymbol{S}} \lVert\boldsymbol{L}\rVert_{\ast} + \lambda \lVert\boldsymbol{S}\rVert_1 \quad \text{s.t.}\quad \boldsymbol{L} + \boldsymbol{S} = \boldsymbol{X} $$ where $\lVert\cdot\rVert_{\ast}$ is the nuclear norm (sum of singular values, a proxy for rank) and $\lambda=1/\sqrt{\max(n,m)}$, with $n$ and $m$ being the dimensions of $\boldsymbol{X}$. We also need the following criteria to be satisfied: (1) $\boldsymbol{L}$ is not sparse, and (2) $\boldsymbol{S}$ is not low-rank (it's assumed to be random, so naturally it shouldn't be low-rank). This optimization problem is also known as principal component pursuit (PCP) and may be solved using the augmented Lagrange multiplier (ALM) algorithm or the alternating directions method (ADM); a sketch follows at the end of this list
- Sparse sensor placement: if we know that we're reconstructing a human face, we can dramatically reduce the number of sensors required for reconstruction or classification by optimizing sensors for a particular feature library
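A minimal sketch of PCP solved by an inexact augmented-Lagrange-multiplier iteration; the singular-value-thresholding updates follow the common scheme, and the choices of $\mu$ and $\lambda$ are typical defaults rather than anything from the book:

```python
import numpy as np

def shrink(X, tau):
    """Soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def svt(X, tau):
    """Singular value thresholding: the proximal operator of the nuclear norm."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(S, tau)) @ Vt

def rpca(X, n_iter=100):
    n, m = X.shape
    mu = n * m / (4 * np.abs(X).sum())        # a common default step parameter
    lam = 1 / np.sqrt(max(n, m))              # lambda as in the text above
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                      # Lagrange multipliers
    for _ in range(n_iter):
        L = svt(X - S + Y / mu, 1 / mu)       # low-rank update
        S = shrink(X - L + Y / mu, lam / mu)  # sparse update
        Y = Y + mu * (X - L - S)              # dual update
    return L, S
```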
Machine Learning and Data Analysis
In this part we cover a range of well-known machine learning algorithms.
Regression and Model Selection
- Regression is a generalized viewpoint of curve fitting
- Least squares is a special case of $l_p$-norm minimization
- Gradient descent (GD) solves nonlinear regressions that have no general closed-form solution:
$$ \boldsymbol{x}_{k+1}(\delta) = \boldsymbol{x}_k - \delta \nabla f(\boldsymbol{x}_k) $$
where $\delta$ is the step size. To determine the optimal $\delta$ we can minimize $F(\delta)=f(\boldsymbol{x}_{k+1}(\delta))$, namely finding $$ \frac{\partial F}{\partial \delta} = -\nabla f(\boldsymbol{x}_{k+1})\cdot\nabla f(\boldsymbol{x}_k) =0 $$ In other words, $\delta$ is chosen s.t. two consecutive gradient directions are orthogonal (a line-search sketch follows at the end of this list)
- Alternating descent method (ADM): descend on one variable at a time (potentially faster since it involves only a line search and is derivative-free)
- For a regression $\boldsymbol{Ax}=\boldsymbol{b}$, an over-determined system has more constraints than unknown variables (thus potentially no solution), while an under-determined system has fewer constraints than unknown variables (thus infinitely many solutions)
- A basic architecture for over-determined systems is to minimize the $l_2$ error while penalizing the $l_1$- and $l_2$-norms: $$ \hat{\boldsymbol{x}} = \arg\min_{\boldsymbol{x}}\lVert\boldsymbol{Ax} - \boldsymbol{b}\rVert_2 + \lambda_1 \lVert\boldsymbol{x}\rVert_1 + \lambda_2 \lVert\boldsymbol{x}\rVert_2 $$
- A basic architecture for under-determined systems is to minimize the $l_p$-norm within the solution space: $$ \min\lVert\boldsymbol{x}\rVert_p \quad \text{s.t.}\quad \boldsymbol{Ax} = \boldsymbol{b} $$
- Some standard regression methods: pseudo-inverse (least-squares regression), LASSO, ridge and robust regression
- Without cross-validation, one will almost certainly produce a non-generalizable model that is severely overfitted
- Some standard cross-validation processes include:
- K-fold: randomly partition the data into k subgroups for cross-validation
- Leave-p-out: repeatedly remove a sample of size p from the training data and keep it as the validation set
- We can use divergence and information criteria (e.g. AIC and BIC) for model selection
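A minimal sketch of GD with the exact line search described above, on an arbitrary quadratic objective; `scipy.optimize.minimize_scalar` stands in for the one-dimensional search over $\delta$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[3.0, 1.0], [1.0, 2.0]])       # an arbitrary SPD matrix
f = lambda x: 0.5 * x @ A @ x                # convex quadratic objective
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
for _ in range(20):
    g = grad(x)
    delta = minimize_scalar(lambda d: f(x - d * g)).x  # optimal step size
    x = x - delta * g
print(x)  # approaches the minimizer at the origin
```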
Clustering and Classification
- Supervised learning focuses on datasets that are labeled (including variants like semi-supervised learning, active learning and reinforcement learning), while unsupervised learning tackles problems without labels, trying to find patterns in the data in a principled way in order to determine how to cluster and classify new data
- Some famous unsupervised learning algorithms:
- K-means
- Dendrogram: including agglomerative and divisive hierarchical clustering
- Mixture models (e.g. with Gaussian distributions we have Gaussian mixture models, a.k.a. GMM) assume the data observations are a mixture of a set of k processes that combine to form the measurement. To fit a GMM we need the expectation-maximization (EM) algorithm, which is designed to find maximum-likelihood parameters of statistical models
- Some famous supervised learning algorithms:
- Linear discriminant analysis (LDA): maximizing the ratio of between-class vs within-class variances
$$
\boldsymbol{w} = \arg\max_{\boldsymbol{w}}\frac{\boldsymbol{w}^T\boldsymbol{S}_B\boldsymbol{w}}{\boldsymbol{w}^T\boldsymbol{S}_W\boldsymbol{w}}
$$
where $\boldsymbol{S}_B$ and $\boldsymbol{S}_W$ are the between- and within-class scatter matrices. This criterion is also known as the generalized Rayleigh quotient, corresponding to the generalized eigenvalue problem $$ \boldsymbol{S}_B\boldsymbol{w} = \lambda \boldsymbol{S}_W\boldsymbol{w} $$ (a sketch follows at the end of this list)
- Support vector machine (SVM):
- Linear SVM: optimizing hyperplanes to split the data into distinct clusters
- Nonlinear SVM: optimizing nonlinear functions
- Kernel SVM: sidestep the curse of dimensionality arising from nonlinear function optimization via the kernel trick
- Decision trees (DT)
- Random forest (RF): ensembling multiple distinct DT with different subsamples of the data
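A minimal sketch of the LDA direction obtained from the generalized eigenvalue problem above, on synthetic two-class Gaussian data; the data and scatter estimates are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X1 = rng.standard_normal((100, 2)) + np.array([2.0, 0.0])  # class 1
X2 = rng.standard_normal((100, 2)) - np.array([2.0, 0.0])  # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = np.cov(X1.T) + np.cov(X2.T)          # within-class scatter (up to scaling)
S_B = np.outer(m1 - m2, m1 - m2)           # between-class scatter (two classes)

# eigh solves S_B w = lambda S_W w; the top eigenvector is the LDA direction.
vals, vecs = eigh(S_B, S_W)
w = vecs[:, -1]
```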
Neural Networks and Deep Learning
- A generalized representation between input and output layers in a linear neural network (NN) would be
$$ \boldsymbol{y} = \boldsymbol{A}_M\boldsymbol{A}_{M-1}\cdots\boldsymbol{A}_2\boldsymbol{A}_1\boldsymbol{x} $$
while a nonlinear NN would be $$ \boldsymbol{y} = f_M(\boldsymbol{A}_M f_{M-1}(\boldsymbol{A}_{M-1}\cdots f_2(\boldsymbol{A}_2 f_1(\boldsymbol{A}_1\boldsymbol{x}))\cdots)) $$ where the $f(\cdot)$ can take different nonlinear forms and are called activation functions (a sketch follows at the end of this list)
- Backpropagation seeks to use the gradient of the output error to determine gradient descent updates, with the efficiency gained from computing the gradient backwards and reusing intermediate results
- Stochastic gradient descent (SGD) is GD performed on randomly sampled subsets of the data for faster computation
- Feedforward (FF) networks connect the input layer to the output layer by forming connections between the units so that they do not form a cycle
- Deep convolutional neural network (DCNN) is a famous framework for deep learning. It includes
- Convolutional layers: similar to the Gabor transform or wavelets; a kernel slides across the entire layer to transform the input, which is useful for images
- Pooling layers: using e.g. max operation to reduce the spatial size of the representation in order to reduce the number of parameters and computation in the network, efficient to control overfitting and to fit the whole computation in memory
- Fully connected layers: in contrast to pooling and convolutional which are local, a fully connected layer restores global connectivity and provides a potentially important feature space for better performance
- Dropout layers: randomly dropping nodes in the network to prevent overfitting
- Recurrent neural network (RNN) is an important class of neural network architectures that leverage sequential data streams, prevalent in e.g. speech recognition
- Autoencoder (AE) neural networks are a flexible and advantageous structure for exploiting low-dimensional features in high-dimensional data. It generalizes the linear subspace embedding of SVD/PCA to a nonlinear manifold embedding, often of a lower dimension. More specifically, it maps the original high-dimensional input vectors to a low-dimensional latent variable and then back to the high-dimensional output
- Variational autoencoder (VAE) is a probabilistic version of the AE that can generate new data instead of just extracting features as the AE does
- Denoising autoencoder (DAE) takes a partially corrupted input during training to recover the original undistorted input
- Sparse autoencoder (SAE) imposes sparsity on the hidden units during training, while having a larger number of hidden units than inputs, so that an autoencoder can learn useful structures in the input data
- Generative adversarial networks (GANs) learn how to produce synthetic data through an adversarial structure whereby two neural networks are trained simultaneously, namely a discriminator for classification (real vs fake) and a generator to produce synthetic data from a latent representation
- Markov chain (MC), although not formally a NN, shares many common features with RNN
- Hopfield network (HN) is a form of RNN for understanding human memory where each node can act as an input cell in the network, altogether making this an all-to-all connected network
- Boltzmann machine (BM), sometimes called a stochastic Hopfield network with hidden units, is a stochastic generative counterpart of the Hopfield network. It’s named after the Boltzmann distribution in statistical mechanics which is used in their sampling function
- Restricted Boltzmann machine (RBM) can be trained for either supervised or unsupervised tasks. It's a subset of the BM with the restriction that the nodes form a bipartite graph: connections exist only between the two groups, and there are no connections between nodes within a group
- Deep belief network (DBN) is a generative graphical model with connections between the layers but not between units within each layer
- Deconvolutional network (DN) is essentially a reverse of DCNN, which can be used for e.g. denoising and object recognition
- Deep convolutional inverse graphics network (DCIGN) is a form of VAE that uses DCNN for encoding and decoding
- Liquid state machine (LSM) is a spiking neural network that consists of a large collection of nodes, each of which receives time-varying input from external sources (the inputs) as well as from other nodes, enabling it to turn the time-varying input into a spatio-temporal pattern of activations in the network nodes
- Extreme learning machine (ELM) has the same underlying architecture as an LSM, except the parameters of the hidden nodes are seldom updated, i.e. learning may take as few as a single step
- Echo state network (ESN) is a type of RNN with a sparsely connected hidden layer (with typically 1% connectivity)
- Deep residual network (DRN) is a type of very deep FF network where there are extra connections that pass from one layer to a layer two to five layers downstream
- Kohonen network (KN) is also known as self-organizing feature maps, which use competitive learning to classify data without supervision and is useful for low-dimensional visualization of high-dimensional data
- Neural Turing machine (NTM) implements a neural network controller coupled to an external memory resource, which it interacts with through attentional mechanisms
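A minimal sketch of the nonlinear feedforward composition $\boldsymbol{y} = f_M(\boldsymbol{A}_M\cdots f_1(\boldsymbol{A}_1\boldsymbol{x}))$ from the first bullet above, with random weights; the layer widths and the ReLU activation are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]                         # input, two hidden layers, output
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
relu = lambda z: np.maximum(z, 0)            # activation function f_k

def forward(x):
    for A in weights[:-1]:
        x = relu(A @ x)                      # f_k(A_k x) for each hidden layer
    return weights[-1] @ x                   # linear output layer

y = forward(rng.standard_normal(4))
```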
Dynamics and Control
In this part we start using the tools mentioned above (and new tools that are based on them) to tackle actual dynamical systems.
Data-Driven Dynamical Systems
-
Continuous-time dynamical systems take the form $$ \frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{x}(t) = \boldsymbol{f}(\boldsymbol{x}(t),t;\boldsymbol{\beta}) $$ where $\boldsymbol{x}$ is the state of the system and $\boldsymbol{f}$ is a vector field that possibly depends on the state $\boldsymbol{x}$, time $t$ and a set of parameters $\boldsymbol{\beta}$. For example, below are the famous Lorenz equations $$ \begin{align*} \dot{x} &= \sigma (y-x)\\ \dot{y} &= x(\rho - z) - y\\ \dot{z} &= xy - \beta z \end{align*} $$
-
Discrete-time systems take the following form $$ \boldsymbol{x}_{k+1} = \boldsymbol{F}(\boldsymbol{x}_k) $$ where the discrete-time propagator $\boldsymbol{F}$, also known as the flow map, is sometimes induced from a continuous-time system
-
For linear dynamics like $$ \frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{x} = \boldsymbol{Ax} $$ we have the solution $$ \boldsymbol{x}(t_0 + t) = e^{\boldsymbol{A}t}\boldsymbol{x}(t_0) = \boldsymbol{Q}e^{\boldsymbol{\Lambda}t}\boldsymbol{Q}^{-1}\boldsymbol{x}(t_0) $$ where $\boldsymbol{Q}$ and $\boldsymbol{\Lambda}$ come from the eigendecomposition $\boldsymbol{A}=\boldsymbol{Q\Lambda Q}^{-1}$. By defining $\boldsymbol{z}:=\boldsymbol{Q}^{-1}\boldsymbol{x}$ we have an even simpler form: $$ \frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{z} = \boldsymbol{\Lambda}\boldsymbol{z} $$ with solution $$ \boldsymbol{z}(t_0 + t) = e^{\boldsymbol{\Lambda} t}\boldsymbol{z}(t_0) $$ which means the dynamics of each $z_i\in\boldsymbol{z}$ depends solely on itself, and that is very desirable
-
Goals with the analysis of dynamical systems:
- Future state prediction
- Design and optimization
- Estimation and control
- Interpretability and physical understanding
-
Challenges of modern dynamical systems:
- Nonlinearity
- Unknown dynamics
- High dimensionality
-
Approaches that are defining modern data-driven dynamical systems:
-
Operator-theoretic representations: to address nonlinearity in terms of infinite-dimensional but linear operators
- Data-driven regression and machine learning: discover dynamical systems from abundant data
- Reduced-order models (ROMs): to address the high dimensionality problem
-
Some open-source libraries: PyDMD and PySINDy, among others
-
Dynamic mode decomposition (DMD), particularly the version presented here, works with irregularly sampled data and with concatenated data from several different experiments or numerical simulations. Given sampled snapshots
$$ \begin{align*} \boldsymbol{X} = \begin{bmatrix} | & | & & |\\ \boldsymbol{x}(t_1) & \boldsymbol{x}(t_2) & \cdots & \boldsymbol{x}(t_m)\\ | & | & & | \end{bmatrix}\\ \boldsymbol{X}' = \begin{bmatrix} | & | & & |\\ \boldsymbol{x}(t_1') & \boldsymbol{x}(t_2') & \cdots & \boldsymbol{x}(t_m')\\ | & | & & | \end{bmatrix} \end{align*} $$
where $t_k':=t_k + \Delta t$ with a sufficiently small time step $\Delta t$. Notice we don't need to assume uniform sampling, where $t_k:=k\Delta t$ and $t_k':=t_k + \Delta t=t_{k+1}$. The DMD algorithm then seeks the leading eigendecomposition of the best-fit linear operator $\boldsymbol{A}$ that relates the two snapshot matrices in time: $$ \boldsymbol{X}' \approx \boldsymbol{AX} $$ More mathematically, we could describe the best-fit operator $\boldsymbol{A}$ as $$ \boldsymbol{A} = \arg\min_{\boldsymbol{A}} \lVert\boldsymbol{X}'-\boldsymbol{AX}\rVert_F = \boldsymbol{X}'\boldsymbol{X}^{\dagger} $$ where $\boldsymbol{X}^{\dagger}$ is the pseudo-inverse of the matrix. This algorithm generalizes the optimization framework of the exact DMD (with uniform sampling) to perform regressions to exponential-time dynamics. We can compute DMD efficiently via the SVD, since $m\ll n$ in the data and we can project $\boldsymbol{A}$ onto the smaller subspace before the decomposition. To cope with noise in the data, we can include both forward and backward operators (if physically reasonable) in the optimization
$$ \boldsymbol{X}' \approx \boldsymbol{A}_1\boldsymbol{X}\quad\text{and}\quad \boldsymbol{X}\approx\boldsymbol{A}_2\boldsymbol{X}' $$
where $\boldsymbol{A}_2\approx \boldsymbol{A}_1^{-1}$ for noise-free data. We then combine $\boldsymbol{A}_1$ and $\boldsymbol{A}_2$ to remove the systematic bias introduced by noise in the data. In summary, we have this alternative optimization: $$ \boldsymbol{A} = \arg\min_{\boldsymbol{A}} \frac{1}{2}\left(\lVert\boldsymbol{X}'-\boldsymbol{AX}\rVert_F + \lVert\boldsymbol{X} - \boldsymbol{A}^{-1}\boldsymbol{X}'\rVert_F\right) $$ which unfortunately introduces nonlinearity and non-convexity into the objective function and thus makes the problem highly intractable. To address that concern, we can alternatively formulate it as $$ \begin{align*} \boldsymbol{A} := \frac{1}{2}(\boldsymbol{A}_1 + \boldsymbol{A}_2) = \arg\min_{\boldsymbol{A}_1,\boldsymbol{A}_2} &\left(\lVert\boldsymbol{X}'-\boldsymbol{A}_1\boldsymbol{X}\rVert_F + \lVert\boldsymbol{X} - \boldsymbol{A}_2\boldsymbol{X}'\rVert_F\right)\\ \text{s.t.}& \quad \boldsymbol{A}_1\boldsymbol{A}_2 = \boldsymbol{I}\quad \text{and}\quad \boldsymbol{A}_2\boldsymbol{A}_1 = \boldsymbol{I} \end{align*} $$ which moves the nonlinearity into the constraints; we thus keep a linear, convex objective and tackle the nonlinear constraints using e.g. ADM algorithms
-
Sparse identification of nonlinear dynamics (SINDy) bypasses the intractable combinatorial search through all possible model structures, leveraging the fact that many dynamical systems $$ \frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{x} = \boldsymbol{f}(\boldsymbol{x}) $$ have dynamics $\boldsymbol{f}$ with only a few active terms in the space of possible right-hand-side functions, e.g. a few linear and quadratic terms in the Lorenz equations. We can then approximate $\boldsymbol{f}$ by a generalized linear model $$ \boldsymbol{f}(\boldsymbol{x}) \approx \sum_{k=1}^p \theta_k(\boldsymbol{x})\xi_k = \boldsymbol{\Theta}(\boldsymbol{x})\boldsymbol{\xi} $$ with as few non-zero terms in $\boldsymbol{\xi}$ as possible. We first gather time-series data $\boldsymbol{X}\in\mathbb{R}^{n\times m}$ and its first-order derivatives w.r.t. time $\dot{\boldsymbol{X}}\in\mathbb{R}^{n\times m}$. The library of candidate nonlinear functions is represented as $$ \boldsymbol{\Theta}(\boldsymbol{X}) = \begin{bmatrix} \boldsymbol{1} & \boldsymbol{X} & \boldsymbol{X}^2 & \cdots & \boldsymbol{X}^d & \cdots & \sin(\boldsymbol{X}) & \cdots \end{bmatrix} $$ so the dynamical system can now be represented by $$ \dot{\boldsymbol{X}} = \boldsymbol{\Theta}(\boldsymbol{X})\boldsymbol{\Xi} $$ and the active-term coefficients can be identified using regularized sparse regression, e.g. with an $l_1$-penalty: $$ \boldsymbol{\xi}_k = \arg\min_{\boldsymbol{\xi}_k}\lVert\dot{\boldsymbol{X}}_k - \boldsymbol{\Theta}(\boldsymbol{X})\boldsymbol{\xi}_k\rVert_2 + \lambda \lVert \boldsymbol{\xi}_k\rVert_1 $$ We can use information criteria, e.g. AIC, to choose the most parsimonious model along the Pareto frontier (a sketch on the Lorenz system follows at the end of this list)
-
Koopman operator theory introduces a way to represent a nonlinear dynamical system in terms of an infinite-dimensional linear operator acting on a Hilbert space of measurement functions of the state of the system. Mathematically, consider real-valued measurement functions $g:\boldsymbol{M}\to\mathbb{R}$ which are elements of an infinite-dimensional Hilbert space. The functions $g$ are also commonly known as observables. Typically, the Hilbert space is given by the Lebesgue square-integrable functions on $\boldsymbol{M}$. The Koopman operator $\mathcal{K}_t$ is an infinite-dimensional linear operator that acts on measurement functions $g$ as $$ \mathcal{K}_t g = g \circ \boldsymbol{F}_t $$ where $\circ$ is the composition operator. For a discrete-time system with time step $\Delta t$, the above becomes $$ \mathcal{K}_{\Delta t} g(\boldsymbol{x}_k) = g(\boldsymbol{F}_{\Delta t}(\boldsymbol{x}_k)) = g(\boldsymbol{x}_{k+1}) $$ namely the Koopman operator defines an infinite-dimensional linear dynamical system that advances the observations to the next time step. For sufficiently smooth dynamical systems we can also define a continuous-time analogue of the Koopman operator: $$ \frac{\mathrm{d}}{\mathrm{d} t}g = \mathcal{K} g $$ We can also leverage DMD, SINDy or delay coordinates to extend the Koopman operator concept in a more data-driven manner. More discussion, while available in the book, is omitted here.
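A minimal SINDy sketch on the Lorenz system, with a hand-built polynomial library and sequentially thresholded least squares (STLSQ); using exact derivatives and this particular threshold are simplifications:

```python
import numpy as np
from scipy.integrate import solve_ivp

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
def lorenz(t, s):
    x, y, z = s
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t = np.arange(0, 20, 0.002)
sol = solve_ivp(lorenz, (t[0], t[-1]), [-8.0, 8.0, 27.0], t_eval=t)
X = sol.y.T                                  # snapshots, shape (m, 3)
dX = np.array([lorenz(0, s) for s in X])     # exact derivatives, for simplicity

# Candidate library Theta(X): polynomials up to degree 2.
x, y, z = X.T
Theta = np.column_stack([np.ones_like(x), x, y, z,
                         x * x, x * y, x * z, y * y, y * z, z * z])

Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
for _ in range(10):                          # STLSQ iterations
    Xi[np.abs(Xi) < 0.1] = 0                 # zero out small coefficients
    for k in range(3):                       # refit the surviving terms
        idx = np.abs(Xi[:, k]) > 0
        Xi[idx, k] = np.linalg.lstsq(Theta[:, idx], dX[:, k], rcond=None)[0]
print(Xi)  # sparse coefficients recovering the Lorenz equations
```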
Linear Control Theory
-
Types of control:
```mermaid
%%{init:{"flowchart":{"curve":"linear"}}}%%
flowchart TB
    A(( )):::empty
    A --passive--- B(( )):::empty
    A --active--- C(( )):::empty
    C --"no sensors (open-loop)"--- D(( )):::empty
    C --sensor-based--- E(( )):::empty
    E --"disturbance (feedforward)"--- F(( )):::empty
    E --"feedback (closed-loop)"--- G(( )):::empty
    classDef empty fill:#000,stroke-width:0;
```
Correspondingly, their meanings:
- passive: no input energy needed, e.g. stop signs at intersections
- open-loop: no sensor needed, e.g. traffic signals that are pre-programmed to regulate the traffic flow
- feedforward: use disturbance to actively control the system, e.g. pre-emptively change the traffic light when a large crowd is expected to pass
- feedback: use sensors to measure the system directly and then shape the control in response to whether the system is actually achieving the desired goal, e.g. smart traffic lights with control logic informed by inductive sensors in the roadbed that measure traffic density
-
Closed-loop feedback systems:
```mermaid
%%{init:{"flowchart":{"curve":"natural"}}}%%
flowchart LR
    subgraph sub[ ]
        direction TB
        A(System):::system
        B(Controller):::controller
    end
    S(( )):::empty
    T(( )):::empty
    S -- "Disturbances w" --> sub
    sub -- "Cost J" --> T
    A -- "Sensors y" --> B
    B -- "Actuators u" --> A
    classDef system fill:#B0C4DE,stroke:#000,stroke-width:1.5px;
    classDef controller fill:#F4A460,stroke:#000,stroke-width:1.5px;
    classDef empty fill:#000,stroke-width:0;
```
Denote $\boldsymbol{y}$ as the sensor measurements, $\boldsymbol{u}$ as the actuation signal, and $\boldsymbol{x}$ as the actual system state. Denote the exogenous disturbance as $\boldsymbol{w}=\begin{bmatrix}\boldsymbol{w}_d^T & \boldsymbol{w}_n^T & \boldsymbol{w}_r^T\end{bmatrix}^T$, where $\boldsymbol{w}_d$ are disturbances to the state of the system, $\boldsymbol{w}_n$ is measurement noise and $\boldsymbol{w}_r$ represents a reference trajectory that should be tracked by the closed-loop system. A feedback system can then be mathematically described as $$ \begin{align*} \frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{x} &= \boldsymbol{f}(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w}_d)\\ \boldsymbol{y} &= \boldsymbol{g}(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w}_n) \end{align*} $$ with the goal being to construct a control law $$ \boldsymbol{u} = \boldsymbol{k}(\boldsymbol{y},\boldsymbol{w}_r) $$ that minimizes a cost function $$ J\overset{\Delta}{=}J(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w}_r) $$ The Kalman filter, for example, is just a special case where the control law is $\boldsymbol{u}=\boldsymbol{k}(\boldsymbol{y},\hat{\boldsymbol{x}},\boldsymbol{w}_r)$, with $\hat{\boldsymbol{x}}$ the full-state estimate based on the measurements of $\boldsymbol{u}$ and $\boldsymbol{y}$
-
Linear time-invariant systems: consider the following simple system (which can be seen as an approximation of a system near a fixed point $(\bar{\boldsymbol{x}}, \bar{\boldsymbol{u}})$ satisfying $\boldsymbol{f}(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})=\boldsymbol{0}$, obtained from a Taylor expansion truncated at the linear terms) $$ \begin{align*} \frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{x} &= \boldsymbol{Ax} + \boldsymbol{Bu}\\ \boldsymbol{y} &= \boldsymbol{Cx} + \boldsymbol{Du} \end{align*} $$ Notice we don't consider the disturbance $\boldsymbol{w}_d$ and noise $\boldsymbol{w}_n$ here; they will be added back when we introduce Kalman filtering
-
Unforced linear system: in the absence of control we have an unforced system $$ \frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{x} = \boldsymbol{Ax} $$ with solution given by $$ \boldsymbol{x}(t) = e^{\boldsymbol{A}t}\boldsymbol{x}(0) $$ where the matrix exponential is defined by
$$ \begin{align*} e^{\boldsymbol{A}t} &= \boldsymbol{I} + \boldsymbol{A}t + \frac{1}{2!}\boldsymbol{A}^2 t^2 + \frac{1}{3!}\boldsymbol{A}^3 t^3 + \cdots \tag{using the series expansion}\\ &= \boldsymbol{Q}e^{\boldsymbol{\Lambda} t}\boldsymbol{Q}^{-1} \tag{using the eigendecomposition} \end{align*} $$
hence simplifying the solution further into $$ \boldsymbol{x}(t) = \boldsymbol{Q}e^{\boldsymbol{\Lambda} t}\boldsymbol{Q}^{-1}\boldsymbol{x}(0) $$ where $\boldsymbol{Q}$ and $\boldsymbol{\Lambda}$ come from the eigendecomposition of $\boldsymbol{A}$. This not only makes the matrix exponential tractable, but also brings intuition about the system stability: we can prove that if all the eigenvalues of $\boldsymbol{A}$ have negative real part, then the system is stable; if any eigenvalue has positive real part, the system is unstable and will diverge from the fixed point along the corresponding unstable eigenvector direction
-
Forced linear system: with control $\boldsymbol{u}$ and zero initial condition $\boldsymbol{x}(0)=\boldsymbol{0}$, we have the solution $$ \boldsymbol{x}(t) = \int_0^t e^{\boldsymbol{A}(t-\tau)}\boldsymbol{Bu}(\tau)\mathrm{d} \tau \overset{\Delta}{=} e^{\boldsymbol{A}t}\boldsymbol{B}\ast \boldsymbol{u}(t) $$ namely the convolution of a kernel $e^{\boldsymbol{A}t}\boldsymbol{B}$ and the control input $\boldsymbol{u}(t)$. The output is thus just $$ \boldsymbol{y}(t) = \boldsymbol{Cx}(t) = \boldsymbol{C}e^{\boldsymbol{A}t}\boldsymbol{B}\ast \boldsymbol{u}(t) $$
-
Controllability: consider the controllability matrix $$ \mathcal{C} =\begin{bmatrix} \boldsymbol{B} & \boldsymbol{AB} & \boldsymbol{A}^2\boldsymbol{B} & \cdots & \boldsymbol{A}^{n-1}\boldsymbol{B} \end{bmatrix} $$ If the matrix $\mathcal{C}$ has $n$ linearly independent columns, i.e. has (column) rank $n$, then the system is controllable. The span of the columns of $\mathcal{C}$ forms a Krylov subspace that determines which state-vector directions in $\mathbb{R}^n$ may be manipulated with control; controllability implies arbitrary eigenvalue placement and reachability of any state $\boldsymbol{\xi}\in\mathbb{R}^n$ in finite time with some actuation signal $\boldsymbol{u}(t)$
-
Observability: consider the observability matrix
$$ \mathcal{O} = \begin{bmatrix} \boldsymbol{C} \\ \boldsymbol{CA} \\ \boldsymbol{CA}^2 \\ \vdots \\ \boldsymbol{CA}^{n-1} \end{bmatrix} $$
If the matrix $\mathcal{O}$ has $n$ linearly independent rows, i.e. has (row) rank $n$, then the system is observable. Interestingly, the observability criterion is mathematically the dual of the controllability criterion: the observability matrix of $(\boldsymbol{A},\boldsymbol{C})$ is the transpose of the controllability matrix of the pair $(\boldsymbol{A}^T,\boldsymbol{C}^T)$.
-
Linear-quadratic regulator (LQR): adds quadratic terms to the cost function and makes choosing the control law a well-posed optimization problem, for which there is a wealth of theoretical and numerical techniques: $$ J(t) = \int_0^t \boldsymbol{x}(\tau)^{\ast}\boldsymbol{Q}\boldsymbol{x}(\tau) + \boldsymbol{u}(\tau)^{\ast}\boldsymbol{R}\boldsymbol{u}(\tau)\mathrm{d}\tau $$ In fact, because the cost function is quadratic, there is an analytical solution for the optimal controller gains $\boldsymbol{K}_r$ given by $$ \boldsymbol{K}_r = \boldsymbol{R}^{-1} \boldsymbol{B}^{\ast} \boldsymbol{X} $$ where $\boldsymbol{X}$ is the solution to an algebraic Riccati equation (the derivation is in the book) $$ \boldsymbol{A}^{\ast}\boldsymbol{X} + \boldsymbol{XA} - \boldsymbol{XBR}^{-1}\boldsymbol{B}^{\ast}\boldsymbol{X} + \boldsymbol{Q} = \boldsymbol{0} $$ A numerical sketch follows at the end of this list. The schematic of the LQR is shown below
```mermaid
%%{init:{'flowchart':{'curve':'natural'}}}%%
flowchart LR
    subgraph sub[ ]
        direction TB
        B("LQR u=-Kx"):::controller
        A("System dx/dt=Ax+Bu"):::system
    end
    C(( )):::empty
    A -- x --> B
    B -- u --> A
    A -- y=x --> C
    classDef system fill:#B0C4DE,stroke:#000,stroke-width:1.5px;
    classDef controller fill:#F4A460,stroke:#000,stroke-width:1.5px;
    classDef empty fill:#000,stroke-width:0;
```
-
The Kalman filter (KF) is the most commonly used full-state estimator, as it optimally balances the competing effects of measurement noise, disturbances and model uncertainty. The dynamics are defined as $$ \begin{align*} \frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{x} &= \boldsymbol{Ax} + \boldsymbol{Bu} + \boldsymbol{w}_d \\ \boldsymbol{y} &= \boldsymbol{Cx} + \boldsymbol{Du} + \boldsymbol{w}_n \end{align*} $$ where $\boldsymbol{w}_d$ and $\boldsymbol{w}_n$ correspond to disturbance and noise. The KF assumes that both $\boldsymbol{w}_d$ and $\boldsymbol{w}_n$ are zero-mean Gaussian with known covariances. The estimated system will be $$ \begin{align*} \frac{\mathrm{d}}{\mathrm{d}t}\hat{\boldsymbol{x}} &= \boldsymbol{A}\hat{\boldsymbol{x}} + \boldsymbol{Bu} + \boldsymbol{K}_f(\boldsymbol{y} - \hat{\boldsymbol{y}})\\ \hat{\boldsymbol{y}} &= \boldsymbol{C}\hat{\boldsymbol{x}} + \boldsymbol{Du} \end{align*} $$
-
Linear-quadratic Gaussian (LQG) control is another optimal sensor-based feedback framework, combining the Kalman filter and LQR: $$ \begin{align*} \frac{\mathrm{d}}{\mathrm{d}t}\hat{\boldsymbol{x}} &= (\boldsymbol{A} - \boldsymbol{K}_f \boldsymbol{C} - \boldsymbol{B} \boldsymbol{K}_r)\hat{\boldsymbol{x}} + \boldsymbol{K}_f \boldsymbol{y}\\ \boldsymbol{u} &= -\boldsymbol{K}_r \hat{\boldsymbol{x}} \end{align*} $$ which is the solution w.r.t. the following ensemble-averaged version of the cost function $$ J(t) = \left\langle \int_0^t [\boldsymbol{x}(\tau)^{\ast}\boldsymbol{Q}\boldsymbol{x}(\tau) + \boldsymbol{u}(\tau)^{\ast}\boldsymbol{R}\boldsymbol{u}(\tau)]\mathrm{d}\tau \right\rangle $$ Applying LQR to $\hat{\boldsymbol{x}}$ results in the following closed-loop system (denoting $\boldsymbol{\varepsilon}=\boldsymbol{x}-\hat{\boldsymbol{x}}$): $$ \frac{\mathrm{d}}{\mathrm{d}t} \begin{bmatrix}\boldsymbol{x}\\\boldsymbol{\varepsilon}\end{bmatrix} = \begin{bmatrix} \boldsymbol{A} - \boldsymbol{BK}_r & \boldsymbol{BK}_r \\ \boldsymbol{0} & \boldsymbol{A} - \boldsymbol{K}_f\boldsymbol{C} \end{bmatrix} \begin{bmatrix} \boldsymbol{x} \\ \boldsymbol{\varepsilon} \end{bmatrix} + \begin{bmatrix} \boldsymbol{I} & \boldsymbol{0} \\ \boldsymbol{I} & -\boldsymbol{K}_f \end{bmatrix} \begin{bmatrix} \boldsymbol{w}_d \\ \boldsymbol{w}_n \end{bmatrix} $$ The LQG framework assumes an accurate model of the system and knowledge of the magnitudes of the disturbances and measurement noise, which are assumed to be Gaussian processes. These are strong assumptions and likely violated in real data.
-
Inverted pendulum on a cart: a famous dynamical-system example where the nonlinear system is given by
$$ \begin{align*} \dot{x} &= v\\ \dot{v} &=\frac{-m^2L^2g\cos(\theta)\sin(\theta) + mL^2 (mL\omega^2\sin(\theta) - \delta v) + mL^2 u}{mL^2(M + m(1 - \cos(\theta)^2))}\\ \dot{\theta} &= \omega\\ \dot{\omega} &= \frac{(m+M)mgL\sin(\theta) - mL\cos(\theta)(mL\omega^2\sin(\theta)-\delta v) - mL\cos(\theta)u}{mL^2(M+m(1-\cos(\theta)^2))} \end{align*} $$
where $x$ is the cart position, $v$ is the velocity, $\theta$ is the pendulum angle, $\omega$ is the angular velocity, $m$ is the pendulum mass, $M$ is the cart mass, $L$ is the pendulum arm length, $g$ is the gravitational acceleration, $\delta$ is a friction damping on the cart, and $u$ is a control force applied to the cart (the LQR sketch after this list uses a linearization of this system)
-
Robust control: it can be shown that LQG regulators can have arbitrarily small stability margins, making them fragile to model uncertainties, time delays and other model imperfections. We can generalize the optimal control framework by incorporating a different cost function that penalizes worst-case performance, thereby making the solution robust.
-
Laplace transform: can be seen as a one-sided generalized Fourier transform that is valid for functions that do not converge to zero as $t\to\infty$. It's particularly useful because it transforms differential equations into algebraic equations, and convolution integrals in the time domain into simple products in the frequency domain: $$ \mathcal{L}\left\{f(t)\right\} = f(s) = \int_{0^-}^{\infty}f(t)e^{-st}\mathrm{d}t $$ which gives (without proof here, and assuming $f(0)=0$): $$ \mathcal{L}\left\{\mathrm{d}f/\mathrm{d}t\right\} = sf(s) $$ Therefore a standard control system after the Laplace transform becomes $$ \begin{align*} s\boldsymbol{x}(s) &= \boldsymbol{Ax}(s) + \boldsymbol{Bu}(s)\\ \boldsymbol{y}(s) &= \boldsymbol{Cx}(s) + \boldsymbol{Du}(s) \end{align*} $$ which is easily solved: $$ \begin{align*} \boldsymbol{x}(s) &= (s\boldsymbol{I} - \boldsymbol{A})^{-1}\boldsymbol{Bu}(s)\\ \boldsymbol{y}(s) &= [\boldsymbol{C}(s\boldsymbol{I} - \boldsymbol{A})^{-1}\boldsymbol{B} + \boldsymbol{D}]\boldsymbol{u}(s)\overset{\Delta}{=}\boldsymbol{G}(s)\boldsymbol{u}(s) \end{align*} $$ with $\boldsymbol{G}(s)$ defined above as the transfer function. For open-loop control (using system inversion) we need this $\boldsymbol{G}$ to be stable and to have more zeros than poles.
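A minimal sketch of the LQR gain computed from the algebraic Riccati equation, applied to an assumed linearization of the cart-pendulum about its upright equilibrium; the matrices and parameter values are illustrative, not the book's exact code:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

m, M, L, g, delta = 1.0, 5.0, 2.0, 9.8, 1.0   # illustrative parameters
A = np.array([[0, 1, 0, 0],
              [0, -delta / M, m * g / M, 0],
              [0, 0, 0, 1],
              [0, -delta / (M * L), (m + M) * g / (M * L), 0]])
B = np.array([[0.0], [1 / M], [0.0], [1 / (M * L)]])
Q, R = np.eye(4), np.array([[0.001]])         # state and actuation weights

X = solve_continuous_are(A, B, Q, R)   # solves A*X + XA - XBR^{-1}B*X + Q = 0
K_r = np.linalg.solve(R, B.T @ X)      # K_r = R^{-1} B^* X
print(np.linalg.eigvals(A - B @ K_r))  # closed-loop eigenvalues, all stable
```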
Balanced Models for Control
- Model reduction: in many nonlinear systems it’s possible to use linear control techniques e.g. to delay transition from laminar to turbulent flow in a spatially developing boundary layer, to reduce skin-friction drag in wall turbulence, and to stabilize the flow past an open cavity
- There are two broad approaches to obtaining reduced-order models (ROMs):
    - Start with a high-dimensional system, e.g. the discretized Navier-Stokes equations, and project the dynamics onto an identified low-dimensional subspace
    - Collect data from a simulation or an experiment and identify a low-rank model using data-driven techniques, a.k.a. system identification
- Balanced model reduction: take a high-dimensional system and reduce its complexity in the less influential directions while keeping the overall important dynamics
- Eigensystem realization algorithm (ERA): produces low-dimensional linear input-output models from sensor measurements of an impulse-response experiment, based on "minimal realization" theory (a sketch follows this list)
- Observer Kalman filter identification (OKID) was developed to complement the ERA for lightly damped experimental systems with noise
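A minimal SISO sketch of ERA built from Markov parameters (the impulse response); conventions for Hankel block sizes and scaling vary across references, so treat this as one plausible variant:

```python
import numpy as np

def era(h, r, m0=50):
    """h: impulse response with h[k] = C A^{k-1} B for k >= 1 (len(h) > 2*m0);
    r: reduced model order; m0: Hankel block size."""
    H = np.array([[h[i + j + 1] for j in range(m0)] for i in range(m0)])
    H2 = np.array([[h[i + j + 2] for j in range(m0)] for i in range(m0)])  # shifted
    U, S, Vt = np.linalg.svd(H)
    Sr = np.diag(np.sqrt(S[:r]))           # Sigma^{1/2}
    Si = np.diag(1 / np.sqrt(S[:r]))       # Sigma^{-1/2}
    Ar = Si @ U[:, :r].T @ H2 @ Vt[:r].T @ Si
    Br = (Sr @ Vt[:r])[:, :1]
    Cr = (U[:, :r] @ Sr)[:1, :]
    return Ar, Br, Cr
```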
Advanced Data-Driven Modeling and Control
In this part we cover a range of data-driven algorithms, specifically designed for systems that lack a principled model.
Data-Driven Control
- Model predictive control (MPC): adds weighted $l_2$-norms of the tracking error, the control and the control changes to the cost function: $$ J(\boldsymbol{x}_i) = \sum_{k=0}^{m_p-1}\lVert \hat{\boldsymbol{x}}_{i+k}-\boldsymbol{r}_{i+k} \rVert_{\boldsymbol{Q}}^2 + \sum_{k=1}^{m_c-1}\left(\lVert\boldsymbol{u}_{i+k}\rVert_{\boldsymbol{R}}^2 + \lVert\Delta\boldsymbol{u}_{i+k}\rVert_{\boldsymbol{R}}^2 \right) $$ where $\boldsymbol{Q}$ and $\boldsymbol{R}$ are from the original LQR cost function, $m_p$ is the prediction horizon (number of prediction time steps), and $m_c$ is the control horizon (number of time steps over which control costs are counted)
- Nonlinear system identification for control: use system identification techniques on nonlinear control systems with full-state measurements of $\boldsymbol{x}$
- DMD with control (DMDc): naively applying DMD to data from a system with actuation often results in incorrect dynamics, as the effects of internal dynamics are confused with the effects of actuation. Instead, if the actuation signal is measured, a new DMD regression may be formulated to disambiguate the effect of internal dynamics from that of actuation and control. By including a matrix of the actuation input history $$ \boldsymbol{\Upsilon} = \begin{bmatrix} | & | & & | \\ \boldsymbol{u}_1 & \boldsymbol{u}_2 & \cdots & \boldsymbol{u}_m\\ | & | & & | \end{bmatrix} $$ we have the DMDc equations in matrix form: $$ \boldsymbol{X}' \approx \boldsymbol{AX} + \boldsymbol{B\Upsilon} $$ from which we can solve for $\boldsymbol{A}$ and $\boldsymbol{B}$ using least-squares regression and the SVD for dimensionality reduction (a sketch follows at the end of this list)
- Koopman operator nonlinear control
- SINDy with control
- Machine learning control
- Reinforcement learning
- Iterative learning control (ILC): faster, not model-based, iterative optimization
- Genetic algorithms: start with a population of different genetic sequences and propagate through generations until convergence
- Genetic programming (GP): instead of searching for near-global optima, search for the best expression/program, e.g. finding the dynamical-system equations of a certain dataset
- Adaptive extremum-seeking control: similar to ILC but for continuous systems, updating parameters in real time in search of an extremum of the output
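A minimal sketch of the DMDc regression: stack the state and actuation snapshots and solve a single least-squares problem for $[\boldsymbol{A}\ \boldsymbol{B}]$; the toy system is illustrative, and the SVD-based rank reduction step is omitted for brevity:

```python
import numpy as np

def dmdc(X, Xp, U):
    """X, Xp: n-by-m snapshot matrices with Xp one step ahead of X;
    U: q-by-m actuation history."""
    Omega = np.vstack([X, U])              # augmented data matrix
    G = Xp @ np.linalg.pinv(Omega)         # G = [A  B] via least squares
    n = X.shape[0]
    return G[:, :n], G[:, n:]

# Usage on toy data generated by known dynamics x_{k+1} = A x_k + B u_k:
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])
X = np.zeros((2, 50))
U = rng.standard_normal((1, 50))
for k in range(49):
    X[:, k + 1] = A_true @ X[:, k] + B_true[:, 0] * U[0, k]
A_hat, B_hat = dmdc(X[:, :-1], X[:, 1:], U[:, :-1])  # recovers A_true, B_true
```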
Reinforcement Learning
- The basic feedback loop of reinforcement learning is shown below
```mermaid
flowchart LR
    A("Agent π(s,a,θ)"):::agent
    B(Environment):::env
    A --->|Action a| B
    B -.->|Reward r| A
    B --->|Observe State s| A
    classDef agent fill:#ffa200,stroke-width:1.5px,stroke:black
    classDef env fill:#d47dff,stroke-width:1.5px,stroke:black
```
where the policy is often formulated as an optimization problem to give a probabilistic action $$ \boldsymbol{\pi}(\boldsymbol{s},\boldsymbol{a})=\text{Pr}(\boldsymbol{a}|\boldsymbol{s})\approx\boldsymbol{\pi}(\boldsymbol{s},\boldsymbol{a},\boldsymbol{\theta}) $$ and the environment is modeled as a Markov decision process (MDP) $$ P(\boldsymbol{s}',\boldsymbol{s},\boldsymbol{a}) = \text{Pr}(\boldsymbol{s}_{k+1}=\boldsymbol{s}'|\boldsymbol{s}_k=\boldsymbol{s},\boldsymbol{a}_k=\boldsymbol{a}) $$ with a reward function $$ R(\boldsymbol{s}',\boldsymbol{s},\boldsymbol{a}) = \text{Pr}(\boldsymbol{r}_{k+1}|\boldsymbol{s}_{k+1}=\boldsymbol{s}',\boldsymbol{s}_k=\boldsymbol{s},\boldsymbol{a}_k=\boldsymbol{a}) $$ The value function, given a policy $\boldsymbol{\pi}$, is then just $$ V_{\boldsymbol{\pi}}(\boldsymbol{s}) = \mathbb{E}\left(\sum_{k=0}^{\infty}\gamma^k\boldsymbol{r}_k \,\middle|\, \boldsymbol{s}_0=\boldsymbol{s}\right) $$ where $\gamma$ is the discount rate, as future rewards should be discounted relative to the more valuable current rewards. The optimal policy is then $$ \boldsymbol{\pi} = \arg\max_{\boldsymbol{\pi}} \mathbb{E}(\boldsymbol{r}_0 + \gamma V(\boldsymbol{s}')) $$ where $\boldsymbol{s}'=\boldsymbol{s}_{k+1}$. This is called the Bellman equation.
- Categorization of RL techniques:
```mermaid
flowchart TB
    A[/Deep RL/]:::deep
    classDef deep fill:#fff,stroke-width:1.5px,stroke:#000
    subgraph S1["Model-based RL"]
        direction TB
        B("Dynamic Programming (Bellman Optimality)")
        D("Actor Critic")
        E("Deep MPC")
        B ~~~ D
        D ~~~ E
    end
    subgraph S2["Model-free RL"]
        subgraph SUB1["Gradient Free"]
            subgraph sub1["Off Policy"]
                direction TB
                F("Q-learning")
                G("DQN")
                F ~~~ G
            end
            subgraph sub2["On Policy"]
                H("SARSA")
            end
            sub1 ~~~ sub2
        end
        subgraph SUB2["Gradient-based"]
            I("Deep Policy Network")
            J("Policy Gradient Optimization")
            I ~~~ J
        end
        SUB1 ~~~ SUB2
    end
    S1 ~~~ S2
    D --> A
    E --> A
    G --> A
    I --> A
```
We will talk about these one by one below.
- Model-based optimization and control
- Dynamic programming (DP): Bellman equation with top-down or bottom-up approach
- Value iteration: update the values of all states until convergence
- Policy iteration: initialize, then iteratively update the value and policy until convergence
- Model-free reinforcement learning
- Monte Carlo (MC) learning: estimate the value of states or actions by averaging the returns after the whole trajectory
- Temporal difference (TD) learning: the continuous counterpart of MC learning, updating estimates during a trajectory rather than only at its end
- State-action-reward-state-action (SARSA) learning: a specific type of TD learning that is used to learn the Q function on-policy
$$ Q^{\text{new}}(\boldsymbol{s}_k,\boldsymbol{a}_k) = Q^{\text{old}}(\boldsymbol{s}_k,\boldsymbol{a}_k) + \alpha(R_{\Sigma}^{(n)} - Q^{\text{old}}(\boldsymbol{s}_k,\boldsymbol{a}_k)) $$
Note it's on-policy because the actual action sequence has been used to receive the rewards $\boldsymbol{r}$ and evaluate the Q function at the (n+1)-th step
- Q-learning: slightly different from SARSA
$$ Q^{\text{new}}(\boldsymbol{s}_k,\boldsymbol{a}_k) = Q^{\text{old}}(\boldsymbol{s}_k,\boldsymbol{a}_k) + \alpha(\boldsymbol{r}_k + \gamma \max_{\boldsymbol{a}}Q(\boldsymbol{s}_{k+1},\boldsymbol{a}) - Q^{\text{old}}(\boldsymbol{s}_k,\boldsymbol{a}_k)) $$
Notice here the target value is $\max_{\boldsymbol{a}} Q(\boldsymbol{s}_{k+1},\boldsymbol{a})$, and thus Q-learning is off-policy, as it uses the optimal action for the updates of the Q function. This makes experience replay possible (learning from stored or observed transitions, including the optimal alternatives) and is closely related to how imitation learning works. We can also introduce $\varepsilon$-greedy exploration when taking $\max_{\boldsymbol{a}}Q$, which is related to simulated annealing from optimization and makes the result more robust (a tabular sketch follows at the end of this list).
- Deep reinforcement learning
- Deep Q-learning (DQN): approximate the (otherwise tabular) Q function with a neural network
- Actor-critic networks: the actor is policy-based and the critic is value-based; use the temporal-difference signal from the critic to update the policy
- Optimal nonlinear control
- Hamilton-Jacobi-Bellman (HJB) equation: extends the Bellman equation to nonlinear systems defined by differential equations
$$ -\frac{\partial V}{\partial t} = \min_{\boldsymbol{u}(t)}\left(\left(\frac{\partial V}{\partial \boldsymbol{x}}\right)^T\!\!\!\boldsymbol{f}(\boldsymbol{x}(t),\boldsymbol{u}(t)) + \mathcal{L}(\boldsymbol{x}(t),\boldsymbol{u}(t))\right) $$
which gives the optimal policy for more general problems. The equation is usually only solved numerically.
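A minimal sketch of tabular, off-policy Q-learning with $\varepsilon$-greedy exploration on a toy 1-D chain; the environment, rewards and hyperparameters are entirely illustrative:

```python
import numpy as np

n_states, n_actions = 6, 2              # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9
rng = np.random.default_rng(0)

for episode in range(500):
    eps = max(0.05, 1.0 - episode / 200)            # decaying exploration
    s = n_states // 2
    while s not in (0, n_states - 1):               # both ends are terminal
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = s + 1 if a == 1 else s - 1
        r = 1.0 if s_next == n_states - 1 else 0.0  # reward only at the right end
        # Off-policy target: max_a Q(s', a), not the action the policy takes next.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # the greedy policy moves right in the interior states
```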
Reduced-Order Models (ROMs)
- Proper orthogonal decomposition (POD), as sketched after this list:
    - Measure and observe data $\boldsymbol{X}$
    - De-mean to get $\boldsymbol{Y}:=\boldsymbol{X} - \overline{\boldsymbol{X}}$
    - Take the SVD to get the spatial patterns (modes) $\boldsymbol{\Phi}$ and temporal patterns $\boldsymbol{\Psi}$ of the original data: $\boldsymbol{Y}=\boldsymbol{\Phi\Sigma\Psi}^{\ast}$
    - Examine the modes to understand the primary spatial structures in the flow; the corresponding singular values in $\boldsymbol{\Sigma}$ indicate the importance of each mode
    - Use the most significant modes to reconstruct an approximate version of the original dataset
- Fourier mode expansion: by replacing the SVD temporal modes with Fourier modes, we gain (temporal) interpretability, lower noise sensitivity and faster computation
- Continuous-time and neural network versions of POD: skipped
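A minimal sketch of the POD recipe above on synthetic snapshot data (rows as spatial points, columns as times; all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 200))  # snapshots

Y = X - X.mean(axis=1, keepdims=True)            # subtract the mean field
Phi, Sigma, PsiT = np.linalg.svd(Y, full_matrices=False)

energy = Sigma**2 / np.sum(Sigma**2)             # relative importance of modes
r = 3
Y_r = Phi[:, :r] * Sigma[:r] @ PsiT[:r]          # reconstruction from r modes
```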
Interpolation for Parametric Reduced-Order Models
This chapter is about sparse interpolation methods for the rapid, low-dimensional construction of ROMs.
Physics-Informed Machine Learning
- Mathematical foundation of physics-informed ML is still data-driven system identification
- Four critical aspects of developing a robust sensing framework
- Sensor placement
- Sensor cost
- Discovery of the measurement map (DNN can help)
- Multi-modal data integration from diverse sensor types (SVD, ROMs and SINDy autoencoder)
- Prominent algorithms and frameworks:
- Fourier forecast
- Koopman forecast
- DeepONet (universal approximation theorem for nonlinear operators, namely NN can approximate not only functions, but also nonlinear operators, which is mapping from functions to functions)
- Physics-informed neural networks (PINN): use a NN to approximate the solution of a PDE based on physical laws. The approximation aims to not only satisfy the PDE but also fit the data and its boundary/initial conditions (a minimal sketch follows at the end of this section)
- Boundary value problems (BVPs) involve solving differential equations where the solution is required to satisfy certain conditions at the boundaries of the domain. Deep learning helps here because
- It can approximate the system and turn nonlinear problems into linear ones, so that we can exploit linear superposition
- It reduces the dimensionality, which eases both the numerical computation burden and the analytical difficulty
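A minimal PINN sketch in PyTorch for the 1-D heat equation $u_t = u_{xx}$; the network size, collocation sampling and initial condition are illustrative assumptions, and the boundary-condition loss terms are omitted for brevity:

```python
import torch

# u(x, t) is approximated by a small fully connected network.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def pde_residual(xt):
    """Residual of u_t - u_xx at collocation points xt = (x, t)."""
    xt = xt.requires_grad_(True)
    u = net(xt)
    du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = du[:, 0:1], du[:, 1:2]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0:1]
    return u_t - u_xx

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(1000):
    xt = torch.rand(256, 2)                                         # collocation points
    x0 = torch.cat([torch.rand(64, 1), torch.zeros(64, 1)], dim=1)  # t = 0 points
    u0 = torch.sin(torch.pi * x0[:, 0:1])                           # assumed initial condition
    loss = pde_residual(xt).pow(2).mean() + (net(x0) - u0).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```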