_{Photograph by Jürgen Schoner / CC BYSA 3.0}
A brief introduction to Slow Feature Analysis
Hlynur Davíð Hlynsson, October 20. 2017
I just started PhD studies at Ruhr University Bochum. One of the main research topics of the group I joined is called Slow Feature Analysis. To learn about a new topic, I like seeing examples and intuitive explanations if possible before submerging myself in mathematical rigor. I wrote this blog post for others who like approaching subjects in a similar manner, as I believe that this method is quite powerful and interesting.
In this post I'll lead with an example application of SFA to briefly motivate the method. Then I'll go into more detail about the math behind the method and finally provide links to other good resources on the material.
Table of contents
1. Determining a smooth latent variable
Slow feature analysis (SFA) is an unsupervised learning method to extract the smoothest (slowest) underlying functions or features from a time series. This can be used for dimensionality reduction, regression and classification. For example, we can have a highly erratic series that is determined by a nicer behaving latent variable.
Let's generate time series $D$ and $S$ with elements:
$$ d_t = \sin\left(\frac{\pi}{75} t\right)  \frac{1}{150}t $$
$$ s_t = (3.7 + 0.35 d_t) \cdot s_{t1} \cdot (1  s_{t1})\ $$
$$s_0 = 0.6,\ t = 1, \dots, 300$$
where $s_0 = 0.6$ and $t = 1, \dots, 300.$
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
series_length = 300
S = np.zeros((series_length, 1), 'd')
D = np.zeros((series_length, 1), 'd')
S[0] = 0.6
for t in range(1, series_length):
D[t] = np.sin(np.pi/75. * t)  t/150.
S[t] = (3.7+0.35*D[t]) * S[t1] * (1  S[t1])
This is known as the logistic map. By plotting the series $S$, we can inspect it's chaotic nature. The underlying time series $D$ that's driving the behaviour of the curve above is much simpler:
plt.subplot(211)
plt.plot(np.arange(series_length), S, 'r', linewidth=0.7, label="S")
plt.legend(loc='upper right')
plt.ylabel('$s_t$')
plt.subplot(212)
plt.plot(np.arange(series_length), D, label="D")
plt.legend(loc='upper right')
plt.xlabel('time t')
plt.ylabel('$d_t$')
plt.show()
How can we determine the simple, underlying driving force from the erratic time series?
We can use SFA to determine the most slowly varying features of a function. In our case, we would start off with data like $S$ and end up with $D$, without necessarily knowing beforehand how $S$ is generated.
Implementations of SFA aim at finding features of the input that are linear. But as we can see from our example, the driving force $D$ is highly nonlinear! This can be remedied by doing a nonlinear expansion of the time series $S$ first, then finding linear features of the expanded data. By doing this we find nonlinear features of the original data.
Let's create a new multivariate time series by stacking time delayed copies of $S$ on it:
\begin{bmatrix} s_{1} & s_{2} & s_{3} & \dots & s_{300} \\ \end{bmatrix}
$$\downarrow$$
\begin{bmatrix} s_{1} & s_{2} & s_{3} & \dots & s_{297} \\ s_{2} & s_{3} & s_{4} & \dots & s_{298} \\ s_{3} & s_{4} & s_{5} & \dots & s_{299} \\ s_{4} & s_{5} & s_{6} & \dots & s_{300} \end{bmatrix}
%%capture
import mdp
timeframes = mdp.nodes.TimeFramesNode(4)
timeframed_S = timeframes.execute(S)
Next we do a cubic expansion of the series. For example, at time t=1, our 4 dimensional vector $[s_1, s_2, s_3, s_4]^T$ now becomes the 34 element vector $[s_t^3,\ s_t^2,\ s_t s_u s_v,\ s_t s_v,\ s_t]^T$ for distinct $t, u, v \in \{1, 2, 3, 4\}$
cubic_expand = mdp.nodes.PolynomialExpansionNode(3)
cubic_expanded_S = cubic_expand(timeframed_S)
sfa = mdp.nodes.SFANode(output_dim=1)
slow = sfa.execute(cubic_expanded_S)
Keep in mind that the best number of time delayed copies to be added varies from problem to problem. Alternatively, if the original data is too highdimensional then dimensionality reduction needs to be done, for example with Principal Component Analysis. Consider thus the following to be the hyperparameters of the method: the method of dimensionality expansion (reduction), the output dimension after expansion (reduction) and the number of slow features to be found.
After adding the time delayed copies the length of the time series changed from 300 to 297. The corresponding length of the slow feature time series is thus 297 as well. For nicer visualization here, we turn it to length 300 by prepending the first value to it and appending the last value two times. The features found by SFA have zero mean and unit variance, so we normalize $D$ as well before visualizing the results.
slow = slow.flatten()
padded_slow = np.concatenate([[slow[0]], slow, [slow[296]], [slow[296]]])
rescaled_D = (D  np.mean(D, 0)) / np.std(D, 0)
Even considering only 300 data points, the SFA features manage to almost completely recover the underlying source which is quite impressive!
plt.plot(np.arange(series_length), rescaled_D, 'b', label='Normalized D')
plt.plot(np.arange(series_length), padded_slow, 'g', lw = 3, label='SFA estimation of D')
plt.ylabel('$d_t$')
plt.xlabel('time t')
plt.legend(loc='upper right')
plt.show()
2. So what's going on under the hood?
Theoretically, the SFA algorithm accepts as input a (multivariate) time series $\textbf{X}$ and an integer $m$ indicating the number of features to extract from the series, where $m$ is less than the dimension of the time series. The algorithm determines $m$ functions $g_j(X_t)$ such that the average of the squared time derivative of two successive time points of each $g_j$ is minimized. Intuitively, we want to maximize the slowness of the features. Writing $g_j(X_t) = y_j(t)$, we want to:
$$ y_i(t) = g_i(X_t) $$
\begin{equation} \begin{aligned} & \underset{y_i}{\text{minimize}} & & \mathbb{E}{\left( \dot{y}_i^2 \right)}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1) \\ & \text{subject to} & & \mathbb{E}{\left(y_j \right)} = 0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2) \\ &&& \mathbb{E}{\left(y^2_j \right)}= 1 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \\ & & & \mathbb{E}{\left(y_i y_j \right)} = 0, \ i \lt j \ \ \ \ \ (4) \end{aligned} \end{equation}
where $$\dot{y}_i(t) = y_i(t+1)  y_i(t)$$
The objective function $\mathbb{E}{\left( \dot{y}_i^2 \right)}$ measures the slowness of the feature. The zeromean constraint, $\mathbb{E}{\left(y_j \right)} = 0$, makes the second moments and variance of the data equivalent and simplifies the notation. The unit variance constraint, $\mathbb{E}{\left(y^2_j \right)}= 1$, discards constant solutions.
The final constraint, $\mathbb{E}{\left(y_i y_j \right)} = 0, \ i \lt j$, decorrelates our features and induces an ordering on their slowness. This means that we first find the slowest feature, then we find that next slowest feature that is orthogonal to the one before it and so on. Decorrelating the features ensures that we capture the most information.
Now, let's consider only linear features $$ y_i(t) = W_i^TX_t$$ The time series X can be "raw data" or its nonlinear expansion, see example above. Remember that even though those are linear features of the expanded data, they can still be nonlinear features of the original data.
In the following I glance over a lot of details but I want to include it for completeness. I suggest looking at the links below for more thorough explanations.
Assuming zeromean $\textbf{X}$, the linear features are found by solving the generalized eigenvalue problem $\textbf{A}\textbf{W} = \textbf{B}\textbf{W}\Lambda$. We determine $m$ eigenvalueeigenvector tuples ($\lambda_i$,$W_i$) such that $\textbf{A}W_i = \lambda_i \textbf{B} W_i$, where $$\textbf{A} :=\mathbb{E}{\left(\dot{X}_t\dot{X}_t^T \right)}$$ $$ \textbf{B} :=\mathbb{E}{\left(X_t X_t^T \right)}$$. The scalars $\lambda_i$ signify the slowness of the features, i.e. $\lambda_i < \lambda_{j} \iff y_i$ is more slowly varying than $y_{j}$. Note that because of the ordering constraint, $\lambda_i < \lambda_{i1}$ for all $i$. If you are familiar with the generalized eigenvalue problem, note as well that the eigenvalues here are increasing  not decreasing. The eigenvectors $W_i$ are the transformation vectors that define our learned features.
3. Further reading

The original paper: https://www.ini.rub.de/PEOPLE/wiskott/Reprints/WiskottSejnowski2002NeurCompLearningInvariances.pdf

Application of SFA to classification: http://cogprints.org/4104/1/Berkes2005apreprint.pdf

Example above is adapted from http://mdptoolkit.sourceforge.net/examples/logmap/logmap.html#logmap