In Independent Component Analysis (ICA), the assumption is that our data X is a linear mixture of statistically independent sources S, i.e. X = WS where W is the mixing matrix. W is unknown, so the goal is to estimate its inverse and get the sources back. A famous application is the cocktail party problem: given recordings of overlapping conversations, recover the individual voices.
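To make the mixing model concrete, here is a minimal NumPy sketch. The two sources, the sample count and the entries of W are arbitrary choices for illustration, and the unmixing step cheats by using the true W, which ICA does not have access to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical independent, non-Gaussian sources, one per row.
n_samples = 1000
S = np.vstack([
    rng.uniform(-1.0, 1.0, n_samples),          # uniform noise
    np.sign(rng.standard_normal(n_samples)),    # random +/-1 signal
])

# An arbitrary invertible mixing matrix (an assumption for this demo).
W = np.array([[1.0, 0.5],
              [0.3, 2.0]])

X = W @ S                             # observed mixtures, X = WS
S_recovered = np.linalg.inv(W) @ X    # only possible here because W is known

print(np.allclose(S_recovered, S))
```

In a real ICA setting only X is observed, so the inverse has to be learned from the statistics of X alone.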
The sources are retrieved either by maximizing their non-Gaussianity (we won’t go into it here, but by the central limit theorem a mixture of independent random variables tends to be more Gaussian than each of its terms) or by minimizing the mutual information between the components, which is the focus of the rest of the post.
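As a quick aside on the non-Gaussianity argument, the sketch below compares the excess kurtosis (zero for a Gaussian) of a uniform source with that of an equal-weight mixture of two such sources. The helper `excess_kurtosis`, the choice of uniform sources and the mixing weights are assumptions made for this illustration.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis; it is 0 for a Gaussian variable."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
s1 = rng.uniform(-1.0, 1.0, 100_000)
s2 = rng.uniform(-1.0, 1.0, 100_000)
mix = 0.5 * s1 + 0.5 * s2   # a simple mixture of the two sources

# In theory a uniform variable has excess kurtosis -1.2 and this
# mixture has -0.6, i.e. the mixture is closer to Gaussian.
print(excess_kurtosis(s1), excess_kurtosis(mix))
```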
Consider the mutual information of variables A and B:
$$I(A; B) = D_{\mathrm{KL}}\left(P_{AB} \,\|\, P_A \otimes P_B\right)$$
It measures how much you learn about one variable when you observe the other. The right-hand side is the KL divergence between their joint distribution and the product of marginals. A mutual information of 0 means that A and B are independent: the KL divergence is zero exactly when the two distributions coincide, i.e. when the joint distribution factorizes into the product of the marginals. As the KL divergence is defined via integrals that are (usually) intractable, we need to estimate it using other means.
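For discrete variables the integrals become finite sums, so the definition can be checked directly. The following sketch computes the mutual information of a small joint probability table as the KL divergence between the joint and the product of its marginals. The function name `mutual_information` and both example tables are made up for illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats for a discrete joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    p_a = joint.sum(axis=1, keepdims=True)   # marginal of A (rows)
    p_b = joint.sum(axis=0, keepdims=True)   # marginal of B (columns)
    product = p_a * p_b                      # product of marginals
    mask = joint > 0                         # treat 0 * log(0) as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

independent = np.outer([0.3, 0.7], [0.6, 0.4])   # joint equals product of marginals
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])               # A determines B completely

print(mutual_information(independent))  # close to 0
print(mutual_information(dependent))    # log(2), about 0.693
```

In the continuous, high-dimensional case no such table is available, which is why an estimator is needed.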
Let E be our parameterized vector-valued function for extracting the independent sources, with E(X) = S′ as its estimate of the source vector S. To make the outputs statistically independent, we want to penalize high mutual information between each output and all the others. The work on mutual information neural estimators (MINEs) introduces a loss function for a neural network M to estimate a tight lower bound of the mutual information between some random variables:
$$I(s_i; s_{-i}) \ge \mathbb{E}_{p(s_i, s_{-i})}\left[M(s_i, s_{-i})\right] - \log \mathbb{E}_{p(s_i)\,p(s_{-i})}\left[e^{M(s_i, s_{-i})}\right]$$
The -i index denotes the vector of all sⱼ’s except for sᵢ. The left term is the expected value over the joint distribution and the right term is the expected value over the product of the marginals.
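A sample-based version of this kind of bound can be sketched as follows, assuming the Donsker-Varadhan form used by MINE (mean critic score on joint samples minus the log of the mean exponentiated score on marginal samples). Samples from the product of marginals are approximated by shuffling the "rest" part across the batch. The helper names and the fixed, untrained critic are assumptions for illustration only; in the actual method M is a trained neural network.

```python
import numpy as np

def split_component(S, i):
    """Return (s_i, s_-i) for a batch S of shape (n_samples, K)."""
    s_i = S[:, [i]]                    # component i, shape (n, 1)
    s_rest = np.delete(S, i, axis=1)   # all other components, shape (n, K-1)
    return s_i, s_rest

def mine_lower_bound(critic, s_i, s_rest, rng):
    """Donsker-Varadhan style estimate of a lower bound on I(s_i; s_-i)."""
    joint_scores = critic(s_i, s_rest)            # pairs from the joint
    perm = rng.permutation(len(s_rest))           # break the pairing
    marginal_scores = critic(s_i, s_rest[perm])   # approx. product of marginals
    return float(joint_scores.mean() - np.log(np.mean(np.exp(marginal_scores))))

rng = np.random.default_rng(0)
n = 5000
a = rng.standard_normal(n)
b = 0.8 * a + 0.6 * rng.standard_normal(n)   # correlated with a
c = rng.standard_normal(n)                   # independent of both
S = np.column_stack([a, b, c])

# Hypothetical hand-picked critic standing in for the network M.
critic = lambda s_i, s_rest: 0.5 * s_i[:, 0] * s_rest[:, 0]

s_i, s_rest = split_component(S, 0)
print(mine_lower_bound(critic, s_i, s_rest, rng))
```

Even this crude critic yields a positive value for the dependent pair; training M pushes the estimate up towards the true mutual information.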
M is trained to maximize this quantity, since a tighter bound means a more accurate estimate of the mutual information between the outputs of E. E, on the other hand, is trained to make this quantity small so that its outputs become independent components. Optimizing a system with such competing goals is a familiar problem, usually handled by alternating updates, in methods ranging from classic expectation-maximization to Goodfellow’s more recent generative adversarial networks.
Including a differentiable sphering layer was again crucial for this method to work. We hypothesize that this layer simplifies the computational problem, since statistically independent random variables are necessarily uncorrelated: once the outputs are decorrelated, the encoder only has to remove the remaining higher-order dependencies. As a proof of concept, we solved a humble blind source separation problem.
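A toy setup of this kind, with a sphering (whitening) step applied to the mixtures, might look as follows. This is not the experiment from the post: the sources, the mixing matrix and the ZCA-style `sphere` function are assumptions for illustration, written in plain NumPy rather than as a layer in an autodiff framework.

```python
import numpy as np

def sphere(X, eps=1e-8):
    """Whiten X of shape (n_samples, n_features): zero mean, identity covariance."""
    Xc = X - X.mean(axis=0)                   # center each feature
    cov = Xc.T @ Xc / len(Xc)                 # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # cov is symmetric
    # ZCA whitening matrix: cov^(-1/2), with eps guarding tiny eigenvalues.
    whitening = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ whitening

rng = np.random.default_rng(0)
sources = rng.uniform(-1.0, 1.0, size=(2000, 2))   # independent toy sources
mixing = np.array([[1.0, 0.5],
                   [0.3, 2.0]])                    # arbitrary mixing matrix
mixtures = sources @ mixing.T                      # correlated observations

Z = sphere(mixtures)
print(np.round(Z.T @ Z / len(Z), 3))   # approximately the identity matrix
```

After sphering, the signals are uncorrelated but generally still dependent; what remains is a rotation that the encoder has to find.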
Training two networks, one for generating solution components and one for estimating the mutual information between them, works well. Each training epoch of the encoder E is followed by seven training epochs of M. Estimating the exact mutual information is not essential, so a few iterations suffice for a good gradient direction.
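The alternating schedule can be sketched as a small driver loop. The function `alternate_training` and its callback arguments are hypothetical placeholders; the real epoch functions would update E and M by gradient descent and ascent on the bound, respectively.

```python
def alternate_training(train_encoder_epoch, train_estimator_epoch,
                       n_rounds, estimator_epochs_per_round=7):
    """Run one encoder epoch, then several estimator epochs, per round."""
    for _ in range(n_rounds):
        train_encoder_epoch()                  # E: make the bound small
        for _ in range(estimator_epochs_per_round):
            train_estimator_epoch()            # M: make the bound tight

# Usage with dummy callbacks that only record the order of calls.
log = []
alternate_training(lambda: log.append("E"), lambda: log.append("M"), n_rounds=2)
print("".join(log))   # EMMMMMMMEMMMMMMM
```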
In practice, we make K copies of the estimator function, where K is the number of components, and have each copy handle the estimation of the mutual information between one component and the rest. We found that this works better than training K separate estimators, one for each component-rest pair. The benefit of sharing weights in this manner suggests that feature re-use between the different estimators is valuable.
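One possible reading of this weight sharing is a single set of critic parameters applied to all K component-rest splits, with the K bounds summed into one dependence score. The sketch below follows that reading; the tiny one-hidden-layer critic, its parameter shapes and the function names are assumptions for illustration, not the architecture from the paper.

```python
import numpy as np

def shared_critic(params, s_i, s_rest):
    """One-hidden-layer critic; the same params serve every split."""
    w1, b1, w2 = params
    inputs = np.concatenate([s_i, s_rest], axis=1)   # shape (n, K)
    hidden = np.tanh(inputs @ w1 + b1)               # shape (n, H)
    return hidden @ w2                               # shape (n,)

def total_dependence_bound(params, S, rng):
    """Sum of Donsker-Varadhan style bounds over all K component-rest splits."""
    n, K = S.shape
    total = 0.0
    for i in range(K):
        s_i = S[:, [i]]
        s_rest = np.delete(S, i, axis=1)
        joint = shared_critic(params, s_i, s_rest)
        shuffled = s_rest[rng.permutation(n)]        # approx. product of marginals
        marginal = shared_critic(params, s_i, shuffled)
        total += joint.mean() - np.log(np.mean(np.exp(marginal)))
    return float(total)

rng = np.random.default_rng(0)
K, H = 3, 8                                          # components, hidden units
S = rng.standard_normal((1000, K))                   # stand-in for encoder outputs
params = (0.1 * rng.standard_normal((K, H)),         # w1
          np.zeros(H),                               # b1
          0.1 * rng.standard_normal(H))              # w2
print(total_dependence_bound(params, S, rng))
```

Under this reading, M would be trained to increase the summed score and E to decrease it.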
Learning Gradient-Based ICA by Neurally Estimating Mutual Information