## Friday, September 22, 2017

### Mutual information and information equilibrium

Natalie Wolchover has a nice article on the information bottleneck and how it might relate to why deep learning works. Originally described in this 2000 paper by Tishby et al., the information bottleneck is a variational principle that minimizes a functional of mutual information, with $\beta$ playing the role of a Lagrange multiplier (I'll use their notation):

$$\mathcal{L}[p(\tilde{x} | x)] = I(\tilde{X} ; X) - \beta I(\tilde{X} ; Y)$$

The interpretation is that we pass the information the random variable $X$ has about $Y$ (i.e. the mutual information $I(X ; Y)$) through the "bottleneck" $\tilde{X}$.
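To make the functional concrete, here is a minimal sketch that evaluates $\mathcal{L}$ for a given encoder $p(\tilde{x} | x)$ on a small discrete example. The joint distribution `p_xy`, the three encoders, and the value of `beta` are made-up illustrations, not anything from the Tishby et al. paper, and the code only evaluates the functional rather than optimizing it.

```python
import math

def mutual_information(joint):
    """I(A;B) in bits for joint[a][b] = p(a, b)."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )

def ib_functional(p_xy, encoder, beta):
    """L = I(X~;X) - beta * I(X~;Y), with encoder[x][t] = p(t | x)."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(encoder[0])
    px = [sum(row) for row in p_xy]
    # Joint of bottleneck and source: p(t, x) = p(x) p(t | x)
    p_tx = [[px[x] * encoder[x][t] for x in range(n_x)] for t in range(n_t)]
    # Joint of bottleneck and target, assuming X~ depends on Y only through X:
    # p(t, y) = sum_x p(x, y) p(t | x)
    p_ty = [
        [sum(p_xy[x][y] * encoder[x][t] for x in range(n_x)) for y in range(n_y)]
        for t in range(n_t)
    ]
    return mutual_information(p_tx) - beta * mutual_information(p_ty)

# Hypothetical toy joint: x = 0, 1 mostly go with y = 0; x = 2, 3 mostly with y = 1.
p_xy = [[0.2, 0.05], [0.2, 0.05], [0.05, 0.2], [0.05, 0.2]]
beta = 5.0  # arbitrary illustrative value

identity = [[1.0 if t == x else 0.0 for t in range(4)] for x in range(4)]  # no compression
merged = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # groups x by what it says about y
constant = [[1.0]] * 4  # throws everything away

for name, enc in [("identity", identity), ("merged", merged), ("constant", constant)]:
    print(name, round(ib_functional(p_xy, enc, beta), 4))
```

In this toy case the merged encoder keeps everything $X$ knows about $Y$ while halving $I(\tilde{X} ; X)$, so it scores lower (better) than the identity encoder for any $\beta$.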

Now I've been interested in the connection between information equilibrium, economics, and machine learning (e.g. a GAN's real data, generated data, and discriminator have a formal similarity to information equilibrium's information source, information destination, and detector; I use the latter on this blog as a possible way of understanding demand, supply, and price). I'm always on the lookout for connections to information equilibrium. This is a work in progress, but I first thought it might be valuable to understand information equilibrium in terms of mutual information.

The best way to illustrate this is with a Venn diagram:

If we have two random variables $X$ and $Y$, then information equilibrium is the condition that:

$$H(X) = H(Y)$$

Without loss of generality, we can identify $X$ as the information source (effectively a sign convention) and say in general:

$$H(X) \geq H(Y)$$

Mutual information is maximized when $Y = f(X)$: since $I(X ; Y) = H(Y) - H(Y | X)$ and $H(Y | X) = 0$ when $Y$ is a function of $X$, we get $I(X ; Y) = H(Y)$. The diagram above represents a "noisy" case where noise (or another random variable) contributes to $H(Y)$ (i.e. $Y = f(X) + n$). Mutual information cannot be greater than the information in either $X$ or $Y$, i.e. $I(X ; Y) \leq \min(H(X), H(Y))$. And if we assert a simple case of information equilibrium (with information transfer index $k = 1$), e.g.:
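As a quick numerical illustration of the deterministic versus noisy cases, the sketch below compares $I(X ; Y)$ for $Y = f(X)$ and for a noisy version of the same map. The uniform four-state source, the parity function, and the 10% error rate are assumptions chosen for illustration, and the noise here is modeled as a random symbol substitution rather than literal additive noise.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def mutual_information(joint):
    """I(X;Y) in bits for joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, p in enumerate(row)
        if p > 0
    )

def channel_joint(px, f, n_y, eps):
    """Joint of X and Y where Y = f(X) with probability 1 - eps,
    and otherwise Y is uniform over the other n_y - 1 symbols."""
    joint = []
    for x, p in enumerate(px):
        row = [p * eps / (n_y - 1)] * n_y
        row[f(x)] = p * (1 - eps)
        joint.append(row)
    return joint

def parity(x):
    """Hypothetical deterministic map f: {0, 1, 2, 3} -> {0, 1}."""
    return x % 2

px = [0.25] * 4
clean = channel_joint(px, parity, 2, 0.0)  # Y = f(X)
noisy = channel_joint(px, parity, 2, 0.1)  # Y = f(X) corrupted 10% of the time

for name, joint in [("deterministic", clean), ("noisy", noisy)]:
    h_x = entropy([sum(row) for row in joint])
    h_y = entropy([sum(col) for col in zip(*joint)])
    i_xy = mutual_information(joint)
    print(f"{name}: I = {i_xy:.3f}, H(X) = {h_x:.3f}, H(Y) = {h_y:.3f}")
```

In both cases the mutual information stays below $\min(H(X), H(Y))$, and only the deterministic map reaches $H(Y)$.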

$$p_{xy} = p_{x}\delta_{xy} = p_{y}\delta_{xy}$$

then

\begin{align}
I(X ; Y) & = \sum_{x} \sum_{y} p_{xy} \log \frac{p_{xy}}{p_{x}p_{y}}\\
& = \sum_{x} \sum_{y} p_{x} \delta_{xy} \log \frac{p_{x} \delta_{xy} }{p_{x}p_{y}}\\
& = \sum_{x} \sum_{y} p_{x} \delta_{xy} \log \frac{\delta_{xy} }{p_{y}}\\
& = \sum_{x} p_{x} \log \frac{1}{p_{x}}\\
& = -\sum_{x} p_{x} \log p_{x}\\
& = H(X)
\end{align}

Note that in the above, the information transfer index accounts for the "mismatch" in dimensionality in the Kronecker delta (e.g. a die roll that determines the outcome of a coin flip such that a roll of 1, 2, or 3 yields heads and 4, 5, or 6 yields tails).
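The derivation and the die/coin example can be checked numerically. This is a small sketch: the three-state distribution `px` is an arbitrary choice, and the die/coin joint simply encodes the rule described above.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def mutual_information(joint):
    """I(X;Y) in bits for joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, p in enumerate(row)
        if p > 0
    )

# Case 1: p_xy = p_x * delta_xy, with an arbitrary (assumed) p_x.
px = [0.5, 0.3, 0.2]
diagonal = [[p if x == y else 0.0 for y in range(len(px))] for x, p in enumerate(px)]
i_diag = mutual_information(diagonal)
print("diagonal: I(X;Y) =", i_diag, " H(X) =", entropy(px))

# Case 2: a fair die determines a coin; faces 1-3 give heads, 4-6 give tails.
die = [1 / 6] * 6
die_coin = [[p, 0.0] if face < 3 else [0.0, p] for face, p in enumerate(die)]
i_die_coin = mutual_information(die_coin)
print("die/coin: I(X;Y) =", i_die_coin, " H(die) =", entropy(die), " H(coin) = 1 bit")
```

In the diagonal case the mutual information equals $H(X)$, matching the derivation. In the die/coin case it equals the coin's entropy of one bit, while the die carries $\log_2 6$ bits, which is the kind of mismatch the note above refers to.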

Basically, information equilibrium is the case where $H(X)$ and $H(Y)$ overlap completely in the diagram, $Y = f(X)$, and mutual information is maximized.