## Saturday, October 7, 2017

### Compressed sensing and the information bottleneck

For those that don't know, my day job is actually in signal processing research and development in the aerospace sector. As I document in my book, I came by economics research via a circuitous route. One subject I worked on for awhile (and still do to some extent) is called compressed sensing (Igor Carron's blog is a great way to keep up with the state of the art in that field, and his Google site provides a nice introduction to the subject).

One of the best parts about Igor's blog is that he brings together several lines of research from machine learning, matrix factorization, compressed sensing, and other fields and frequently finds connections between them (they sometimes appear in his regular feature "Sunday Morning Insight").

In that spirit — although more of  a Saturday Afternoon Insight — I thought I'd put a thought out there. I've been looking at how the price mechanism relates to the information bottleneck (here, here), but I've also mused about a possible connection between the price mechanism and compressed sensing. I think now there might be a connection between compressed sensing and the information bottleneck.

In compressed sensing, you are trying to measure a sparse signal (a signal that appears in only a sparse subset of your space x, like a point of light in a dark image or a single tone in a wide bandwidth). To do so, you set up your system to make measurements in what is called the dense domain — through some mechanism (Fourier transform, random linear combinations, labeled with Φ) you make the variable you wish to measure appear throughout the space y. Therefore a few random samples of the entire dense space give you information about your sparse signal, whereas a few random samples of an image with a single bright point would likely only return dark pixels with no information about the point.

Is this how the information bottleneck works? We have some domain X in which our signal is just a small part (the set of all images vs the set of images of cats), and we train a feedforward deep neural network (DNN, h1 h2 → ... hm) that creates a new domain Y where our signal information is dense (cat or no cat). Every sample of that domain tells us information about whether there is an image of a cat being fed into the DNN (i.e. if it identifies cats and dogs, a result of dog tells us it's not a cat).

In compressed sensing, we usually know the some properties about the signal that allow us to construct the dense domain (sparse images of points can be made dense by taking a 2D Fourier transform). However, random linear combinations can frequently function as a way to make your signal dense in your domain. In training a DNN, are we effectively constructing a useful random projection of the data in the sparse domain? As we push through the information bottleneck, are we compressing the relevant information into a dense domain?

The connection between compressed sensing and the structure of a neural net has been noted before (see e.g. here or here), the new part (for me at least) is the recognition of the information bottleneck as a useful tool to understand compressed sensing — "opening the black box" of compressed sensing.

...

Update 19 May 2019

This might also be connected:
Matrices of (approximate) low rank are pervasive in data science, appearing in recommender systems, movie preferences, topic models, medical records, and genomics. While there is a vast literature on how to exploit low rank structure in these datasets, there is less attention on explaining why the low rank structure appears in the first place. Here, we explain the effectiveness of low rank models in data science by considering a simple generative model for these matrices: we suppose that each row or column is associated to a (possibly high dimensional) bounded latent variable, and entries of the matrix are generated by applying a piecewise analytic function to these latent variables. These matrices are in general full rank. However, we show that we can approximate every entry of an m×n matrix drawn from this model to within a fixed absolute error by a low rank matrix whose rank grows as O(log(m+n)). Hence any sufficiently large matrix from such a latent variable model can be approximated, up to a small entrywise error, by a low rank matrix.
Paper link on arXiv — almost all large enough matrices are effectively low rank.

1. Wow, something I've heard of before! I actually wrote a SBIR Phase I proposal on this topic back in ... maybe 2008 or so? I didn't win... it was a shot in the dark because I had to read up on compressed sensing before writing the proposal. Lol. An RF design engineer helped me come up with the proposed system architecture.

It's funny because I thought the same thing reading your information bottleneck pieces earlier: "I wonder if this has something to do with that proposal I wrote years ago?" ... I'd forgotten the name "compressed sensing." So thanks for reminding me.

1. That was a big time for CS. I actually was in on a review of a Rice University project implementing a CS sensor and wrote a big report about CS while on my government fellowship.

Everybody seemed to want to use it incorrectly. CS is for situations where you either a) are already sampling the dense domain with few measurements, or b) a small reduction in samples is a big "utility" gain. The prime example is MRI (already sampling dense domain and a 50% reduction in collection time [usually hours] is a bid deal).

2. Regarding finding a sparse signal in noise, I also recently (a year ago?) worked on an implementation of a "sparse FFT" in FPGAs. I don't think the papers I based it on mentioned "compressed sensing" but it seemed very similar. The idea was to use much smaller FFTs to do the work of locating sparse bandlimited signals that you'd normally need to do a large FFT to find (in frequency).

1. One of the things that happened was that the "hype" of compressed sensing started to reach the "trough of disillusionment" and people started using different words -- one of which was sparse reconstruction. The sparse FFT from MIT is actually a compressed sensing approach.

Also, try to avoid the use of dollar signs as they interfere with my setup of mathjax. I left it set up that way because I think this is funny for an economics blog. You can use € or £ instead.