I went through a phase in college when I had a preference for looking at problems through the linear-algebra lens. In the past two years, I have been looking at data that has a nice matrix form where rows are

subjects and columns are

attributes. The matrix is large and I need a

summary in order to understand it. The typical way to do this is to get a low-rank approximation which will give a few "directions" for an analyst to peer at, but unfortunately, these directions are linear combinations of the column (or row) space and typically, it is hard to understand intuitively what they represent. It will be incredibly useful if one could find

original columns or rows that summarized the matrix.

In the past few months,

Petros Drineas,

Michael Mahoney and I have worked on formulating this problem and solving it. Given matrix A, we are able to quickly find poly-k columns C (or rows or both) such that a matrix A_C obtained from these columns is a (1+\eps) approximation to best k-rank matrix A_k for A. Formally, ||A-A_C|| <= (1+\eps)||A-A_k||. The columns C then is a useful set of attributes for an analyst to understand and interpret. The technical result is

here from November 2005. ||.|| is the Frobenius norm.

There is a somewhat related problem in theory community: how to approximate A_k? The A_k that minimizes ||A-A_k|| is given immediately by SVD. In theory,

Alan Frieze, Ravi Kannan and Santosh Vempala asked if this could be approximated in a few passes. Since then, a lot of published work has shown how to get a matrix representation A' of A such that ||A-A'||<= ||A-A_k|| +\eps ||A||. This large additive error is discouraging and the question that remained open was, are there relative error approximations in few passes or in time less than it takes to solve the SVD? My result with Petros and Michael above provides such a matrix representation, but takes time to find largest k exact or approximate left-singular values, which can be done in O(k) passes using methods in Numerical Analysis. I mentioned my result above to Sariel Har-Peled at FSTTCS 2006 in India; at that time, he promised he can produce an A' such that ||A-A'||<= (1+\eps) ||A-A_k||. He delivered: his writeup is

here. This does not seem to address the problem Petros, Michael and I worked on since it does not output original columns or rows, but produces linear combinations. Quite recently,

Amit Deshpande and Santosh Vempala obtained a similar result but focused on using a small number -- O(klog k) -- passes, in some sense culminating a long line of research on low rank approximation using a small number of passes. It seems we can derive another solution to our problem from their result.

Summary: We are in a better place than we were a year ago when only additive approximations were known for low-rank matrix summaries. Both problems now have relative error approximations in near-linear time. And the fun part is, each of these results use a different technique. Bonanza! Those who are interested should look at these papers since there are many details.