I am in a boot camp, but this is at
Simons, Berkeley, so more sandals than shoes.
Dick Karp and Alistair Sinclair started off the
Big Data program this Fall by setting the meeting in context, both the broader one in terms of Simons' effort to support theory (Dick) and in narrower context of the Big Data program of bringing researchers from many different areas together via series of tutorials during this
boot camp (Alistair). Michael Jordan, while being on a sabbatical across the Atlantic, managed to pull off organizing not only this boot camp, but also the entire program. Amazing! He has pulled together an enormous trove of talented researchers and everyone I talked to was pumped up and felt inspired to get something done this semester.
The scientific program started with Michael Jordan's two parts talk, one a breezy overview of Stats,CS and the interface, and the other, a dive into couple of results. This was phenomenal. Deep insights in simple sentences.
- Big data he said was data at great granularity: say, data about lot of people, with models needed per person. While with more of traditional resources (time/space/energy) one can do more in CS, it is not clear more data is necessarily better --- because more columns means more hypotheses --- if one is more ambitious than getting only the mean. He contrasted the confirmatory (model and fit) vs exploratory (collect and mine) approach to Big Data. He stated the ultimate goal: given inferential goal and computational bound, provide a procedure with guarantees, quality of inference need to increase monotonically as data accrue. It is not even clear such procedures exist across goals. He then contrasted frequentist (fixed model parameter, average over data) vs Bayesian (fixed data, average over model parameter) approaches and discussed sampling, consistency, convergence rates, etc.
- In the second half, he spoke about computing vs statistics tradeoff that alas, I sorta spaced out, and then he described his recent work I have followed for a year, on Bag of Bootstraps (BoB) which is very interesting. BoB looks at statistical problems with a large support and is a very general estimation procedure that breaks the problem down into a small number of problems with \sqrt{n} like support in 2 levels. There are a bunch of TCS type research problems of interest to me re. BoB that I hope to follow up.
The audience questions included, role of dimensions d vs data size n, lower bounds for estimators for risk/bias/variance, etc.
Joel Tropp, in a superhuman effort, spoke for 3 hours in one afternoon. He is a deliberate and precise speaker. He started with drawing the large arc of random matrices, from early days to now. Random matrix analyses are hard. He presented two simple, abstract principles (break random matrix into sum of independent ones and use exponential tailbounds on spectral properties) that he posited will be sufficient to do many of the random matrix analyses, if not to the finest bounds, good enough and more doable. He gave a few examples and then spent the rest of the lecture presenting matrix versions of bunch of inequalities that we know from non-matrix context (Bernstein, Markov, etc): these get remarkably subtle in form and proof. While many of us in TCS have worked with inequalities like this for specific problems we wished to solve and prove them as needed in the form we like, Joel has done a splendid job of abstracting these into well framed tools and polished/refined/applied them. Focusing only on the expository potential of this work, this is akin to what Spencer/Alon book/lecture notes does for the probabilistic method with the traditional tools. You can check out the video to see his highly polished, superb performance.
ps: It was good to see
Josh Bloom again. He is a man unconsciously full of remarkable quotes. He said, during our chat about the community of astronomers, "Astronomy is tiny".