A bunch of things happened that made me think about big data.
- A dinner conversation yesterday brought up NSF's general emphasis on data-intensive computing and specific examples like the center on Foundations of Data and Visual Analytics at GaTech or supporting infrastructure for Hadoop/MapReduce/Azure in academia. The following are given: there is a lot of data, a lot of researchers want to analyze it, there are real performance bottlenecks in the analyses they need, and we are in a position to throw processors at them. But IMHO, the CS research community has not yet abstracted clean models (for building, for analyzing) that will truly address this state of the world. We need a Valiant-esque insight and effort here, something that threads between the H/M/A approach for special tasks, PRAM/BSP for general parallel computing, and the currently popular multicores. The traditional appreciation of the costs of moving data and computing state between processors, or of synchronicity, doesn't hold, and one has to endogenize the fact that reading from a remote processor's main memory is cheaper than reading local disk given the communication infrastructure in data centers.
- Then there is the practice of handling large data. There is a new DHS center for Advanced Data Analysis (CCIADA) at Rutgers U. I am the Director of Data Research. Many organizations across the country have various datasets with complex rules for access, for use, and, at a higher level, for what can be inferred from them. IMHO, the research community is far from formulating a model for working with such data constellations and their data mapping, provenance, and trust issues: a model that will support some algebra on top of the data and instinctively automate data handling issues. This is a big bottleneck for research to flourish.
- Finally, Gmail ads are an interesting beast. Sometimes I am unaware of them, sometimes I find them entertaining as I try to figure out what drives ad systems to map my emails to the specific ads they show, and once in a while, an ad sneaks up on me and I follow the lead. This morning I saw an ad for the Big Data Summit (last year's here). Like other industry meetings on this topic, this meeting too seems to bring together the right players who want to solve the problems, but I am not sure the industry has novel insights into the big problems here. Apologies for speaking without going to the summit.