Monday, June 24, 2013

PODS 13 Big Data Panel

Dan Suciu and Chris Re ran a panel on Big Data at PODS yesterday. Their premise was that everyone from industry to government needs Big Data, that many CS research communities are already engaged, and that the PODS community needs to get involved and develop the theoretical underpinnings.

  • Joe Hellerstein went first and talked about the core database perspective on Big Data, including synchronicity, distributed computing platforms, datalog specifications, and separating computing and communication from declarative data management. He also spoke about his CALM conjecture in the context of consistency. Joe referred to himself as "really junior", and was as agile as you know him, mentally and physically, throughout the panel.
  • Carlos Guestrin went next and spoke about machine learning at scale, in particular graphical models and learning. He spoke about GraphLab and graph-based machine learning methods, and presented impressive results on running triangle counting for large graphs. He also emphasized the vertex-centric view of computation for graphical machine learning and wondered about formulating the precise power of this approach in a logic language.
  • Sergei Vassilvitskii went next and spoke about the algorithmic perspective. While we have a bag of algorithmic tools for sequential algorithms (greedy, dynamic programming, LP rounding, etc.), we don't have significant tools for mapreduce++ algorithms. He posed coresets as a potential tool for the divide and conquer one needs with distributed machines. He spoke about applications in set cover, k-means and so on. Sergei was metaphorical, often referring to a donut and a cup (as the same thing), and I made a mental note to look for Krispy Kreme after the panel.
  • Jeff Ullman followed and spoke about certain basic mapreduce algorithms. He discussed in detail the result in VLDB13 on one-round mapreduce with tradeoffs between reducer size and replication rate for the Hamming distance problem, and left open the problem(s) for a larger number of rounds. The lower bound in the main result is reminiscent of the comparison-based sorting lower bound.
  • Andrew McCallum followed and spoke about information extraction from a large corpus of academic research papers. One needs deep techniques to resolve the many conceptual problems that arise, from text understanding to entity resolution, so the talk pointed to a variety of challenges in reasoning with probability and uncertainty, conditional random fields, etc. He also discussed their experiments with the peer review process, as another front on improving the progress of science by focusing on the scientific community.
  • I went last and mainly spoke about how Big Data is different from Massive Data because Big Data seems to deal with people. So, we need to accept that data is generated by strategic agents and that query results are consumed by strategic agents, which may ultimately affect whether the database will get more, and better-quality, data or not. We should draw a circle around data to include these aspects, i.e., privacy, economics and game theory. Further, instead of a BDDB that is general purpose, we could focus on Big Purpose databases.
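
To make the reducer size versus replication rate tradeoff from Jeff's talk a bit more concrete, here is a minimal sketch of one point on that curve for the Hamming distance 1 problem. It is only an illustration in plain Python, not the construction from the VLDB13 paper or from the talk: the function name `hamming1_pairs_split` and the in-memory dictionary standing in for reducers are assumptions made for this example. The idea is that two bit strings at distance 1 must agree on their left half or on their right half, so sending each string to one reducer per half is enough for every such pair to meet.

```python
from collections import defaultdict
from itertools import combinations, product


def hamming1_pairs_split(strings, b):
    """Simulate one map-reduce round that finds all pairs of b-bit
    strings at Hamming distance exactly 1 (hypothetical helper)."""
    half = b // 2
    reducers = defaultdict(list)

    # Map phase: every string goes to exactly two reducers,
    # so the replication rate is 2.
    for s in strings:
        reducers[("left", s[:half])].append(s)   # strings sharing the left half
        reducers[("right", s[half:])].append(s)  # strings sharing the right half

    # Reduce phase: only strings on the same reducer are compared.
    pairs = set()
    for group in reducers.values():
        for u, v in combinations(group, 2):
            if sum(x != y for x, y in zip(u, v)) == 1:
                pairs.add((min(u, v), max(u, v)))

    # Largest number of inputs any single reducer received.
    max_reducer_size = max((len(g) for g in reducers.values()), default=0)
    return pairs, max_reducer_size


if __name__ == "__main__":
    all_strings = ["".join(bits) for bits in product("01", repeat=4)]
    found, size = hamming1_pairs_split(all_strings, 4)
    print(len(found), size)
```

With all 2^b strings present, each reducer in this sketch sees 2^(b/2) strings while every input is sent twice. The other extreme, shipping everything to a single reducer, has replication rate 1 but reducer size 2^b, which is the kind of tension the tradeoff is about.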
The talks were followed by a lively panel discussion. Tina Eliassi-Rad asked if we consider generative models of data, and Joe pointed to his 10-year-old paper that does it for the case of acquiring sensor data. Christoph asked Carlos about the relative emphasis on ML vs DB in ML and DB conferences, and Carlos said each community needs to be more like the other. Christoph also pointed out that DB folks may have to become knowledgeable about a lot of other areas in order to deal with Big Data problems. I asked the audience to think about whether the venture funds go towards Big Data or Big Data Applications. C. Mohan mentioned that, judging from a recent Facebook meeting, there seemed to be significant VC activity in this area. Carlos and Joe are embarking on their adventures, thanks to VCs.

It is always great to see the audience at a database conference, with Ron Fagin and Mihalis Yannakakis in the front row.


Anonymous said...

Nice comments Muthu, and overly generous as usual. I'm not sure you did justice to your own words. To distill two key points I took from your bit:

- There is now a user expectation that the database is not accurate, which makes prior work on approximate queries much easier to adopt now.

- More intriguing yet, there is a user expectation that the database is being gamed -- that the data being inserted is there to influence the outputs, and therefore the process generating outputs influences what gets put in.

I can't remember your exact words, but there was a point that the database is an organism that needs to evolve to survive.

And then you asked the big question: how big of a purview should the technical community (PODS in this case) take on? Just to answer individual queries? To account for the feedback loop?

Good stuff.

8:52 AM  
Anonymous said...

Dear JMH,

Thanks for reminding me, yes, I am pretty excited by all these possibilities for Big Data seen from DB perspective...

- Metoo

6:39 AM  
Meena said...

I'm far removed from the database world but I do work with very high dimensional sparse brain datasets. One of the distinctions being made between big data and massive data is that the latter is static. If the data I analyze changes to include a time series I suspect it would still be massive data though (I'm thinking of structural MRI data and fMRI data if you are familiar with these). So is Big Data something that also implies a degree of uncertainty?

4:35 AM  
