PODS 13 Big Data Panel
Dan Suciu and Chris Re ran a panel at PODS yesterday on Big Data. Their premise: there is a need for Big Data everywhere from industry to government, many CS research communities are already engaged, and the PODS community needs to get involved and develop the theoretical underpinnings.
It is always great to see the audience at a database conference: Ron Fagin and Mihalis Yannakakis in the front row.
- Joe Hellerstein went first and talked about the core database perspective on Big Data, including synchronicity, distributed computing platforms, Datalog specifications, and separating computation and communication from declarative data management. He also spoke about his CALM conjecture in the context of consistency (a toy illustration of the monotonicity intuition behind it appears after the list). Joe referred to himself as "(really) junior", and was as agile as you know him, mentally and physically, throughout the panel.
- Carlos Guestrin went next and spoke about machine learning at scale, in particular graphical models and learning, and about GraphLab, a framework for graph-based machine learning. He presented impressive results on triangle counting for large graphs. He also emphasized the vertex-centric view of computation for graphical machine learning (a sketch of this style follows the list) and wondered about formulating the precise power of this approach in a logic language.
- Sergei Vassilvitskii went next and spoke about the algorithmic perspective. While we have a bag of algorithmic tools for sequential algorithms (greedy, dynamic programming, LP rounding, etc.), we don't have comparable tools for MapReduce++ algorithms. He posed coresets as a potential tool for the divide and conquer one needs with distributed machines, and spoke about applications to set cover, k-means, and so on (see the sketch after the list). Sergei was metaphorical, often referring to a donut and a cup (as the same thing), and I made a mental note to look for Krispy Kreme after the panel.
- Jeff Ullman followed and spoke about certain basic MapReduce algorithms. He discussed in detail the VLDB 2013 result on one-round MapReduce, with tradeoffs between reducer size and replication rate for the Hamming distance problem (the shape of the bound is recalled after the list), and left open the problem(s) for larger numbers of rounds. The lower bound in the main result is reminiscent of the comparison-based sorting lower bound.
- Andrew McCallum followed and spoke about information extraction from a large corpus of academic research papers. One needs deep techniques to resolve the many conceptual problems that arise, from text understanding to entity resolution, so the talk pointed to a variety of challenges in reasoning with probability and uncertainty, conditional random fields, etc. He also discussed their experiments with the peer review process, another front for improving the progress of science by focusing on the scientific community.
- I went last and mainly spoke about how Big Data is different from Massive Data: Big Data seems to deal with people. So we need to accept that data is generated by strategic agents and query results are consumed by strategic agents, which may ultimately affect whether or not the database gets more, quality data; we should draw a circle around data that includes these aspects, i.e., privacy, economics, and game theory. Further, instead of a BDDB that is general purpose, we could focus on Big Purpose databases.
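A toy illustration of the monotonicity intuition behind CALM, as I understand it: a monotonic computation, like accumulating a set union, reaches the same final answer no matter how the incoming batches are ordered, so it needs no coordination. Plain Python, my own illustration:

```python
import itertools

# Monotonic computation: accumulating a set union. Every arrival order of
# the message batches yields the same final state, so no coordination is
# needed -- the intuition behind CALM. Toy example, my own names.
batches = [{1, 2}, {3}, {2, 4}]
finals = set()
for order in itertools.permutations(batches):
    state = set()
    for batch in order:
        state |= batch           # the union only grows: monotone
    finals.add(frozenset(state))
assert len(finals) == 1          # one answer, regardless of delivery order
```

A non-monotonic step (say, acting on a count before all batches have arrived) breaks this order-insensitivity, and that is where coordination enters.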
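On the vertex-centric view Carlos emphasized, here is a toy sketch of triangle counting in that style, where each vertex reasons only about its own neighborhood; plain Python over adjacency sets, my own illustration and not the GraphLab API:

```python
# Toy sketch of vertex-centric triangle counting: each vertex works only
# with its own neighborhood, which is what makes the pattern easy to
# distribute. Plain Python, not the GraphLab API.

def count_triangles(adj):
    """adj: dict mapping vertex -> set of neighbors (undirected graph)."""
    total = 0
    for v, nbrs in adj.items():
        for u in nbrs:
            if u > v:
                # each triangle {v, u, w} with v < u < w is discovered
                # exactly once, at edge (v, u), by intersecting neighborhoods
                total += sum(1 for w in nbrs & adj[u] if w > u)
    return total

# A 4-clique has exactly 4 triangles.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
assert count_triangles(k4) == 4
```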
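On Sergei's coresets, a deliberately crude sketch of the divide and conquer: summarize each shard of the data by a small weighted point set, merge the summaries, and solve on the merge. The farthest-point summarizer below is a stand-in I chose for brevity; real coreset constructions come with (1 ± eps) guarantees on the clustering cost.

```python
import numpy as np

def summarize(points, m):
    """Summarize a shard by m representatives (a greedy farthest-point
    pass), each weighted by the number of input points it absorbs."""
    reps = [points[0]]
    for _ in range(m - 1):
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps],
                       axis=0)
        reps.append(points[int(np.argmax(dists))])
    reps = np.array(reps)
    owner = np.argmin(
        np.stack([np.linalg.norm(points - r, axis=1) for r in reps]), axis=0)
    return reps, np.bincount(owner, minlength=m)

# Divide: summarize each shard independently (the parallel step) ...
rng = np.random.default_rng(0)
shards = [rng.normal(loc=c, size=(1000, 2)) for c in ([0, 0], [10, 0], [0, 10])]
summaries = [summarize(s, 20) for s in shards]

# ... conquer: cluster the small merged weighted summary, not the full data.
merged_points = np.vstack([p for p, _ in summaries])
merged_weights = np.concatenate([w for _, w in summaries])
```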
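On Jeff's tradeoff, as I recall the statement from the VLDB 2013 paper (finding all pairs of b-bit strings at Hamming distance 1; see the paper for the precise form): with reducer size q, the maximum number of inputs a reducer may receive, and replication rate r, the average number of reducers each input is sent to,

$$ r \;\geq\; \frac{b}{\log_2 q}. $$

Small reducers force high replication, hence communication; the proof counts how many output pairs a size-q reducer can cover, which is the resemblance to the comparison-based sorting lower bound.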
3 Comments:
Nice comments, Muthu, and overly generous as usual. I'm not sure you did justice to your own words. To distill two key points I took from your bit:
- There is now a user expectation that the database is not accurate, which makes prior work on approximate queries much easier to adopt.
- More intriguing yet, there is a user expectation that the database is being gamed -- that the data being inserted is there to influence the outputs, and therefore the process generating outputs influences what gets put in.
I can't remember your exact words, but there was a point that the database is an organism that needs to evolve to survive.
And then you asked the big question: how big of a purview should the technical community (PODS in this case) take on? Just to answer individual queries? To account for the feedback loop?
Good stuff.
Dear JMH,
Thanks for reminding me, yes, I am pretty excited by all these possibilities for Big Data seen from DB perspective...
- Metoo
I'm far removed from the database world, but I do work with very high-dimensional sparse brain datasets. One of the distinctions being made between big data and massive data is that the latter is static. If the data I analyze changed to include a time series, I suspect it would still be massive data, though (I'm thinking of structural MRI data and fMRI data, if you are familiar with these). So is Big Data something that also implies a degree of uncertainty?