Thursday, December 23, 2010

COMAD 2010

US theory conferences have their European counterparts, and lesser-known Indian counterparts. The same holds for database conferences: COMAD is the Indian version of SIGMOD. I attended for a day. It was great to see database researchers I know from universities and labs (IBM to Yahoo!). There were also companies representing the myriad parts of the database industry in India, including those working on government projects (e.g., national biometric IDs, demat stocks).

Divy Agrawal gave an excellent plenary talk (I missed the other plenary talks). Cloud makes computing a commodity. Divy chose not to discuss security, self-manageability, provisioning, etc., and focused on database aspects. Larry Ellison said the cloud is a computer connected to the network. There is a debate: is cloud new? Divy argued it is real. There are many companies providing cloud services: Azure, Eucalyptus (Indian govt cloud), Rackspace, Salesforce, Cloudera, Joyent, Elastra, Tata, etc. From EC2 to Azure, AppEngine and Force.com, they provide an increasing amount of management help.

Two arguments for the cloud stood out. First, resource utilization is variable and even unpredictable. The economics of Internet users is such that overprovisioning has some cost but underprovisioning has a huge cost (because users leave if not instantly gratified). Cloud offloads the risk of provisioning for the peak and makes sure one pays for the integral of usage, not the integral of peak usage. Second, cloud is useful for reducing time-to-market. For example, consider the Facebook generation of app developers like Animoto.com, which had to scale from 50 to 3500 servers in 2 days. Cloud provides for such spurts when needed.
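To make the "integral of usage versus integral of peak usage" point concrete, here is a small arithmetic sketch. The demand curve and the price are made-up numbers of mine (not from the talk), and the two function names are hypothetical; the point is only the shape of the comparison.

```python
def peak_provisioned_cost(demand, cents_per_server_hour):
    """Cost of owning enough servers for the peak during every hour."""
    return max(demand) * len(demand) * cents_per_server_hour


def pay_per_use_cost(demand, cents_per_server_hour):
    """Cost of paying only for the servers actually used each hour."""
    return sum(demand) * cents_per_server_hour


# Hypothetical servers needed per hour: a quiet start, a spike, then a decline.
demand = [50, 50, 400, 1200, 3500, 3500, 900, 100]
price = 10  # hypothetical price in cents per server-hour

print("provision for peak:", peak_provisioned_cost(demand, price), "cents")
print("pay per use:       ", pay_per_use_cost(demand, price), "cents")
```

With a spiky demand curve like this one, the peak-provisioned bill is several times the pay-per-use bill; with flat demand the two are equal.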

Database technology cannot simply be replicated and made scalable for the cloud. Instead, companies have built key-value stores such as HBase, Hypertable, Amazon's SimpleDB, BigTable, etc. These give scalability and elasticity, but limited consistency and flexibility, which has irked traditional database researchers. Application writers now have to take care of consistency themselves. For example, update user A to be a friend of user B and vice versa: should one handle it via exceptions, or via a persistent queue (and eventual consistency), or.... This gets more complicated. Eventual consistency increases cost because it increases programming complexity.
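The friendship example can be sketched in a few lines. This is my own illustration, not anything shown in the talk: a dictionary stands in for the key-value store, an in-memory deque stands in for a persistent queue, and the names `add_friend` and `drain` are hypothetical.

```python
from collections import defaultdict, deque

friends = defaultdict(set)  # stand-in for a key-value store: user -> set of friends
pending = deque()           # stand-in for a persistent queue of deferred writes


def add_friend(a, b):
    """Write a's side now; defer b's side to the queue."""
    friends[a].add(b)       # single-key write, always possible
    pending.append((b, a))  # the reverse edge is applied later


def drain():
    """Apply deferred writes. Set insertion is idempotent, so retries are safe."""
    while pending:
        user, friend = pending.popleft()
        friends[user].add(friend)


add_friend("A", "B")
print("B lists A before drain:", "A" in friends["B"])  # the inconsistent window
drain()
print("B lists A after drain: ", "A" in friends["B"])
```

Between `add_friend` and `drain` the two users disagree about their friendship, and it is the application writer who has to reason about that window.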

There are some general principles of the cloud from the data point of view: (a) use distributed consensus to get implicit, hard consistency on system metadata, and leave application state out; (b) limit application interactions to a single node, obviating the need for distributed synchronization; and (c) decouple ownership of data from data storage, so one can do lightweight transfer of data and control.

Divy gave three examples of his projects at UCSB addressing data management issues in the cloud.
  • Data Fusion. GStore: efficient transactional multikey access. This provides atomic multikey access for applications such as online multiplayer games, collaborative applications, Facebook friends, etc. There are some commonalities with Google AppEngine, which allows MegaStore for multikey atomic access. The challenge they solve: when grouping multiple keys and taking ownership of them, there could be failures. He showed experimental results on EC2, with groups of size in the 10's or 100's. Paper at ACM SoCC [Symp on Cloud Computing].
  • Data Fission. Make RDBMS cloud friendly. Idea: view the database as a collection of partitions.
  • Elasticity: Missing notes.
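To fix ideas on the key-grouping challenge in the first bullet, here is a toy single-process sketch. It is my own simplification and not the GStore protocol: the class `Cluster`, its methods, and the rollback rule are all hypothetical, and real failures (crashed nodes, lost messages) are reduced to "a key is already owned by another group".

```python
class KeyGroupError(Exception):
    """Raised when a group cannot be formed or used."""


class Cluster:
    def __init__(self, keys):
        self.values = {k: 0 for k in keys}       # key -> value
        self.group_of = {k: None for k in keys}  # key -> owning group, or None

    def create_group(self, group_id, keys):
        """Take ownership of every key, or of none of them."""
        acquired = []
        for k in keys:
            if self.group_of[k] is not None:
                for a in acquired:               # roll back partial ownership
                    self.group_of[a] = None
                raise KeyGroupError(f"{k} already belongs to {self.group_of[k]}")
            self.group_of[k] = group_id
            acquired.append(k)

    def transact(self, group_id, updates):
        """Update several keys together; allowed only if the group owns them all."""
        if any(self.group_of[k] != group_id for k in updates):
            raise KeyGroupError("update touches keys outside the group")
        self.values.update(updates)

    def delete_group(self, group_id):
        """Release ownership of all keys held by the group."""
        for k, g in self.group_of.items():
            if g == group_id:
                self.group_of[k] = None


game = Cluster(["player1", "player2", "player3"])
game.create_group("match42", ["player1", "player2"])
game.transact("match42", {"player1": 10, "player2": -10})
print(game.values)
```

Once a group is formed, a multikey update only needs one owner to agree, which is the appeal; the hard part in a real system is making group creation and deletion safe when nodes fail midway.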
Concluding, Divy emphasized that cloud is a great leveler: 3 engineers in India can make a difference. Looking ahead, he asked: what if computing needs multiple data centers, or the network edge needs to be integrated beyond content caching (i.e., a hierarchy of centers)?

Q: What about security of cloud infrastructure? It is the age of cyber wars. A: Replication helps. Threats always exist; we have to mitigate them.

Q: What about distributed objects, i.e., states existing across multiple machines? A: For the future.

Q: What are the connections to scientific computing? A: Cloud is successful focusing on fine-grained parallelism. We don't have good abstractions for scientific computing. In some sense, cloud is an evolution of grid computing. Grid did not have virtualization; this is solved in the cloud.

Q: Is the notion of data independence important? A: We are preaching to people who say consistency is not important; we have to start there.

Q: Why is virtualization not more central in the talk? A: Virtualization is not carefully studied in the database community (except JDBC). What does it mean for databases? (Virtualization has been studied in the processor world.)

Great discussions. Divy brought seriousness and reasonableness to discussions about the cloud.

Beyond the plenary talk, I gave my VLDB tutorial on Data Management Problems in Ad Auction Systems. The talk was in two parts: the first on the background math of auctions, the second on novel data mining problems. The first part may have turned off many; the second part had a smaller audience but generated serious discussions.

Finally, I liked a paper from TRDDC (see under publ here) where they proposed methods to tag entire sentences as being specific or general. The application was to analyze a survey about the company cafeteria filled out by employees. Some employees were terse, others verbose. This work had to analyze free text and, if I recall right, they used SentiWordNet. Nice study! Also, there was an interesting, detailed study from Yahoo! Bangalore on extracting attribute values (like the addresses of businesses) from a webpage.
