Thursday, December 23, 2010

Simulating Starbucks

Say Starbucks stands for the precise form of coffee you like. When you travel, how can you reproduce that? Starbucks is not found everywhere.

My Starbucks is 8--10 ounces of drip coffee in the morning, not very strong, nearly black. During travels, this is difficult to get and I have to improvise. In Italy, I have to order "due cafe americano, lungo". Each americano is espresso and a separate glass of hot water, lungo makes it more espresso, "due" makes it two, and when mixed in a large cup (if you can find one of those), this will be a reasonable substitute. In India, I am mixing coffees to simulate this. Two Nescafe cappuccinos, one with sugar, one without sugar, ... The experiment is still on.

Anecdote: I remember a while ago I was walking back from lunch with Dan Spielman and Ravi Kannan. I was engrossed in my discussion with Dan, and noticed after a while that Ravi was missing. Dan guessed that Ravi must have gone to his favorite coffee shop. We headed there, and indeed there was Ravi, having discovered the substitute for *his* Starbucks: a strong, South Indian, filter coffee!

COMAD 2010

US theory conferences have their European counterparts, and lesser-known Indian counterparts. Likewise with database conferences. COMAD is the Indian version of SIGMOD. I attended for a day. It was great to see database researchers I know from universities and labs (IBM to Yahoo!). Also, there were companies representing India's varied database industry, including those on Govt projects (e.g., national biometric IDs, demat stocks).

Divy Agrawal gave an excellent plenary talk (I missed the other plenary talks). Cloud makes computing a commodity. Divy chose not to discuss security, self-manageability, provisioning, etc., and focused on database aspects. Larry Ellison said cloud is a computer connected to the network. There is debate: is cloud new? Divy argued it is real. There are many companies providing cloud services: Azure, Eucalyptus (Indian govt cloud), Rackspace, Salesforce, Cloudera, Joyent, Elastra, Tata, etc. From EC2 to Azure and AppEngine, they provide an increasing amount of management help. Two arguments for cloud stood out. First, resource utilization is variable and even unpredictable. The economics of Internet users is such that overprovisioning has some cost but underprovisioning has a huge cost (because users leave if not instantly gratified). Cloud offloads the risk of provisioning for the peak and makes sure one pays for the integral of usage, not the integral of peak usage. Second, cloud is useful to reduce the time-to-market. For example, consider the Facebook generation of app developers, one of which had to scale from 50 to 3500 servers in 2 days. Cloud provides for such spurts when needed.
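To make the "integral of usage" point concrete, here is a small back-of-the-envelope sketch. The demand curve and the price are made up for illustration (only the 50 and 3500 server counts come from the talk), and it ignores the cost of underprovisioning entirely.

```python
def peak_provisioned_cost(demand, cents_per_server_hour):
    """Own a fixed fleet sized for the busiest hour and pay for it every hour."""
    return max(demand) * len(demand) * cents_per_server_hour


def pay_per_use_cost(demand, cents_per_server_hour):
    """Elastic fleet: pay only for servers in use each hour (the integral of usage)."""
    return sum(demand) * cents_per_server_hour


# Hypothetical day: 20 quiet hours at 50 servers, then a 4-hour spike at 3500.
demand = [50] * 20 + [3500] * 4
price = 10  # assumed price, in cents per server-hour

peak = peak_provisioned_cost(demand, price)
usage = pay_per_use_cost(demand, price)
print(f"provision for peak: {peak} cents, pay per use: {usage} cents")
```

With these invented numbers, provisioning for the peak costs more than five times as much as paying for actual usage. The gap grows the spikier the demand is.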

Database technology cannot be just replicated and made scalable for cloud. Instead, companies have built key-value stores such as HBase, Hypertable, Amazon's SimpleDB, BigTable, etc. These give scalable elasticity, but limited consistency and flexibility, which has irked traditional database researchers. Application writers now have to take care of consistency. For example, update user A to be a friend of user B and vice versa: should one handle it via exceptions, or via a persistent queue (and eventual consistency), or.... This gets more complicated. Eventual consistency increases cost due to the increase in programming complexity.
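Here is a toy sketch of the persistent-queue option for the friendship example. Everything in it is hypothetical: `KeyValueStore`, `befriend` and `drain` are names I made up, the store is an in-memory dict that only offers single-key writes, and a `deque` stands in for a queue that would have to be durable in a real system.

```python
from collections import deque


class KeyValueStore:
    """Toy stand-in for a key-value store with single-key updates only."""

    def __init__(self):
        self.data = {}
        self.fail_next_write = False  # hook to simulate one failed write

    def add_to_set(self, key, member):
        if self.fail_next_write:
            self.fail_next_write = False
            raise IOError("simulated write failure")
        # Adding to a set is idempotent, so retrying a write is harmless.
        self.data.setdefault(key, set()).add(member)


def befriend(queue, a, b):
    """Enqueue the two directions of the friendship as separate writes."""
    queue.append((f"friends:{a}", b))
    queue.append((f"friends:{b}", a))


def drain(store, queue):
    """Try each queued write once; failed writes go back on the queue."""
    for _ in range(len(queue)):
        key, member = queue.popleft()
        try:
            store.add_to_set(key, member)
        except IOError:
            queue.append((key, member))


store, queue = KeyValueStore(), deque()
befriend(queue, "A", "B")
store.fail_next_write = True
drain(store, queue)   # B now lists A, but A does not list B yet
drain(store, queue)   # the retry succeeds and both sides agree
print(store.data)
```

Between the two `drain` calls the store is visibly inconsistent, and the application has to tolerate that window. That is the extra programming burden eventual consistency puts on application writers.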

There are some general principles of cloud from the data point of view. (a) Use distributed consensus to have implicit, hard consistency on system metadata, and leave application state out. (b) Limit application interactions to a single node, obviating the need for distributed synchronization. (c) Decouple ownership of data from data storage so one can do lightweight transfer of data and control.

Divy gave three examples of his projects at UCSB addressing data management issues in the cloud.
  • Data Fusion. GStore: efficient transactional multikey access. This is an atomic multikey access for applications such as online multiplayer games, collaborative applications, Facebook friends etc. Some commonalities with Google AppEngine, which allows MegaStore for multikey atomic access. Challenge they solve: when grouping multiple keys and taking ownership, there could be failures. Showed experimental results with EC2, with groups of size 10s or 100s. Paper at ACM SoCC [Symp on Cloud Computing].
  • Data Fission. Make RDBMS cloud friendly. Idea: view the database as a collection of partitions.
  • Elasticity: Missing notes.
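To get a feel for the grouping idea in the first bullet, here is a single-process toy of my own. It is not the GStore protocol, which has to cope with failures across machines. The names `OwnershipTable`, `create_group`, `transact` and `dissolve_group` are all hypothetical. The sketch only shows the all-or-nothing flavor: a group either takes ownership of every key it asks for or of none, and multikey updates are allowed only inside a group.

```python
class GroupingError(Exception):
    pass


class OwnershipTable:
    """Toy model: each key is owned by at most one group at a time."""

    def __init__(self, keys):
        self.owner = {k: None for k in keys}  # key -> group id, or None if free
        self.values = {k: 0 for k in keys}

    def create_group(self, group_id, keys):
        """Take ownership of all the keys, or of none if any is unavailable."""
        acquired = []
        try:
            for k in keys:
                # Unknown keys and keys owned by another group both fail here.
                if self.owner.get(k, "missing") is not None:
                    raise GroupingError(f"key {k!r} is unavailable")
                self.owner[k] = group_id
                acquired.append(k)
        except GroupingError:
            for k in acquired:  # roll back the partial ownership
                self.owner[k] = None
            raise

    def transact(self, group_id, updates):
        """Apply a multikey update; every key must belong to the group."""
        if any(self.owner.get(k) != group_id for k in updates):
            raise GroupingError("update touches a key outside the group")
        self.values.update(updates)

    def dissolve_group(self, group_id):
        """Release every key the group owns."""
        for k, g in self.owner.items():
            if g == group_id:
                self.owner[k] = None


table = OwnershipTable(["p1", "p2", "p3"])
table.create_group("game1", ["p1", "p2"])
try:
    table.create_group("game2", ["p3", "p2"])  # p2 is taken, so p3 is released
except GroupingError as err:
    print("grouping failed:", err)
table.transact("game1", {"p1": 5, "p2": -5})
table.dissolve_group("game1")
```

Once a group is formed, its multikey updates touch a single owner, which matches principle (b) above.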
Concluding, Divy emphasized that cloud is a great leveler: 3 engineers in India can make a difference. Looking ahead: what if computing needs multiple data centers, or the network edge needs to be integrated beyond content caching (i.e., a hierarchy of centers)?

Q: What about security of cloud infrastructure? It is the age of cyber wars. A: Replication helps. Threats always exist; we have to mitigate them. Q: What about distributed objects, i.e., state existing across multiple machines? A: For the future. Q: What are the connections to scientific computing? A: Cloud has been successful by focusing on fine-grained parallelism. We don't have good abstractions of scientific computing. In some sense, cloud is an evolution of grid computing. Grid did not have virtualization; this is solved in cloud. Q: Notion of data independence, is it important? A: We are preaching to people who say consistency is not important; we have to start there. Q: Virtualization, why was it not more central in the talk? A: Virtualization is not carefully studied in the database community (except JDBC); what does it mean for D/B? (Virtualization has been studied in the processor world.)

Great discussions. Divy brought seriousness and reasonableness to discussions about the cloud.

Beyond the plenary talk, I gave my VLDB tutorial on Data Management Problems in Ad Auction Systems. The talk was in two parts: the first on the background math of auctions, the second on novel data mining problems. The first part may have turned off many; the second part had a smaller audience, but generated serious discussions.

Finally, I liked a paper from TRDDC where they proposed methods to tag entire sentences as being specific or general. The application was to analyze a survey about the company cafeteria filled out by employees. Some employees were terse, others verbose. This work had to analyze free text and, if I recall right, they used SentiWordNet. Nice study! Also, there was an interesting, detailed study from Yahoo! Bangalore on extracting attribute values from a webpage (like addresses of businesses).


Tuesday, December 14, 2010

Notes from India

On Indian highways, one sees signs such as "Watch out for traffic in the wrong direction", or "No bullock carts on highways", and heavily loaded trucks move at a snail's pace.

A while ago I wondered, what is the height of individual consumerism in coffee, e.g., is it having sophisticated coffee/espresso machines at home, or grinding one's own coffee beans, or getting beans roasted and delivered fresh? Going beyond, I would love to have a coffee roaster in my living room, a scheduled roasting that would spew out the beans in the right amount each morning. :) In a similar vein, what is the height of individual consumerism in travel in India by car? Is it renting one and driving it oneself, or hiring a driver with the car? One of the drivers recommended that for my two-week stay in India, I buy a Nano, use it while I am in India, and get rid of it before I leave!

The main street in Indian villages is not a street, but the Banyan tree. It stands spread out, carries scars of brushes with cars and cows, has marks of deities and children born, houses a shop or two selling tea and tchotchkes, and is a gathering place for men and women to avoid the heat, or squat, chew paan and socialize.

On the flight, an announcement: "Aircraft ke Captain ne Seat Belt light on kiye hein." (Roughly: "The aircraft's captain has turned on the seat belt light.") So much for Hindi.