Sunday, July 31, 2011


It occurred to me --- talking to professors about their students and labs researchers about their interns --- that there is a tendency to help those who can swim to swim faster and further, but how much do we help drowning students get ashore (at grad school level)?

Personal: Parenthood

Friends ask me how is parenthood. Like grapes. When I was little, even though I dont recall it, I am sure my parents peeled the skin and deseeded each grape for me. Later, when I was a youth, even if I dont recall it, I must have picked grapes off for my romantic interest and shared a moment of togetherness like in music videos. These days, as a parent, I watch the kid so she doesnt choke on a grape, and eat whatever is left over so the plates can be washed. And if I had more time, this blog post would have been a haiku.

Algorithms in The Field (Update)

Some updates:
People ask me about algorithms in the field, and I say, "it is getting theory and applied researchers working together so early that you can not tell what came first, friends or facebook". :)


Two hats from Twitter

People observe twitter for various reasons. I recently observed that in the research world, twitter, because of its policies that enable easy data access, has a great mindshare (just search for twi in ICWSM 2011). This blog is about two talks on twitter, by Ashish Goel, with two different hats worn comfortably.

Talk 1 was at Workshop on Ad Auctions on Monetization. He started with the observation that twitter supports both highly individualistic view so you can follow an individual or the aggregated view via tags and trends of many. He spoke about 3 products: promoted accounts (on the right, to follow), tweets in search (on top) and trends (on the right, via hashtags). The guiding philosophy with promotions is that they should conform to user's expectations of tweets and have follow-on effect. Ex, promoted trend leads to conversations, promoted accounts continue to deliver future impressions, promoted tweets lead to retweets, etc. Ashish released some impressive numbers on how these efforts were going, and few examples.

He then spoke about challenges. On the strategic side, there is need for novel promotion ideas. Twitter is not about time spent on the site, so these should be beyond display ads, and encourage twitter experience and follow-on. On the algorithm side, challenge is to work with a short window of time/data, and estimate one's current interest as well as estimate future follow-on effects of promotions, machine learning at the speed of thoughts/tweets. On the auction side, the challenge is to design pricing and allocation for extremes like promoted trends which has to reach many, and promoted accounts that are highly liquid. This talk was structured perfectly, with an overview of the products, main philosophy shown through the interaction between these products, specific numbers and examples, followed by what the audience wanted, namely, technical challenges. Nearly everyone in the audience appreciated one slide, where Ashish gave examples of suggestions from the research community that will NOT be valuable, these are throwoff ideas like, use keywords in tweets to target ads or show promotions in timeline or offer coupons, etc. Chances are twitter which has smart full time folks thinking about these issues would presumably have analyzed the many immediate suggestions already.
Talk 2 was at SIGMOD, where Ashish gave the keynote industry talk, on data models and management issues at Twitter. He started with the mission statement: instantly connect people everywhere to what's meaningful to them.
Problem 1: How to deliver tweets (several k's a sec, each going to more than avg degree of users, and has to be aggregated into timeline). A natual data model will compile tweets into tuples and apply map and reduce (MR) operations. However one needs some sort of continuous mapreduce (MR) system to be real time, but such systems are mainly in academic stage and robust versions have to be designed and made available at large. You could also discover duplicate tweets with a combination of continuous MR and locality-sensitive hashing (see some inspiration here). Problem 2: Search, real time relevance ranking of users and tweets, personalized. In the future, social graph and resonance scores. Some discussion here. Problem 3: Incremental Pagerank, maintain pagerank as edges, nodes arrive. He referenced the VLDB paper with roughly O(nlog n) solution where n is number of nodes, and m, the number of edges has minimal role, improving on the trivial O(nm).
The second data model Ashish explored for these problems was active DHTs (distributed hash tables with user-specified code) which are like continuous MR but with reducers able to communicate with each other. He then discussed maintaining suitable random walks on social graph via active DHTs (see here for db details) and its uses from incremental to personal pagerank. The third model Ashish spoke about was semi-streaming where there is memory to store row-IDs and some properties, but not all matrix entries. This seemed more suitable than "clever" graph partitioning because social graphs seem to be expanders that change rapidly. Many recommendations problems will be of interest in semi-streaming. He summarized with the excitement of mix of database and algorithmic challenges at Twitter.
Summary: These two talks, juxtaposed in time, were among the most informative and insightful I have heard of a tech business, able to move between sound business and solid technical concepts as needed, a rare and impressive achievement for a theory researcher. Ashish is one of those speakers who you appreciate when you hear their arguments, but with every question you ask, he peels more technical layers, and you appreciate how much more he knows and how much more depth there is, in other words, his talk is more than the total of his slides.


Friday, July 29, 2011


Workshop on Internet Tracking, Advertising and Privacy (WiTAP, if you get the pun) took place at Stanford on July 22.

Dan Boneh (who organizers credited with quarterbacking this meeting) started the workshop and introduced the amazing crypto+security group at Stanford, seminar series, mailing list, and link to course notes and certificate course programs.

Bala described his work on busting privacy issues via a series of papers that collect and analyze data, from '06. He said that people dont read papers, but people read WSJ. He talked about things getting worse with each paper, from web to social to mobile. Users assume they interact with first parties (web sites, social networks, telephone companies), but if you dig deeper, for a variety of reasons, second, third and other parties intervene in the conversation between the user and the first party. The reasons for this leakage could be because first parties outsource data collection and analysis for operations or mining or revenue, whatever. He talked about mechanics of leakage: flash cookies that are hard to delete and that respawn; http headers that leak referer/cookie/requestURL; external applications; mobile social networks that leak presence, location and device ID; etc. From leakage, next development may be that aggregators can join across data sources, and beyond this, fingerprint browsers. What can one do? Bala pointed users to the FTC 2010 report on privacy for more awareness. He also pointed toAcquisti's work on separating offline and online identities as well as Economics of privacy. Among questions: research community knows these leakages, why dont public follow? A: May be the tap in UK is a turning point; Privacy expectations are culture specific. A: Study has to correct for countries, china, korea, US.

Ed Felten spoke next, communicated his personal goal to "write a few lines of code everyday" (in my case, when I was a graduate student, the mantra was, "learn one new lemma a day") but talked about his policy work at FTC for monitoring businesses. He communicated how FTC can enforce: deceptive use (misleading claims or use by companies) is easy to regulate, unfair use is harder to define, and in general, before an issue can come up for FTC enforcement, researchers have to go beyond research and develop support from general population for their case.

One of the anticipated talks was by Omar Tawakol of Bluekai. He spoke about how context and sites are no longer the key in Internet advertising, and user data was the new fuel. He stated the premise of his company that data is too much in shadow, consumers are smart and can handle the complexity of data, and how they support opt-out cookies, open source and so on. How much will users pay for ads free: little. He phrased a technical argument that the plumbing you need for (a) user conversion tracking where one measures whether user actually bought/converted some time after seeing an ad, (b) retargeting, the most effective form of advertising after search, where a user seen at once place is targeted elsewhere, and (c ) frequency capping, where an advertiser limits how many times a user sees an ad, are all the same. (a) and (c) are very legitimate uses for advertisers. Does that justify (b) too? Without 3rd party cookies behind (a-c), in particular (b), he claimed that 70--80% of 7B spend on display ads will be at risk. Tradeoff of free content vs ads is not fair because users don't know what they are trading for the free content; if you aren't paying for it, you aren't the customer, you are the product; why don't we turn off targeting?; why not turn off ads altogether? He proposed that these were all good questions. He tried out a technical argument that if 50% turn off ads, other end up paying for their free loaders content and that is not fair. Not clear this argument flies. He said Bluekai has a dashboard so users know what is known about them. He further pointed out that k-anonymity of a dataset does not help quantify the damage when two or more datasets are combined; he argued for some "smart noise" that will obfuscate the data and still keep it useful for advertising. Questions: common pipe may be avoided by doing (a) and (c) at the client browser, without doing (b); q: what is the marginal value of tracking users; etc.

Russell Glass of Bizo spoke about targeting business professionals. He pointed out also that you have to see lot of ads to make small amount of $'s (margins are small in data/display ads). He argued that what the industry needed was certainly regulation, but that will not provide "trust" which is sorely needed.

Dean Hachamovitch spoke about Internet Explorer and argued it was good at blocking malware and pointed to for consumer protection. He spoke about the list data structure behind how users can specify track/dont track sites. Knowing he was talking to research audience, he posed some directions: How to describing and visualize information flow between sites; how to generating, validate and value tracking lists for curators; etc. Questions: couldnt multiple tracking list conflict each other in white/blacklisting same site? Q: Do you see visualizing information across sites as a browser problem? He wanted a more general solution across vagrancies of the browser.

Narayanan spoke about Adnostic where behavioral profiling and targeting takes place in the user's browser. The ad network remains agnostic to the user's interests. Jonathan Mayer spoke zealously about DONT Track feature (8 byte request in http header?). Paul Francis spoke about privad, a privacy-preserving way to see and be shown ads. Nina Taft did a great job of describing research goals of the new Technicolor Labs in Palo Alto (great location, people and interns!). She had math on her slides, describing differentially private SVM algorithms. She also posed problems: how to model users in a home entertainment scenario; how to do social recommendation at home where users share resources; etc.

Altogether, a good set of talks, a very large audience (200+?), and an important set of topics, I enjoyed the workshop.


Sunday, July 17, 2011

Pot X

Yesterday at dinner someone said, in Australia, we have "pot plants", and someone else said, in UK, we have "pot pie", and I wanted to add, in India, we have "pot Poori" (no typos), but good taste in conversation made me bite my tongue and swallow the words.

In a NY/NJ train, the youth in baggy clothes sprawled on the seat in front of me, turns to me and asks urgently (which I understood after asking him to repeat it a couple of times), "how do you spell, shield, man, shield, you know, shield?". I spell it, and he types it into his SMS. Shield? Um....

The blonde woman dressed for the club sits next to me at Starbucks and yells into the phone (profanity withheld), "I am a lousy driver, I dont drive, why dont you just tell me where to go, am parked like right here". Long conversation like this, help from me, checking GPS etc, and then she tells me, "Oh, I am so close to my friend's place, it is only 3 miles". In NY city, in 3 miles, you can go from Upper West to Upper East, Village to Midtown, Apollo theater to City center, all along wondering why you would want to do it!

On a NY/NJ train, at a certain station that locals can guess, a man gets on, sits behind me, looks around to take in the scene, watches me read the newspaper, leans over in the little while and asks, "that the wall street journal?", I nod yes, and he says, "they shrunk the size?", I say, "yes, a while ago", and he says, "yeah, been inside for long" and chuckles.

In San Francisco, I get a drink, and this old gentleman engages me in a conversation. We talk about many things and at one point, speaking of why his PhDed son didnt choose to work in research, he says, "you have to really want to write papers you know, and publish them." Simply said, and true. Later, I ask this 70+ year old man what he wants to do, and he says, "Go to NY city, don't know, stay there for long, for two weeks or something, and soak it in."

Thursday, July 14, 2011

Demos, Bs and word humor

What does "demo" mean? First time I asked this question, I remodeled my apartment and the contractor said they will demo. I thought, with graphics? Since then, I have learned that "demo" can mean: demonstration (garden CS kind), demolition (mo contractor kind), demote (an ad), demographic (again, ads), ...

Today over lunch I faced beer pressure. (no typos there)

I get confused between the B "triangle of beaches: Belize, Bermuda, Bahamas and Barbados.

Tuesday, July 05, 2011


PODS 2011 (a database theory conference called Principles Of Database Systems) has 1 plenary talk and 2 invited tutorials. I gave a tutorial on data streams research. My talk slides are here. It is difficult to present a tutorial on an area in 1 hour, and in particular, on a topic like data streams that has seen a lot of research, and in many different communities. My tutorial presented one technique (count-min sketch) and its applications, and emphasized the role of database community on this area (slide 32). The question of one pass computing, the motivation for dealing with large data streams, and most important for me, systems that actually use streaming in practice and validate the research effort of nearly two decades, all came from this community. I argued that their questions and drive eventually led to not only good theory of CS but also impact outside CS such as in signal processing community, where to at least some extent, compressed sensing results could be obtained by applying the count-min sketch. I would like to believe that this is a new vantage for the database community to value their technical impact (not the use of their technology) beyond CS. Some notes:
  • Tova Milo gave a nice plenary talk on business process data management (imagine queries to travel agency like, can I get a price quote without giving first my credit card details?), nearly all aspects from data models to query language to logic, optimization, provenance and structural/module privacy. More info here.
  • I spent some time with Raghu and Divy. Raghu has done everything, and it is great to pick his brain about the potential evolution curves of MapReduce, BigTable/Cassandra/ and others. Divy is super open to discussion on any topic, and we continued our past discussions on auction and pricing, this time more focused on selling data. I also spent time with Divesh, charming, incredibly knowledgeable with footprint all over SIGMOD as usual, and the ever-clever Alon Halevy (I cant break his story yet!).
  • In PODS business meeting, it was suggested that number of emails be used as a measure of complexity of organizing a meeting.
  • Finally, PODS overlapped with the day of protests and general strike in Athens. Normally the Athenians (and Greeks in general) have an alluring air of anarchy about them, and this day exaggerated the effect. I mistakenly walked into the intersection with tear gas, and not only teared and had a serious headache afterwards, but also spent the flight back to NY with no feelings on my left leg. Notwithstanding that, I was happy to be in Athens where the modern world began.
The database community likes to dress up, dance and party, and the banquet was indulgent, transforming bag-carrying, laptop-checking researchers into socialites. A 100 year old nonprofit group performed traditional greek dances --- simple to complex leg movement, with simple stringed instruments. Over the banquet, I managed to catch up with Peter Buneman and Johannes Gehrke, and talked about the EU research program on human brain. The question was, what is a big database research issue in human brain research.
ps: Missed going to this bookstore to look for Arkas in English.


Monday, July 04, 2011

FCRC (Google Meeting)

At FCRC 2011, Kate Berrios, Lisa McCracken and others of Google organized an event for academic researchers --- students and faculty --- to meet with Googlers. There was an unusual format with 10 or so short 3 slide presentation of projects and their major challenges by googlers and then each of the presenters was available at a table for an hour+ of discussion and questions. Some notes:
  • Infrastructure: A web service request gets translated into 100's of parallel machine requests. Rare events (1 in million) happen! Fiuring out timing of processes across machines is a challenge. Performance problems in reality is hard to reproduce in labs because of the scale and complexity of real life.
  • Cluster management: external view of the cluster is simple, internal view not as much. Each service depends on scores of others. Anything new introduced makes some people's life better, everyone's life simple bit worse. How do you move an application from one data center to another, when each depends on scores of other services? Failures are a reality, make them first class objects in thinking about systems. Load variations are enormous, 100B to 6 TB per request in some cases.
  • Profiling systems at scale. Goal: 0% overhead for profiling across systems, platforms, languages, versions, and it has to be stable and everywhere? Very few benchmarks because world constantly changes and cant be reproduced. Use lot of open source tools and publish them.
  • Social: social interactions are more than 2 people, or edges. Social behavior does not happen on graphs. Directionality, symmetry, organizational structure, etc. Are users comfortable with sharing: any measure of this and associated utility?
  • Android security: Malware problem due to being popular and also open (due to large numbers). A problem: find andorid malware (client side? power-aware?). Malware perform redundant operations to be deceptive. Prob: Strong authentication methods for cellphone as a personal authenticated device?
I spoke about challenges of ad exchanges, and afterwards, at the table of market algorithms researchers --- Costas Daskalakis, Mukund Sundarajan, Gagan Goel, Vahab Mirrokni, Darja Krushevskaja, and others --- tried to shepherd the conversation to focus on large technical challenges, with the somewhat controversial hypothesis: "Research labs have given researchers good research problems, but not great ones. To engage the best of the research brains, you have to able to pose truly great problems." (In response, Costas said, "Archimedes was interested in squaring the circle." :)) Some of the discussions that ensued:
  • Bayesian vs worst case values. Of course! We discussed Leonard Savage's work on Bayesian assumption. Do advertisers know values? Who knows priors, bidders or mechanism. Can we design mechanisms which will work in adversarial as well as Bayesian priors nicely?
  • Multidimensional revenue optimal mechanisms, problem to solve. Correlated bidders.
  • Auctions that optimize not expected revenue but risk aversion (see eg. Mukund + Qiqi's work), say concave utilities and say other stochastic measures besides expectation? Challenge is also conceptual, what is a suitable benchmark?
  • Automated mechanism design?
  • Repeated games. 2 bidders with budgets, repeated second price auctions, what happens? Bidders learn, CSists might have a better handle on this theory than say Economists.
  • How to integrate auctions and machine learning? Of course! Variations in input vs predictions needs to be modelled. Online solutions vs where ML applies.
  • Big problems, eg., can we provide guidance on how science budget should be allocated among various disciplines, or NSF CS budget among different areas? Given a subset of researchers, say we can estimate their impact on the society when funded. Given this oracle, can we allocate funds to people to maximize social welfare? Can we model people switching teams in second round or open bid systems for reallocating funds? Q: Why doesn't NSF give $'s to 2 teams for the same project and get them to compete? For some recent work, see the work of Shay Kutten, Ron Lavi and Amitabh Trehan.
  • Level k thinking and dynamics. We discussed 2/3rd of the average game (did you know, there was a museum of money?). If you have a wrong model of competition (say you think you play against MIT students, but we are not), then you can do wrong things even if you are rational. Vahab mentioned several results on level k dynamics and ad auctions.
  • Assumption about others. Can one design a mechanism that is independent of assumption about others, robust to beliefs, solution concepts, and despite sophistication of bidders. For ex, some black box where myopic learning leads to good equilibrium?
  • Abstract a model, deisgn incentive-compatible mechanism, go to real world, property doesnt hold. We need a notion of how incentive-compatible are bidders. Need a useful analog of approx for incentive compatibility.