Thursday, February 25, 2010

Data Streams: Where to go

I took time off and wrote a whitepaper on where data streams research should go. I will keep editing it and add more meat, bones and potatoes over time. Thanks to Graham Cormode, Sudipto Guha, Andrew McGregor and Jelani Nelson for comments.


Sunday, February 21, 2010

Ad Exchange Tutorial at EC 2010

I will be giving a tutorial on ad exchanges at the upcoming ACM EC conference (June 7--11, colocated with STOC 6--8 and Complexity 9--12). I tend to have the perspective of DoubleClick Ad Exchange for selling impression ads (e.g., my paper), but realize other exchanges such as RightMedia and AdECN are somewhat different. I want to include as much information as possible about the various exchanges and the perspectives of buyers (ad networks) and sellers (publishers), from both research and business angles. So, whether you are a researcher writing a paper on optimizing ad exchanges or their auctions, an ad network expert interacting with the exchange, a business leader with a perspective, or a salesperson wondering how this alternative channel will impact work, please contact me and send me material. I will sift through it and include what fits in the tutorial.


Saturday, February 20, 2010

Quite Interesting (QI): Game theory explained

Stephen Fry explains game theory, truels, and advertising. We need him to explain the UGC.

Friday, February 19, 2010

Networkia Academia is creeping into search results. It is a social network with a (schools/depts/faculty/graduate students) X (papers/interests) X (comments/follow/like/friends) structure. They have a blog, jobs, and of course they tweet. It doesn't seem to be popular among CS folks.


Congratulations to Michael R. and others!

The 2010 Dan David Laureates have been announced:
  • Michael Rabin, Leonard Kleinrock, and Gordon Moore
for Computers and Telecommunications share $1M. In literature it goes to two of my faves:
  • Margaret Atwood, and Amitav Ghosh (go Amitav!)


Vignettes on Simple

Buzz aside, emailing is revealing. While emailing, I happened to browse things related to my email trail, and found things that have simplicity and beauty:
  • Craig Nevill-Manning has a blog. It is not updated often, but it has a feel: it is things Craig does, almost like little adventures: simple coding, some data analysis, a few beautiful photos, coffee and music.
  • I noticed that Mihalis Yannakakis gave a talk on "Automata, Probability, and Recursion." A simple title with the cadence of three theory pillars falling in line.
  • Rectangles are simple, and find themselves in Art from Mondrian, Hopper, to Rothko and beyond.

Sunday, February 14, 2010


In a moment in The Office, Jim appears with Face/Book written on his face and his colleague thinks it is BookFace. Anyway, here are two thoughts:
  • Heard around the town: "Before facebook, the Internet was a niche phenomenon".
  • Over dinner, I said: "Man is a social networking animal." (Ref., Aristotle?) Contrary to suppositions, I am one of only two to say this on the web so far.


Late thank you for Holden: nearly every day I carry his voice of alienation in my head and see NY through his eyes, the pond and all else. Thanks for the ephemeral Phoebe, Allie in the background and the Bananafish. I don't want to be a phony, so I will stop with some stories here.

Saturday, February 13, 2010

CS as Service in Univs

Some of you may have thought about this more than I have, so this is a blog for help. The premise is that Computer Science (CS) is useful for nearly all other disciplines (Engg, Sc, Arts, Law, Business, whatever). Formally, students in other disciplines not only use computers and software, but often have to create software for their tasks, use sophisticated algorithmic primitives etc. So, they need to understand CS.
  • What and how much CS should others know?
  • What role should CS depts play in the University? For example, do other depts teach CS courses, or does CS pull a Math and teach basic CS courses across disciplines?
  • What are the ways to quantify the impact of CS depts across the Univ in this context and how to improve the impact (eg, joint research with other disciplines helps or hurts)?


Big Data

A bunch of things happened that made me think about big data.
  • A dinner conversation yesterday brought up NSF's general emphasis on data-intensive computing and specific examples like the center on Foundations of Data and Visual Analytics at GaTech or supporting infrastructure for Hadoop/MapReduce/Azure in academia. The following are given: there is a lot of data, a lot of researchers want to analyze it, there are real performance bottlenecks in the analyses they need, and we are in a position to throw processors at them. But IMHO, the CS research community has not yet abstracted clean models (for building, for analyzing) that will truly address this state of the world. We need a Valiant-esque insight and effort here, something that threads between the H/M/A approach for special tasks, PRAM/BSP for general parallel computing and the currently popular multicores. The traditional appreciation of the costs of moving data and computing state between processors, or of synchronicity, doesn't hold, and one has to endogenize the fact that reading from a remote processor's main memory is cheaper than reading from local disk given the communication infrastructure in data centers.
  • Then there is the practice of handling large data. There is a new DHS center for Advanced Data Analysis (CCIADA) at Rutgers U. I am the Director of Data Research. Many organizations across the country have various datasets with complex rules for access, use, and at a higher level, what can be inferred from them. IMHO, the research community is far from formulating a model for working with such data constellations, with data mapping, provenance, and trust issues, a model which will support some algebra on top of the data and instinctively automate data handling issues. This is a big bottleneck for research to flourish.
  • Finally, gmail ads are an interesting beast. Sometimes I am unaware of them, sometimes I find them entertaining, trying to figure out what drives ad systems to map my emails to the specific ads they show, and once in a while an ad sneaks up on me and I follow the lead. This morning I saw an ad for the Big Data Summit (last year's here). Like other industry meetings on this topic, this meeting too seems to bring together the right players who want to solve the problems, but I am not sure the industry has novel insights into the big problems here. Apologies for speaking without going to the summit.


Friday, February 12, 2010

Labs Viz

I was browsing through home pages of various corporate research labs and their researchers, and liked AT&T Research's new pages. Each researcher's page has a GraphViz rendition of their coauthors that you can hover over and select.

ps: Totally unconnected, I was thinking about research hyperboles. We have already worked on problems that impact "billions" of dollars. Next, papers will claim to solve problems that impact "trillions" of dollars (applications to US govt)?


Thursday, February 11, 2010

FOCS 2010

The call for papers is out for the Foundations of Computer Science (FOCS 2010) conf. Deadline: April 7, 2010, 7pm PST (less than 2 months away). Notification: June 29, 2010. The conf will be in Las Vegas, October 23-26. There are no page limits for submissions, but there is a prefix clause.


Sunday, February 07, 2010

WSDM, Vowels Ltd.

The WSDM 2010 conf took place at NYU/Poly last week. It is a conf primarily focused on web search and web data mining, with a strong attendance of 200 or so. It was good to see the Yahoo! folks (Andrei, Evgeniy, Vanya, Sergei, Baeza-Yates, ...), Microsoft folks (Rakesh Agrawal, Rina, Sreenivas Gollapudi, ...), some of the alg/database folks (A* Das Sarma brothers, Laks L, John Byers, Tomasz I, ...) and students (Aleksandra, ...).

Soumen Chakrabarti gave a plenary talk. Between the two extremes of current web search based on query terms and, say, a natural language or SQL-complete query language, Soumen identified a language (S-language?) with variables, predicates, and certain aggregates, suitable for web searches that are now a challenge. Then he discussed the main tasks in supporting this language. These include spotting (entities+context => larger labels, called spots), disambiguation (by connecting spots to Wiki) and ranking (proximity models + context scoring -> ranking, e.g., by reducing it to selecting rectangles from 2-dim points). This S-approach ultimately generates billions of microlinks between web pages and creates new info pathways in the web. Tom Mitchell asked, what is the weakest link in this approach? Quantifying the accuracy of various annotations (in particular, in terms of search accuracy) is hard. Of course, ranking beyond single snippets is a challenge that, to be formal, needs reasoning about probabilistic data. Other questions were about how the S-approach differs from the semantic web (which orders the web, whereas this approach works with the disorder), NLP (there are other languages here besides natural languages, such as the language of tables, site organization, etc.), IR TREC, Question & Answer systems, and so on.

Susan Athey gave a plenary talk on Ad Marketplaces. She started with the mantra of Economics Theory + Empirics + Experiments => exciting world of ad auctions, exciting even for Economists who have seen big successes with auctions in reality. She gave a high level view of auction-based platforms: aspects of information feedback and dispersal that makes them unique, the objectives that include long term participation and 1st order issues of competition, etc. She then argued for building structural and behavioral models for ad auctions so we can learn counterfactuals and predict out-of-sample situations, all disclaimers notwithstanding on the challenges of making such models stick. The bulk of her talk was on her forthcoming work with Denis Nekipelov on "Equilibrium and Uncertainty in Sponsored Search Advertising", which follows this methodology. One of the outcomes of this analysis is the guidance we now know well in sponsored search that bidders need to operate where marginal cost per click equals their value, but the more interesting outcome seemed to be that they could plot data for what would happen if we had 20% more bidders and other what-ifs. Also, stochastic budget optimization problems pop right out of their models. Tom Mitchell asked why there were negative dips in the plot (I gave Tom the answer after the talk: when you increase bids, you qualify for new keywords with higher reserves and spend more per click, reducing total clicks for your budget; there are other discontinuities in the system). I asked if they had looked at applying this methodology to display ads (vs sponsored search), which is a bigger beast, not yet understood, where the potential for new impactful work is much larger.
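The answer to Tom's question about the dips can be seen in a toy calculation. What follows is only a sketch under assumptions of mine, not the model from the Athey-Nekipelov work: two hypothetical keywords, a bidder who pays exactly the reserve price per click, and a budget that is spread uniformly over all the traffic the bid qualifies for. The names `Keyword` and `clicks_for_bid` and all the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    name: str
    reserve: float  # minimum bid needed to enter this keyword's auction
    clicks: float   # clicks per day available if the budget were unlimited


def clicks_for_bid(bid, budget, keywords):
    """Clicks bought per day at a given bid under a fixed daily budget.

    Toy assumptions: the bidder pays the reserve price on every click, and
    when the budget cannot cover all eligible traffic, spend is throttled
    by the same fraction on every eligible keyword.
    """
    eligible = [k for k in keywords if bid >= k.reserve]
    if not eligible:
        return 0.0
    full_clicks = sum(k.clicks for k in eligible)
    full_cost = sum(k.clicks * k.reserve for k in eligible)
    if full_cost <= budget:
        return full_clicks
    # Budget binds: only a budget/full_cost fraction of the traffic is bought.
    return full_clicks * budget / full_cost


# Hypothetical data: one cheap keyword and one with a much higher reserve.
KEYWORDS = [
    Keyword("cheap", reserve=0.25, clicks=100.0),
    Keyword("pricey", reserve=2.0, clicks=100.0),
]

if __name__ == "__main__":
    for bid in (0.1, 0.5, 2.0):
        print(bid, clicks_for_bid(bid, budget=50.0, keywords=KEYWORDS))
```

In this toy setup, raising the bid from 0.5 to 2.0 makes the bidder eligible for the pricey keyword, the average cost per click jumps, and the same budget buys fewer clicks in total. Real systems have other discontinuities as well, so this captures only one source of the dips.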

  • My small contribution (with Sergei and Sihem) to WSDM was to propose The Park for the banquet. The space worked out well, drinks flowed, food worked out less well.
  • There was a tweet stream running live on the screen during the talks. The speakers don't get to see the stream! :) There are a bunch of blogs about WSDM, e.g., Daniel.
Some random thoughts:
  • Did Wiki save IR research?
  • Is there a non-vector of features (bag of words) view of the world in IR, please?
  • Michael Jordan saves NBA, speaks for Nike and is a perennial web search example.
  • First price auction is the whipping post of ad auction research.


Wednesday, February 03, 2010

Research Grants

FYI: Some large research grants to universities from Google, from here (congratulations to all):

Machine Learning

  • William Cohen, Christos Faloutsos, Garth Gibson, and Tom Mitchell, Carnegie Mellon University

Use of mobile phones as data collection devices

  • Gaetano Borriello, University of Washington and Deborah Estrin, UCLA

Energy efficiency in computing

  • Ricardo Bianchini, Rutgers, Fred Chong, UC Santa Barbara, Thomas F. Wenisch, University of Michigan, Sudhanva Gurumurthi, University of Virginia
  • Christos Kozyrakis, Mark Horowitz, Benjamin Lee, Nick McKeown and Mendel Rosenblum, Stanford
  • David G. Andersen and Mor Harchol-Balter, Carnegie Mellon University
  • Tajana Simunic Rosing, Steven Swanson and Amin Vahdat, UCSD
  • Thomas F. Wenisch, Trevor Mudge, David Blaauw and Dennis Sylvester, University of Michigan
  • Margaret Martonosi, Jennifer Rexford, Michael Freedman and Mung Chiang, Princeton


  • Ed Felten, Princeton
  • Lorrie Cranor, Carnegie Mellon University
  • Ryan Calo, Stanford CIS
  • Andy Hopper, Cambridge University Computing Laboratory