Sunday, May 27, 2007


I was wearing a brown shirt with a print of an old couple, she seated, skirt flaring, and he standing behind; I always thought it was an African image. Recently, when I got off the airport at Albuquerque, this woman looks at it and says, "From Tibet?". I say, "Brooklyn" meaning I bought it in Brooklyn, and then adds, "Brooklyn market is universal." She guesses and says "Bengali?" because I guess I pronounced it as "univershal", converting s's into sh's like Bengali's do, and like linguistic mutts like me do. I guessed and said, "Si".


There is broccoflower, a cross between Broccoli and Cauliflower (from Northern Italy, according to Whole Foods). Some say nectarines are a cross between plums and peaches, but there is no evidence of this on the web. Recently I got to eat a cross between a flaky croissant and a dense, chewy Pretzel, yum!

Thursday, May 24, 2007

Data analysis

Data analysis presents many problems in practice. One has to collect all the relevant data first, clean it because most data sets have many problems (missing values, mismatched items, nulls, fictional entries, whatever) and explore the distributions before one can say anything meaningful. We need more research on putting some principles behind this process of data cleaning and addressing the data quality problems.

After the (hunting) gathering and cleaning part, one gets to the fun. Vincent Matossian recently sent me an analysis of the coauthor relationship from CiteSeer, in particular, a degree distribution of coauthors. Also, he sent me a picture of my coauthors as an example (I am sure there are other examples which are more fun).

Wednesday, May 23, 2007


There is a heap of results on clustering problems, many differing in just how one measures the quality of clustering. Even as the community struggles through these variations, thinking through them and applying them to practice, one line of research I liked is the (somewhat?) left-field approach eg in An Impossibility Theorem for Clustering by Jon Kleinberg. Roughly, this paper sets up some basic requirements of any clustering procedure without nailing down optimization criteria and shows impossibility results for determining a clustering that meets all such requirements.

In a somewhat unrelated world, a question I would like to see addressed in such a fundamental theoretical way is related to sponsored search mechanisms. These mechanisms have different properties based on whether one works in the world of impressions, clicks or conversions (ranking and/or pricing), and there is a lot of tension based on which world one adopts. Is there a general way to reduce these worlds to each other, so equilibrium properties and mechanisms we know for some will get related to others? The question is vague and in limited cases, there will be negative answers, but a nifty approach here will be cool.

Bid dynamics in sponsored search

You are an online advertiser and have a limited budget to bid on many different keywords in sponsored search auctions. Different keywords have different traffic, click-throughs and competition. What is your best strategy? In an upcoming EC paper, Jon, Martin, Cliff and I showed that just bidding uniformly (say the same bid) across keywords does quite well. There are details of course and there is modeling involved in setting up the problem.

A very interesting aspect of this question is what happens if all advertisers behave in this manner? In a nice paper by C. Borgs, J. Chayes, O. Etesami, N. Immorlica, K. Jain, and M. Mahdian in WWW 2007 titled Dynamics of bid optimization in online advertisement auctions, the authors use insights from dynamical systems and produce certain perturbations of such uniform bidding schemes that converge and have interesting market equilibrium properties. It is a paper worth reading, lot of open problems remain.

Monday, May 21, 2007


You know the nightmares students have about finals. I typically have curious variations. Sometimes I dream that I start the finals, see the problems, and before I start doing anything, somehow the classroom transforms into social life, and then, much later, I find myself back in the classroom, only this time, it is near the end of the exam, I haven't even begun writing down the solutions, and I am exhausted from the long "dream". Recently, I saw another variation, this time as a professor: I am late and I am rushing to give a lecture, preparing what I want to say in my mind, I nearly leave my things behind on the campus bus as I race to the classroom, and then just before I enter it, relieved to have made it, I realize that it is the day of the finals and I do not have to lecture, but have to write an exam for the students! Sigh, I detour to a laundromat nearby and start jotting down problems for the finals on napkins -- recurrence, data structures, graph problem, one very difficult problem, ...

Sunday, May 20, 2007


I discovered a nice article on the history of Mathematics at Rutgers. Enjoy! CS makes a cameo. I like the first para, here it is, quoted:

Mathematics was present from the very beginning at Rutgers. To illustrate this point, consider the following items. The first math major (De Witt) helped win the Revolutionary War with his surveying. The first professor at Rutgers was a mathematician (Adrain), and his salary was endowed with a 2-year lottery. The most famous mathematician ever associated with Rutgers was an undergraduate (George Hill). The Land Grant status of Rutgers was due to a mathematician (Murray) and George Cook (of Cook College). Finally, numerous buildings and roads at Rutgers bear the names of people who have taught mathematics here: Freylinghuysen, Taylor, Murray, Van Dyck, Titsworth and Morris.


David Lodge manages to capture the madness of summer travels in academia, in his trilogy. Changing Places, for example, is a novel about a professor from an English university (hint: U. Warwick) exchanging places (and politics and finances!) with a professor from an American university (hint: U Cal), and is an amusing read; Nice Work and Small World, continue the storyline and are blander.

I have been tightfisted in commiting to travels this summer, in most cases, squeezing in no more than the number of days I need to be away:
  • SADA in Finland (may 25--31). Summer school. 1 lecture on AMS/CM Sketch and applications, 1 lecture on Inverse sampling and applications, and 1 lecture on new topics in data streams such as probabilistic streams and mapreduce computing.
  • SIROCCO in Italy (jun 5--8). Keynote talk on Data Stream Research. Goal is to give an updated talk, sorta Data Streams 2.0
  • FCRC (jun 12--14). Talk on optimal suffix selection at STOC and learn from the kaleidoscope of talks in EC, STOC, Complexity, SIGMETRICS, COLT.
  • MAPSP (july 1--5). Invited talk on Issues in scheduling sponsored searches. I have been putting together a talk on set of mechanism design/optimization results related to sponsored search, this will be first attempt at getting it out.
  • von Neumann Symposium (july 8--9). Plenary talk on Algorithmic issues in sparse representations. Will focus on compressed sensing and low rank matrix approximation. Kickass group of speakers in the conf, I am so looking forward!
  • CPM (jul 10--11). Invited talk. I will focus on new results and problems in string matching which is forever going through a limited renaissance in its cult.

Wednesday, May 16, 2007


The 50's film Salaire de la peur, Le (Wages of fear) by Clouzot shows how to blow things up in an otherwise leisurely moment; Pacino's "Never me" moment with Benny Blanco from the Bronx promises to burn, and does so, later in the movie Carlito's Way. The beautiful music helps swords explode in minds in the later half as Jet Li plays the Hero; Beat Takeshi, laughs, goes to war in the movie Brother.


I am a fan of contemporary dance, in its inventive form (alvin ailey, see the end!) or sublime (merce C, savor!) or simply brooklyn (AA, again, I like her voice).

I won't be around, but if you are in NY, schlep to Brooklyn for BAM's DanceAfrica festival.

Sunday, May 13, 2007


When people try exotic meat -- say Alligator -- and talk about it later, they usually say, it tastes like chicken (or not). It is the benchmark. Chicken has its own exotic variations, see the blog description of chicken bones, heart and liver at NY's Yakitori Totto by Ruhlman. A hen proves to be quirky in the must-subscribe foodie farmgirl fare. No talk of chicken is complete for me without discussing steak. I remember an ad for a steakhouse that (roughly recalled) said: "Terrifying vegetations since XXXX", with an imposing, full page picture of a steak knife. I sleuthed a little to track this ad, which seems to have a controversial history and its variants.

A Journal

In everyday world, when I mention a "journal", people imagine I am talking about a diary. Of course, we researchers are not in any everyday world. :)

I wondered how the open access journal on Theory of Computing is doing: it was activistic to start this journal, and I see some good papers on the website, but traffic seems low (13 papers in 06).

Wednesday, May 09, 2007


Reaching out and communicating with others is important to me, and I suppose, it is, to many others. But most communication is typically intrusive. Telephones for instance. A tel call may intrude in the middle of a conversation, meal, or a moment. And I feel guilty when I miss a call and see the voice mail indicator. In some ways, emails are intrusive as well. You can't resist peeking at them at night before going to sleep or first thing when you wake up, and an email in the inbox that excites or irks, can spoil the night's sleep or the waking workday. Still, phones and emails are what I live by.

What truly irks are the email requests for reviewing some paper that point one to some website where one needs to type in some information, just in order to decline!

This is an example of what Martin Farach-Colton coined and called as an exigency: it is an emergency exogenously thrust upon us by an act of others, interrupting our life that is going on in its normal course; as a result, we are forced to take an action, though we did not really initiate the event. Alas.

Saturday, May 05, 2007

Data Stream Theory & Summer School

One of the areas where theoretical computer science has been successful is in introducing the data stream model for processing massive data sets, developing key algorithms, and connecting it to fundamental problems in communication complexity, embeddings and others. The basic theory arises from:
Since then, the model and algorithms have found tremendous success with applications in databases and networking; AMS got the Godel prize; basic problems now have tight bounds (lower bounds eg by Woodruff, upper bounds eg by Ganguly). The area is thriving with new directions (Graph, geometric streams and recent works by Guha and others that connects to problems in Statistics).

It is now good to see a summer school on data stream theory. This is organized by the MADALGO center in Aarhus, Denmark, and the syllabus looks very well balanced between algorithmic and lower bound techniques (as well as Univ and Industry!). Wish I could be there to learn!

Algorithms, Inference and Statistical Physics (Santa Fe)

How can I refuse to go to a meeting as broad as Algo, Inference and Statistical Physics? It was held recently at Santa Fe. Wonderful crowd of several coding/information/communication theorists, some physicists, and a handful of algorithmii (Dimitris Achlioptas, Asish Goel). David Tse gave a nice talk on how much information can be routed in a sensor network (his result improves on the seminal work by Gupta and Kumar), more below. There were talks on message passing algorithms, the relationship between Belief Propagation and LP, and others including network coding, and (:-) may be) even physics.

Here is a technical issue. The Bernoulli random graph is G(n,p) where each edge in an n-node graph appears independently with probability p. The geometric random graph is G(n,r) where n nodes are randomly placed on Euclidean plane and any two nodes within distance r have an edge between them. One can analyze G(n,r) much like G(n,p), but the key difference is that the edges are not independent in G(n,r).

While a lot of papers have analyzed properties of G(n,r) recently, for a long time I thought they should really just behave like a grid, and they should be easy to analyze using simple gridding techniques. Gopal and I managed to that for some basic problems in a SODA paper. The capacity result of \sqrt{n} from Gupta, Kumar paper also seems to indicate the grid-like behavior of G(n,r). In a paper that proved thresholds for monotone properties of G(n,r), the authors Goel, Rai and Krishnamchari used red-blue matching, and in a recent talk, Ashish at this workshop formalized some claims about upper and lower bounding G(n,r) by grids for estimating certain properties. So a question:

Can the information capacity results of Gupta and Kumar, or the recent ones by A. Ozgur, O. Leveque and David Tse (OLT), be rephrased as follows: G(n,r) for a particular property (routing,information exchange, whatever), behaves like a well-known graph (mesh, clique or a linked list of some depth)? Studying such property on these well-known graphs is more straightforward. In particular, OLT result that uses recursive multi-transmit, multi-receive routing protocol to transmit O(n^(1-\eps)) bits per pair from n source-destination pairs(Gupta and Kumar showed O(n^{1/2})) is interesting and does it show G(n,r) is like a YYY-type (clique?) graph of some size? This is a speculative question, probably much too vague to think about it.

Networking PC and Co.

I was recently in Berkeley for the SIGCOMM PC meeting. It was very well run by the PC chairs Nina Taft and Anja Feldmann who remained focused, modulated the discussions when they ran far too quickly or stalled, and both challenged and encouraged opinions. Lot of nice submissions, some urged by the clean-slate redesign agenda for networking, some driven by current topics (game theory, data privacy), and others from the standard fare (measurements, protocols). I had to quick-review a paper during the meeting, and did a quicker-review online of almost all the papers that were discussed, but since I had expected to quick-review more papers, I felt like I got off easy.

PC meetings give us a chance to connect with colleagues you never met, only occasionally interacted with, or plain old buddies from the past. So, over the dinner, I broached the topic of how theory and machine learning communities have yet another channel to communicate, ie., blogs and pointed people to the blogs of Lance (Bill), Suresh, Jeff, Luca, David, Scott and many others. Why doesn't the networking community do the same? I have done several of these conversations by now and the typical objections people have to blogs are:
  • I don't even read blogs,
  • I don't have time to do blogs, I am already overwhelmed by emails,
  • I'd rather do other things when I have a free moment, be part of a real life community in my neighborhood, or
  • I am uncomfortable with the bias it gives my voice over others in the community (I belonged to this category about 2 years ago).
I heard a new one this time: networking people may be more tuned to the privacy issues involved!