Tuesday, December 26, 2006

Visiting researcher at Workshop on Algorithms for Data Streams (IIT, Kanpur)

Monday, December 25, 2006

Seats, permutations and (re)arrangements

Seats in airplanes are among the most varied of the commodities. Over years I have developed my own cheat-sheet of questions and demands to weed out the bad seat assignents and garner comfortable ones. Check out seatguru which makes this whole process much more principled. :)

We need something similar for hotel rooms too, where my fading cheat-sheets are not triumphant. Best views?, far from the elevators or the vending machines and the associated noise?, weird layout?, used to be smoking ones?, smelly carpet?, working wifi? ... This will avoid the series of room re-assignments I go through when I check in.

Things you learn during travel

I heard about Google Dance from Tilak Raj, a student in India, which is an event structured around the Google index crawl. On the web, there is a picture history of Google Dance from 2002 -- 2005, and a lot of third party companies that provide tools to monitor Google Dance, and outside interest. I don't think there is actual dancing involved. :)

Sunday, December 24, 2006

Compressed Sensing in Media, Contd.

Barry Cipra and SIAM News join the line, discussing Compressed Sensing. Unlike the Economist and the Guardian, this article in the November issue is somewhat technical and discusses the idea of going from l_2 and l_0 in sparse approximations to l_1 where linear programming applies. Alas, perhaps because the focus is on l_1, the article leaves out much of the work in theoretical CS that uses direct combinatorial ie., l_0 techniques to solve these problems. Work in TCS that predates Compressed Sensing include sampling for fourier reconstruction in learning theory (Mansour), complexity theory (Akavia, Goldwasser, Safra) and streaming (Gilbert, Guha, Indyk, M, Strauss). The articles has a great quote from Bob Dylan.

Saturday, December 23, 2006

Pens in India

There are many uses for pens in India beyond writing (I am reminded of a friend of a friend who found 100's of uses for hairdryers). As participants at the recent workshop on data streams at IIT Kanpur discovered, one recommended use is to thrust the pen into the Ground in an electric outlet, thereby opening up the twin holes to plug in your device. This is a child-proof mechanism.

Notes from India

- I watched a painter on TV who used his blood to create paintings (of gods? of freedom fighters who had shed blood for Indian independence?) enroute to an entry in the Guinness Book of Records (?). Sigh. He was using an old-fashioned shaving blade to serrate (verb) his arm in a controlled flow of blood from his body to the brush and beyond, on to the paper.

- Stray dogs have vicious faces and maligned bodies, and on close inspection, are unique in the scars and persona they carry. I recently checked in to a hotel that gave me a standalone, rotunda of a room, with a flat roof. Before entering the room, I noticed a stray dog on the roof and spent the night on a bed listening to the pacing dog. I was glad at least for a night the dog was kept off the streets.

- Some of the reasons why people ask to change seats with me when I am traveling alone and have just settled into extra-leg room, aisle seat on an airplane: "I am 6 ft tall and can't fit my legs in this space". "I am claustrophobic."

Friday, December 22, 2006


When I was a boy, we used to chase trucks bulging with the harvest of sugarcanes, lumbering their way to sugar refineries. We would hop on to the trucks, break off pieces of the cane, eat a few and trade the rest or simply share. What we needed was limited, so the whole system was not terribly abused. (The coal that fueled Indian railways used to be shared with villagers on its route who appear with their sacks along the slow-moving cars, a practice that may still exist.) I was reminded of that on my recent trip to India when I saw efforts to grow the Ethanol industry that relies on sugarcanes. I hope those trucks keep lumbering to one refinery or the other, whichever, and keep the local boys happy with their tiny feats.

Sampling, Sketching or One Following the Other

I was at the Workshop on Algorithms for Data Streams at IIT Kanpur the past few days. A question that was posed: we can use sketches (projection of the input signal along random directions) to approximate different functions on streams, but in applications such as IP traffic analysis, one can not afford to do even the work for sketching on each new item; so, we can only subsample from the stream first before applying the sketching solutions. How well does this procedure work?

I have a way to formalize this question as a theory problem. Consider each item and ask: should I skip this item, or should I sketch it? If this can be decided quickly (ie., o(sketching time)), and we can still guarantee approximation of desired functions, we have a win. The results I have with my coauthors Supratik Bhattacharyya, Andre Madeira and Tao Ye are that this is possible for many functions. There are two tricks:

  • you have to use the norm of the skipped part to make this decision, and
  • you need to estimate the norm of the skipped portion quickly ie in o(sketching time) per item. this is somewhat circular?

The approach provides a way to actually design nontrivial algorithms and prove theorems about approximations. For example, we get a factor 4+ eps approximation for estimating F_2 and get constant factor speedup over best known F_2 estimation on streams (which gets 1+eps approximation). Improvements to this will be cool. Our paper has other results, theoretical and applied. In the applied world, one of the biggest win we get is in finding heavy-hitters within contents of IP packets, where this method allows us to skip entire packets without processing them. Better understanding of this can help tremendously where people have to resort to specialized hardware to solve this problem.

Disclaimer: I have some randomized skipping ideas that gives better results than ones in this paper for some cases, but more on that later.

Thursday, December 14, 2006

Continuous Communication Complexity

Alice and Bob are seeing a continuously changing set of numbers. At time t, set Alice has is A_t and set Bob has is B_t. A_t and B_t each change by addition or removal of an element during each time step. They communicate with Carole independently (one way or two way). Carole must report |A_t union B_t|, at all times t. The goal is to minimize the total number of bits sent between (Alice,Bob) and (Carole). Is there a good theory one can develop?

Notes: One should allow Carole to approximate |A_t \union B_t|. This seems related to Slepian-Wolf where A and B independently communicate their individual streams to C and the goal is to make use of mutual information in their streams. Things get tricky when we allow A, B or C use models to predict how A_t, B_t vary over time t.

Refs: There has been a few papers recently in the database community that give upper bounds on the number of bits communicated for different problems. It is called distributed, continuous streaming. There is a survey and a tutorial by Graham Cormode and Minos Garofalakis. What we need is a good theory and a semblance of lower bounds.

Probabilistic Data Streams

There is a new model of probabilistic data streams by Jayram, Kale and Vee: Each item on the data steam is an independent probability distribution, specified say with a small pdf/cdf. Such a stream specifies a probability distribution over the space of all deterministic streams. Now, one can ask about the expected value of various aggregates on such streams and try to estimate them in polylog space and polylog time per new item, much as in standard data stream models.

This is a nice abstract model to study a bunch of problems that arise: sometimes input data stream is probabilistic, some times the data stream is probabilistic because the sensors that do the measurement introduce error and noise, some times the input stream is deterministic but we derive stream from the input by say tagging each item with a bunch of values it may possibly represent, and the result is a probabilistic stream.

A fundamental problem here is, are there any issues beyond just instantiating a bunch of deterministic streams by instantiating each item of the data stream with the given pdf? Other problems include: can we take into account some concentration property of the individual pdfs to get some smooth complexity for problems that goes from deterministic streams to probabilistic streams? is there a suitable model of smooth pdfs for say geometric data streams with interesting algorithmic problems?

Sunday, December 10, 2006

Ranking CS Conferences.

We are constantly trying to rank things. Mostly, we understand that there is a partial order, a tournament, or something more, but the world often seeks a simple total order. I saw a ranking of CS conferences: citeseer (2003?) based on normalized citations; cs conf ranking (every quarter, warning lights, uses quality of referee reports (?) and funding support for travel to confs for students (??)); the curated list at alberta (no Theory confs! ACM/IEEE confs make Tier 1, European confs make Tier 2); another curated list, this with Theory confs;...

Amuse: Saw a reference to *Googol* Page Rank system in the best practices memo at the CRA site.

Saturday, December 09, 2006

Sporting, Sporty, Sports.

When I was a little boy, I used to sprint barefoot on the black mud that was all around me, that spewed out coal chunks and fed the entire region. I was a more of a strategist than a (native) runner. It took a long time to scrub the coal off my skin. Much later, I used to run around on the red mud that was baked by the afternoon Sun and cooled off by the hardworking "ball boys" as I chased the yellow tennis balls that threatened to disappear not only in the blur, but also in the setting Sun that turned everything golden. By then, there was a sneaker between my feet and the fertile land, and I could rely on the technology of a tightly-strung racket to be sporty. These days, I have several floors of concrete and metal as I run on the treadmill, above the city streets. I set the technology to provide me the slopes and pamper me with music and visuals, but still have to rely on strategy to space my breathing and beat the next-door tread-miller. Because I run at the NYSC on Sheridan Square and the well-bodied come there to train and subtly compete with the others. Subtle competition is everywhere, in sports, office work and in research.

Saturday AM

November is a bad month, with deadlines for SIGMOD, STOC, PODS, WWW, EC and other conferences, straddling Thanksgiving and spilling over to December. I finished the last of these deadlines yesterday. I am in

... Pack up the moon and dismantle the sun;
Pour away the ocean and sweep up the wood ....

mood, wrapping up the year. Alas, travel, reimbursements, holiday shopping and other chores loom. But for now I can take an AM off, read, lounge or be silly. I managed to read all of an article in the New Yorker on the growing community of Somalis in Lewiston, Maine, a fascinating account of somalis finding a new life ("secondary migration" in technical parlance, the primary migration is when the federal govertment settles them, in Atlanta, in this case, and then the population flows elsewhere), discovering the lines that separate them from the local people and rediscovering the lines that separate them among themselves, along their clan lines and away from the Bantus with their legacy of slavery.

... I thought that "this break" would last forever: I was wrong

I must get back to work and life I suppose. Just one more pointer: Shouts and Murmurs, in NYer, discusses Excessive Article Disorder, the tendency of many (me too) to put "the" wherever. the dance, the work, the Google, ...

Sunday, December 03, 2006

Lost in Movies

When I was growing up, the only entertainment was movies. Much like the postwar world elsewhere, I was in love with going to watch movies at the theater, and some days watched upto three different shows. Late nite movies were my favorite. But I am all grown up now and it has been several years since I watched movies without a break of multiple days between them. And a long time since I watched a late, late, movie. Last nite, I managed to snatch a moment from the past when I went to see the midnite show of The Lost Highway by David Lynch. Like most people in the audience at the IFC theater, I had seen the movie before of course. We were there to relive the magic moments of madness in the movie, and to immerse in the the Lou Reed-David Bowie-Rammstein-Marilyn Manson soundtrack, sweeter than wine, softer than the summer's night. The movie is a fugue, an enigma, and the web has many sites interpreting it for you. But it is best seen, sitting in a movie theater with loud speakers.
What's your axe?

Tenor... Tenor saxophone. Do you...

(shakes his head)
Tone deaf.

Saturday, December 02, 2006

Compressed Sensing on Media Contd.

More media coverage for Compressed Sensing, in The Guardian this time. The previous one was in The Economist. Application to digital cameras: a single-pixel camera? Cool.

Santa and Beers

Thanks to the NY State and the enterprising Shelton beers, we can now have blog posts about Santa, beers and pretty pictures. Check out the pictures.