Sunday, April 30, 2006

Artist in us.

Senegal has given us Youssou N'Dour and others. A senegalese painter, Ibou Ndoye, is getting some modern attention. Check out his work at enfuse. He paints african faces and masks using bright oil colors on glass. He says he "soclialized with art and cohabited with colors" while growing up. Many of us are artists inside, honing our skills and our feel everyday, while on a mad commute, in noisy streets or during quiet moments. The sublime artistic sense grows and evolves even as we just live our lives.

Interesting String Problem

A student recently gave me the gift of a book by Dennis Shasha on puzzles. Here is an algorithmic version of a puzzle in that book.

A string S of characters is interesting if for all pairs of characters x and y in the alphabet, the distances between their occurrences in S do not repeat, that is, S does not have S[i]=S[j]=x and S[k]=S[l]=y such that k-i=l-j, for i < j < k < l. AAB is interesting but ACBADB is not since (A,B) appear separated by 2 places twice. How long does it take to check if a given string S is interesting?

Hair Braiding

I got my hair braided today to put a few weeks of illness (hence my silence) behind and start May with style. Getting hair braided is a tiring experience. First, one has to walk in to the braiders and negotiate a price while they speak in unknown african languages mixed with some french (later, you find out that they can speak english quite well). Second, one has to buy the hair and accessories which exposes one to the perfect marriage of chinese manufacturing and retailing conveniently located next to the african human workforce. Then, one has to sit for several hours, a research paper clutched in hand, but drifting between the slumber induced by having one's hair pulled tight and the tension of guttural voices of the braiders blasting in your ears in senegalese/ghenese/ french and occasional english. There is a factory mood with several chairs laid out in two rows, each with a seated victim and hovering braiders. The victim can hardly see the mirror in the front because there is a heap of papers, handbags, hair and accessories piled in front of it. So, the victim and his hair is deeper under the control of the braiders than the language barrier would indicate. As the victim suffers, culture (50 cents on the radio), market (men pop in and out carrying CDs, socks, handbags and other merchandise for sale), and society (men come in and flirt with the young braiders or husbands come in sheepishly and get cash from the older braiders and the children fly in and out) flows. It is a feeling of total helplessness. But when ultimately one perseveres, triumphs, pays and leaves, one is rewarded with the check-that-out look of the kids in the street, and strangers act deferentially in subways and stores. Nirvana. Yes, I now have a new look. Black and red beads, short braids, goatee, but no moustache.

Sunday, April 16, 2006

Timeless in Manhattan

The Manhattan Theater Source on Washington Square West is an incredible resource for playwrights, actors and the village dwellers who want to rub some of that onto themselves or simply watch great productions for less than $20. They recently did a production called Spontaneous Combustion that is billed as "We write, rehearse, tech and perform an evening of 4-8 minute plays in 48 hours". What fun. There is the more established 48-hour film project which is a terrific cult with some astoundingly layered film products. Perhaps because I am an algorithmicist or because I can wire myself into a high energy state for short periods of time, projects made within time constraints appeal to me.

Matrix Approximation

I went through a phase in college when I had a preference for looking at problems through the linear-algebra lens. In the past two years, I have been looking at data that has a nice matrix form where rows are subjects and columns are attributes. The matrix is large and I need a summary in order to understand it. The typical way to do this is to get a low-rank approximation which will give a few "directions" for an analyst to peer at, but unfortunately, these directions are linear combinations of the column (or row) space and typically, it is hard to understand intuitively what they represent. It will be incredibly useful if one could find original columns or rows that summarized the matrix.

In the past few months, Petros Drineas, Michael Mahoney and I have worked on formulating this problem and solving it. Given matrix A, we are able to quickly find poly-k columns C (or rows or both) such that a matrix A_C obtained from these columns is a (1+\eps) approximation to best k-rank matrix A_k for A. Formally, ||A-A_C|| <= (1+\eps)||A-A_k||. The columns C then is a useful set of attributes for an analyst to understand and interpret. The technical result is here from November 2005. ||.|| is the Frobenius norm.

There is a somewhat related problem in theory community: how to approximate A_k? The A_k that minimizes ||A-A_k|| is given immediately by SVD. In theory, Alan Frieze, Ravi Kannan and Santosh Vempala asked if this could be approximated in a few passes. Since then, a lot of published work has shown how to get a matrix representation A' of A such that ||A-A'||<= ||A-A_k|| +\eps ||A||. This large additive error is discouraging and the question that remained open was, are there relative error approximations in few passes or in time less than it takes to solve the SVD? My result with Petros and Michael above provides such a matrix representation, but takes time to find largest k exact or approximate left-singular values, which can be done in O(k) passes using methods in Numerical Analysis. I mentioned my result above to Sariel Har-Peled at FSTTCS 2006 in India; at that time, he promised he can produce an A' such that ||A-A'||<= (1+\eps) ||A-A_k||. He delivered: his writeup is here. This does not seem to address the problem Petros, Michael and I worked on since it does not output original columns or rows, but produces linear combinations. Quite recently, Amit Deshpande and Santosh Vempala obtained a similar result but focused on using a small number -- O(klog k) -- passes, in some sense culminating a long line of research on low rank approximation using a small number of passes. It seems we can derive another solution to our problem from their result.

Summary: We are in a better place than we were a year ago when only additive approximations were known for low-rank matrix summaries. Both problems now have relative error approximations in near-linear time. And the fun part is, each of these results use a different technique. Bonanza! Those who are interested should look at these papers since there are many details.

Saturday, April 08, 2006

Music in Manhattan

I am tone and lyric deaf. I do not resonate with music in general. When I went for a walk last night by the White Street, I discovered the Archives of Contemporary Music at 54 White Street New York City, 10013 tel:212-226-6967, curiously open. It is a non-profit organization (hence, homepage http://www.arcmusic.org) that I found out carries more than 2 million recordings. It was great being in a room full of LPs in alluring jackets. Enough to get me started on music. Btw, Youssou N'Dour, one of my faves, is on the board of advisors to the ARC.

Friday, April 07, 2006

Ecology and Fiction

The by-now famous Life of Pi is the story of an Indian boy Pi and his survival from the shipwreck, alone on a boat with Richard Parker, a Bengal tiger, for company. This book has strange book reviews. Btw, your journey is not complete if you don't finish Pi off by reading Max and the Cats or vice versa. During the long journey, Pi and Richard Parker find a floating island of greenery that replenishes Pi with fresh water and Richard Parker with meerkats that he swats and kills for pleasure, rediscovering the primordial urge after days at sea. The main point however is that the greenery is a carnivore that generates killer acid in the night, consumes meerkats and humans, and spews out bones during the day. This episode in the book is a beautiful blend of ecology, magic and imagery.

Ecology is also at work in Amitav Ghosh's The Hungry Tide. This is a book set in the Sunderbans, a vast tidal area where the river Ganges terminates. The book captures the mood of living in a twilight zone where small islands disappear twice a day, there are river dolphins and sharks, and people ponder if the currents formed by Ganges and Brahmaputra as they plow into ocean do meet under the ocean and form a loop comprising a "river" that would then be longer than the Amazon (therefore becoming the second longest river in the world?). The days are languid with no focal point and the storms are destructive. In an incredible episode in the book, the main characters try to survive the eye of the storm passing through them.

I have been increasingly coming across fiction that embraces ecology.

Being in-place

In 1988, during an evocative summer a long time ago, I slaved away shaving-off bits needed to store a finite state machine (FSM) for a compiler I did for a company in the distantly alluring city of Pune, India. The FSM had to fit into a floppy or something and I still needed to index transitions out of each state quickly. Don't ask me why. I was an undergrad and I could easily get into solving a technical problem elaborately without wondering if I could get around it. But since then, once in a while, I come across a problem that directly or indirectly connects with that summer chore long ago. Here are two questions which in some twisted way reminded me of that past. (in-place means using no more than O(1) extra words.)

P1: Is there an O(n) time in-place radix sort? [I have a solution, but would have thought this is standard.]

P2: (This is a research problem.) Given a string S[1,n] in which each S[i] is a log n bit integer, design an efficient Lempel-Ziv like algorithm that works in-place.

Art of formulating a problem

I have not posted for a few days, I have been sick, mostly.

There is an art to formulating algorithmic problems from applied areas. For example, an applied problem is the following. We have a set of edges over time showing the communication between nodes. Intuitively, we would like to determine which group of nodes form a "group" or a "community".

One way to formalize it is to collapse the multiple edges between pairs of nodes into single edges, and find subset S of nodes that communicate with each other a lot (more than expected, or with others outside the set). This formulation quickly leads one to the k-subgraph problem which is hard to solve. There is a nice note on the complexity of finding such a community and its variants by Ara Hayrapetyan, David Kempe, Martin Pál and Zoya Svitkina.

An alternate notion of a community is found in a note by J. Baumes, M. Goldberg, M. Magdon-Ismail and W. Wallace, where the time is segmented into periods, and in each period i, a graph of communication G_i is constructed by collapsing multi-edges. Then the problem is posed as one of finding a subset of nodes that is connected in each of these graphs. The resulting problem, at least in its simplest form, is quite easy to solve. I like this formulation of the problem.

[Unfortunately, for every easy formulation, there is a hard variant.]