Wednesday, September 28, 2011

Quick NY Trip

I had the guilty pleasure of a 2 days trip to NY city. My many thanks to the students who schlepped up to the city to work with me. Also, I walked into my favorite bagel place, and the owner says, "havent seen you a few days", and gives me my coffee -- 8 Oz, black, no sugar -- and my dollar back.

Managed to watch an "oldie" (of 1 year): cave of forgotten dreams, a 3D look into the Chauvet cave, charcoal cave paintings from 20k+ years ago by humans. The drawings are awesome, but the voice-over and the focus of the director and the film to dramatize the sociological aspects is really distracting. I was mostly focused on the technical aspects: was it one artist or many? different styles and strokes? any "erasure" and "modification" marks, or paintings the artist wanted to abandon or paintover? why didnt they paint plants and people, flowers and natural scenery, but only overwhelmingly horses and bison? etc.

Being in NY is about seeing things new. I watched the California is a place video about skateboarding in pools of foreclosed homes, cooawl!

Saturday, September 24, 2011

War of References

My post on Biases in References generated email and real conversations and one phenomenon that got abstracted:
  • Authors submit their paper q to a conf citing a paper p very prominently giving credit for prior work;
  • q gets accepted, authors submit a camera ready version with lukewarm reference to p making it sound contemporaneous;
  • After a few months/years, q is submitted, accepted and appears in the journal version, references to p now missing or p made to appear irrelevant.
The scenario above may be an extreme, but there seems to the case of citations that morph in tone, substance or even existence from submission to camera-ready to journal appearance of a paper.


Video Hangout Update

My experiment of being on Google+ video hangout with the world for an hour each week is on. We have discussed:
  • Algorithms for learning a matching in graphs.
  • Algorithms for top k under funky similarity measures.
  • Algorithmic issues in H-index and its variants.
  • Is there any dataset in Computational Biology that needs a MapReduce system?
  • Stochastic histogram approximations.
What hasnt worked so far:
  • People seem to invite me on my regular gmail and not the one I set up for this purpose: (Instructions: invite into your circle on your Google+ or email me, I will add you to my video hangout circle; then, at the suitable time, go to our Google+, you will see me hanging out, join the hangout!)
  • Hangout time does not work for some, so people have been reaching out to me out of band.
Here are new parameters.
  • No hangout this Wednesday Sept 28, next hangout will be Oct 5.
  • From now on, hangouts will be Wednesdays 8--9 CA time, 11--12 NY Time.
  • I am considering the idea of having hangouts on specific topics in the future: MapReduce algorithms, Compressed Sensing and sparse approximation algorithms, Differental Privacy and its Extensions, Revenue optimal mechanisms, Ad Exchanges: new problems, etc, Any feedback will be highly appreciated.


BWCA Notes

Here are very rough notes on Day 1. Tim tells me videos will be posted soon.

Tim started the workshop stating that Stanford Univ was doing a focused year of activity on theory of CS, and mentioned there will be workshops on social network algorithms and expanders in the future.

Avrim spoke about the premise that the objective we optimize in problems is usually a proxy for some underlying target. Assume: all c-approx are \eps close to desired target (eg., clustering to 1+\eps gives good grouping of proteins by functions). An example result [Blum, Balcan, Gupta 99] is then for c>1, can get O(\eps) close to target. He posed several open problems to attack this way: sparsest cut (focus on application to image processing, segment/partition. What if we assume that 10 approx has error \eps for image segmentation?), phylogenetic trees, other assumptions? (eg., most c-approx are close to the target), for "easy" problems, given arbitrary instance, find stable "portions" of solution. Eg., Bilu-Linial perturbation-stablity notion. Some of the questions: q1: as they say, clustering is either easy or not interesting. comments? A: key is difference between error and opt metric, not clustering opt per se. Alex q: are there smooth versions? like as property goes from structure to arbitrary, runing time automatically adjusts from quick to fast? A: Maybe look at PTAS examples? q: Is approx-stabilty a rare property? A meta theory? Therefore we should not use approx at all. A: Clustering does NOT have to be hard, so a formulation that makes it easy is not immediately challenging to the establishment. Tim: Any hardness results? So, the assumption is not too strong? Avrim is a researcher can drive the car with two gas pedals, one for math/CS techniques, one for formulation/abstraction/philosophy and can put the foot down on both or either as needed.

Nicole Immorlica spoke about PASS approximations, through the example of facility location. She made a clear argument that you wanted to consider variations that assume some property of the solution (not the instance). Cited the Kleinberg, Papadimitriou and Raghavan paper on catalog segmentation as an early example. She then described the PASS Framework: take a signature of the solution: for each facility, q_i, alpha_i, consider the recoverable value as a function of the signature, and do analysis in these terms. Greedy/LP gives additive error guarantees per facility, and provides guidance on choice of heuristics. Nicole is a clear speaker, often using simple examples to make the point.

Nina Balcan spoke about a very nice area, developing a discriminative theory of semi-supervised learning. Unlabeled data helps in many ways (self consistency, throw out hypotheses etc). Propose a new model. concept class + compatibility between the concept and data distribution. learn not C, but C under compatibility. Gives nice algorithms: First consider unlabeled data, choose a rep from a set of compatible hypotheses, then use labeled example to choose among these. How much unlabeled data is needed, how much can unlabeled data reduce number of labeled data needed? Pointed to Zhu survey 2010. Kakade, COLT 2008. q: Does this theory extent to active learning (where you get to pick questions)? Yes. Need to combine with compatiblity constraints. q: any killer applications? q: anything to choose from transitive SVM and … for any instance. A: run them in parallel and stop when justified.

Spielman spoke after lunch on smoothed complexity. He spoke about the role of the condition number in a phenomenally illustrative way. Theroem: ill conditioned ifff some row is near the span of the others. Eckert+Young: cond no = 1/distance to singular. Another condition no= ||A|| ||A^-1|| where || || is Frobenius. Gaussian elimination without pivoting: entires may not get large when submatrices are not ill-conditioned. matlab demo to show Gaussian elimination sucks. Towards the end of the talk, he spoke about solving set of polynomial equations, solving: n eq , degree d. a point in which Newton converges quadratically. (Smale def) Bezout theorem says many solutions. NP Hard. SAT x_i (1-x_i)=1 is poly clause. Recent result is a poly smoothed complexity. Result sounded amazing, need to follow up. Dan's talk and work makes theory research inspiring, as someone in the audience observed later, and no doubt others felt the same way.

After a break (I fell asleep some during the talks, I was tired, apologies to the speakers), Andrew Goldberg spoke about shortest path problems. The message was that there is science of algorithms: do, implement, observe, test, tweak, realgorithmize, and so on. In a great example of what algorithms research can do when thrown at a basic problem, Andrew gave a series of improvements to shortest path computations in particular, as they applied to road networks and map queries, relating VC Dimension to the complexity (HW: Unique shortest Path system in graphs has VC dimension \leq 2.) In road networks, there are transit edges which are natural and important and cut down on shortest paths to consider, store and compute with. Dijkistra's takes 2s for 18M vertices. Result now << 1s. q: Alex: reproducible experiments, are these reproducible?

Susanne Albers spoke about online competitive analysis. Paging, with resource augmentation. Paging/ k/k-h+1 competitive if OPT has h <= k pages for alg. For list update problem, proposed and analyzed a new locality model based on runs. q: Aranyak: What benchmark for running scheduling expos? A: Hard. sometimes need deadlines for energy efficiency. Tim: How many of these online algorithms are ideas that have usefulness elsewhere? online ad allocation will be an example.
Altogther, I thought the quality of the talks was very high on the first day. I could not follow the workshop with as much attention as it deserved on the second and third days (it was vaguely unsettling to hear Richard Karp talk using ppt slides. :))


Sunday, September 18, 2011

BWCA at Stanford

Beyond worst case analysis workshop starts tomorrow. Tim and Luca have put together a superlative program. I hope to learn as well as meet friends from afar, in time and space.


Life after a Tweet

A tweet is short (and I am told, in china too, its substitute like Sina Weibo has a 140 characters upper bound! :)) and real time. What interests is life after the tweet.
  • When US went into Pakistan, US forces had practiced, had plenty of firepower and stealth, backup fighting force and eyes on the sky, I am sure. As we all now know, there was someone tweeting about the choppers even as the operation was underway, raising a severe security issue. Should US have considered shutting down twitter for an hour as a standard operational procedure before going in?
  • Then I heard the story of this woman on the radio who was unhappy with the service at the bar and tweeted. A few minutes later, the manager comes to the bar and escorts her out! Life after the tweet is also swift.

Friday, September 16, 2011

NYCE from Far

NYCE in NY did take place and thanks to Gagan Goel, I managed to get the gestalt of being there (although alas, I had to miss the afternoon talks).

Research Real Estate

Here is the view from our own Nitish Korula's office at Google Research, NY.

Wednesday, September 14, 2011


Thanks to the organizers Gagan, Mallesh, Richard, Sharad and Xi, NY area CS and Economics (NYCE) day is this Friday. The organizers have done a splendid job of recruiting a new space: NYU's Kimmel Center in the village.

There is a great set of speakers : Vincent Conitzer, Jonathan Levin, Preston McAfee, David Parkes, and Eva Tardos. The call for rump session talks/submissions has been vastly oversubscribed, so I expect a large participation. Alas, I wont be in NY this time physically.


Tuesday, September 13, 2011

Biases in References

There is a lot of work on bibliographics. Is there a good study of the biases in references in research papers?
There are many, many ways to subtly over or under represent prior/earliest work. One might
  • not cite a prior work where it is needed, or
  • cite a later work while an earlier work is more appropriate, or
  • cite all relevant work but weight improperly by say placing a later work at the same level as the earliest work, by repeating references to a later work more, or slant natural language discussion towards the later work, and so on.
This may be done for
  • selfish reasons (self-citations),
  • social reasons (citations to students, friends, colleagues or ones we trust or know), or
  • antagonistic reasons (jealousy towards a perceived peer, animosity towards someone for a perceived past slight, competition with another group, and so on).
I see a lot of work on bias in the scientific opinion of an author, or bias in selection and so on, but very little on biases in references or systematic ways to analyze and quantify them.


Sunday, September 04, 2011

Algorithmicist in the Field: Google+ Weekly

Labor day weekend is a good landmark, interns are gone and stalled projects in research labs will restart; students are back at universities and new semester will begin; summer will end and fall will burst out; it is time for last hikes and new beginnings, etc.

I will be on video hangout on Google+ every week, Wednesdays 9 AM -- 10 AM EDT (NY Time). Anyone can join me on the hangout. The idea is, we can talk algorithms in theory, systems, field or otherwise; we can talk problems, solutions, patents, techniques or angst about the state of the world; we can look ahead and seek directions or look behind and summarize; we can talk courses, careers, conferences or about the cult of research; we can be students, professors, researchers, and be in any part of the world.

What you need to do, if you want to join me in my hangout:
  • Have a google+ account. If you dont have one, get an invite or email me at
  • Go to Google+, add me (muthubyshance) to your circle and join my hangout anytime during the period specified above. If you cant find me or have problems, email me at You will also get updates from me on google+ if I am in one of your circles.
This is obviously an experiment, but for some time, I wanted a channel to have real time conversation about our issues in addition to the existing channels of the single speaker on the pulpit mode (blogs/lectures) or the post and seek mode (of or hunt and find the proper subset of people mode (of live conferences).

ps: This will start on Wed, Sept 7.


Two Algorithmic Problems

Sometimes, an applied problem brings up a technical problem, simple or not, and it is fun. Here are two examples (from social networks):
  • Given an undirected graph G, output a set S of vertices (the neighbor-dominating set) so that each edge should have at least one common neighbor in S if the endpoints are not in S. This sounds related to vertex cover, dominating set etc., and there is a simple log approximation possible. The problem arose in an interesting proposal for decentralized social networks (as opposed to centralized ones like facebook) by my colleagues Lu Han, Liviu Iftode and Badri Nath of Rutgers CS and the paper will appear in the upcoming Social Computing 2011 conference.
  • Given a directed graph G, determine a mapping r from vertices to set of integers such that sum_{(u,v)\in E} max(r(u) − r(v) + 1,0) is minimized. The interpretation is as follows (and arose due to the work of my collagues Mangesh Gupte, Liviu Iftode and Pravin Shankar of Rutgers CS and REU visitor Jing Li): In social networks, where nodes are aware of their ranks, higher rank nodes are less likely to connect to lower rank nodes. Hence, directed edges that go from lower rank nodes to higher rank nodes are more prevalent than edges that go in the other direction. In particular, if r(u) < r(v) then, edge u → v is expected and does not cause any “agony” to u. However, if r(u) ≥ r(v), then edge u → v causes agony to the user u and the amount of agony depends on the difference between their ranks. We posit that the agony caused to u by each such reverse edge is equal to r(u) − r(v) + 1 (+1 is for technical reasons). Then the problem of finding best social hierarchy can be stated as above. Unlike a related way to formalize this problem as feedback arc set problem, this problem is solvable in polynomial time as we show in the paper that appears in WWW 2011.


Saturday, September 03, 2011

VLDB 2011 Notes

I managed to make it to VLDB at Seattle. The travel this time --- from SFO to SEA --- was a treat, with a taxi ride with Raghu, discussing what each of us were looking forward to learn at VLDB. On Tuesday, I got there in time for the receptions, the first hosted by Microsoft in the evening, and the second, hosted by Google, later. The party continued later, later. I stayed at Mayflower Park, an old fashioned hotel with, as is standard, grand lobby, a highly helpful bartender at the bar in the corner, staff in glittering buttons, and small rooms.

On Wednesday, David Campbell gave a plenary talk. He said that there was data ambient, models and modeling methods rare, and one could be very creative with questions. He used the example of ``digital shoebox'', dropping in all the data one could collect say with their smartphone, and interesting analyses you could do with the shoebox ("can you figure out how much time you spend in Florida in a year for tax purpose."). One in the audience asked, "what about digital shipping container?", that is, when you get data from many, will methods scale? He also spoke about the lifecycle of a query, how to validate data, hypotheses, etc. Audience asked about DB concerns, what happens when models change, patterns mature, how to admin the DB.

I attended the panel on maximizing impact with Ed Lazowska, David DeWitt, Juliana Freire, Ed Lazowska, Sam Madden, and Jennifer Widom.
  • Ed set up the discussion with pointing to the article of Bob Lucky about the diminishing influence of electrical engineers, and the iconic last engineer on the planet. Is that where database engineers are, even though this is the decade of data? He also pointed the audience to kaggle for data analysis competitions.
  • Sam did the crowdpleasing exercise of wordling VLDB paper titles for the past three decades and no one resisted the temptation to draw conclusions. Then he did a crowdsourcing exercise of (a) getting list of top 10 impactful ideas/achievements in database research in the past 10 years, and (b) top 10 new ideas for the future. Stream processing made top 4 in (a), yesss! In (b), there were several suggestions, including "data crumbs" by Alon Halevy, of anlayzing data we leave behind during our trail on the web.
  • Jennifer then led the discussion, focusing more on education and visibility. Why is AI or machine learning more popular than databases research among incoming student in this decade of data? The explanation was technical (students dont know real world and dont see the need for DB until much later) to marketing (students see driver-less car as AI, that is great image; in contrast, as Natasha put it, the DB product feels like a screwdriver or laptop, not necessarily fun), self-reflecting (shouldnt we take responsibility to go from data analysis ideas to eventual studies and code in an applied area) and educational (better examples than the employee/student example in text books). Joe Hellerstein, who has the incredible ability to capture and keep the intellectual attention of the crowd, kept the conversation focused on what DB community has to offer: getting inspiring ideas from other communities, doing DB magic of declarative languages, parallelization tools, etc. He also called for a simple language to teach DB basics and early programming in undergraduate days.
  • David steered the conversation towards the positives (DB students are finding jobs, startups companies are making deals, 50B database community is humming, etc) and lamented the challenges of having impact in practice, even in Industry.
The panel made sure the conversation did not turn to remaking conferences, and the panel dissolved into multiple conversations in the hallways. Should we be following developments at NoSQL conference?

Had terrific conversations with Amr, Divy, Hector, Laks, Joe H, and Surajit about research, and with Amol and Jignesh, about challenges of mentoring students.


Thursday, September 01, 2011

100k+ Theory Students in the class?

Stanford is running an experiment with free online courses in AI (by Peter Norvig and Sebastian Thrun) with 100k+ students, databases (by Jennifer Widom) with 40k+ students and machine learning (by Andrew Ng) with 40k+ students worldwide. Where is the free online course on Algorithms and Theory? These courses will have homeworks, office hours and grades.