Monday, December 08, 2014

On Urban Planning and Story Telling

When I was in University (it doesn't matter which prefecture), I enrolled in a class on Urban Planning. I can guess many reasons why I might have done that: I was 20 years old, and I focused more on easy grades and good-looking fellow students than on learning. Whatever the reason, I didn't make it to class all semester except once. That day, I was drinking my tea in the student center as usual, reading the Monkey King, and pondering how I kept forgetting the monkey's name. I happened to talk to a student; she was easy on the eyes, our conversation flowed, and before I knew it, I had accompanied her to her class, which coincidentally turned out to be Urban Planning. Her father was a government official in charge of the local Dept of Buildings, and she really cared about Urban Planning. I don't know why, but to this day I remember what happened in that class. The professor taught us about zoning (how buildings have to be set back a fixed amount from the street) and water runoff (how to build catch basins and French drains to capture runoff from neighbors).

I told this to my friend Haruki in college, and he later told me he wrote a short story about it. I didn't think I had much of a story, but I read Haruki's and, you know, he is a real writer. He can imagine things I can't even contemplate; his story was creative and went places my mind couldn't be dragged. In the end, it was not my story at all; it could only have come out of Haruki's mind.

But my story continues. Years later, I bought a place that needed a lot of work. I could easily build my own fence because I knew what the setback was, and I built a catch basin too and watched the runoff from my neighbor's property.

Sunday, December 07, 2014

Data Science and Online Ads: Panel in NYCE 2014

Thanks to Arash Asadpour, Mohammad Hossein Bateni and Alex Slivkins for organizing the 2014 NY Area CS and Econ (NYCE) day in NY. I will let the organizers blog about the day; it was a superb program.

I organized a panel on Data Science in Online Ads. The panelists are stars and represented a constellation of perspectives in this complex ecosystem.

  • Matt Curcio is from Neustar (via Aggregate Knowledge). He builds and supports a neutral data platform for advertisers to gather and analyze ads data. He is remarkably broad, using everything from data streaming to data privacy in this work. He spoke about the challenge of getting data scientists to collaborate, no matter which company they work for.
  • Chris Wiggins is now the Chief Data Scientist at NY Times. He spoke about data products at NYT and estimating Long Term Value (LTV) of users. He also talked about placing house ads as an example of reinforcement learning.
  • Aparna Pappu runs AdX at Google. She focused on AdX and described the goal of fair transfer of value from advertisers to publishers. She mentioned many specific data issues: there are gaps in their data since they don't observe all online events; data viz is hard; there is asymmetry of info since advertisers may know more about users than any specific publisher; AdX cannot share data equally with all parties; and finally, she spoke about the great diversity of data they have, which makes it hard to find natural segmentations of publishers and advertisers.
  • Paul Barford is now the Chief Scientist at Comscore, Inc., following their acquisition of his Mdot. He described his consulting experience at BIM, a publisher network, which led him to the problem of detecting fraudulent clicks. He said that when systems are complex and there is money involved, there are bad actors, i.e., fraud is a problem. Further, he quoted that data science starts with measurement, and it is hard to gather data from ads platforms.
  • Neal Richter is now the CTO at Rubicon Project. He spoke about how ad sales are becoming automated, and said that a challenge in the petabytes of analyses he does, with 200B transactions a day, is to make the analyses and conclusions explainable to others, including biz folks.
  • Catherine Williams is Head of Data Sciences at AppNexus. She described AppNexus as the largest independent (non FB/GOOG) programmatic media co. and not involved with PII. AppNexus has a performance marketplace, which is nice. She spoke about the challenge of suitable incentives for various types of content and stretched us to consider freedom of speech issues when we emphasize one type of content over the others.
  • Claudia Perlich is Chief Scientist at Dstillery. She spoke about the predictive modeling and machine learning challenges in prospecting for ad targets. In particular, she pointed out that it is not as much about predicting if you will buy X as it is about convincing/converting you to buy X. She quipped that from their lat/long data of users, 30% of the US population travels above the speed of sound! She also talked about how not to look at artificial metrics to improve in ad platforms, and the challenges of getting performance data from networks.
  • Jon Krohn is a Data Scientist in the orbit of Omnicom, a large media company. He started with the observation that you need data to spend money well, and went to the board to draw the "river of money" from advertisers to agencies and media companies like his, and eventually to publishers, with $s dwindling along the way as 20--30 hands touch the transaction.
I summarized their presentations. Discussions ensued:
  • Costis Maglaras asked, is the ad market going to be like DJ with small transaction cost or like Christies with XX% cut? Goods in ads are ephemeral, valued differently by different parties and can't be retraded, so it is not clear that financial analogies apply.
  • Vahab Mirrokni asked, is the ad market converging to reservations/allocation or auctions? Catherine mentioned that platforms like AppNexus are supporting many different types of markets from reservations to private packages/deals to RTB and performance. 
  • I asked if a large distributed ML package that searches automatically over models and parameters will suffice for the ad business. No, because information is not complete, players may not be rational, it is not a single-objective optimization, the signal is weak, targets move, etc.
  • I asked why more microeconomic concepts, like substitutable goods, didn't penetrate ad markets. This is because publishers don't think their inventory is substitutable, and there are handcuffs around who owns data and privacy issues, so data permissions don't let this info be usable. Paul Barford said Comscore is an exception when it comes to data, and he was willing to work with academics on data access.
I now have the formula for a great panel: recruit great professionals, let them go, and sit back. I enjoyed the panel immensely. I hope researchers connect with the folks above; there is a lot we can gain. It was good to sneak into the NY academic scene, if only briefly.


Saturday, November 29, 2014

Workshop on Graph Streams (Sandia/DIMACS)

Here are some notes from the workshop, superbly organized by the Sandia team.

  • Workshops are hard to organize, and you have to have a large purpose to do the work. The Sandia team of Bruce Hendrickson, Jon Berry, Cynthia Phillips, and others has a scholarly attitude, which was truly refreshing. There is genuine interest at Sandia, from US Govt IP network (mix of classified, unclassified, specialized) monitoring applications to new theoretical graph stream models, and an empirical approach based on setting up synthetic datasets, benchmark tools and infrastructure systems. I didn't know Livermore has a Sandia Lab, with Kevin Matulef, C. Seshadri and others. This is some nice research horsepower and brainpower for streaming research at Sandia. They had an uber-data context: some stored data, some sampled, some streaming hose, and the question of how to process them all with a combination of multiple machines, cloud, etc. Will wait for Jon to put his slides online, where this model was clearer.
  • Attending a workshop even for a day is a welcome break to think about problems. Here are vague questions. Somebody out there may have something to say (incl. shooting down the problems): (a) Say characters of a string arrive online; produce a uniformly random sample substring. Detail: good for the string seen thus far, represent the substring by an O(1) sized representation of left and right endpoints, ... (b) The contents of a file are sent by breaking them into substrings in IP packets, but substrings are sometimes repeated, and sometimes substrings overlap in arbitrary ways (due to TCP resend). Is there a coding/decoding solution that trades off coding quality against sublinear space reconstruction? (c) Each new stream item is a string. We have to find substrings of each stream item that appear a lot of times thus far. If the length of each string is L, can you avoid doing O(L^2) work per item and/or use space less than exponential in L?
  • Distractions. Cindy said, "that is the last edge that broke the camel's back". Sudipto used the phrase, "the right side of Buddha". Madhav could not attend the workshop because he had to respond to the Ebola threat. 
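
For question (a) above, here is a minimal sketch under the simplest reading, and only that reading: a "substring" is an occurrence, identified by its pair of endpoints, so equal substrings at different positions count separately. The class name `SubstringReservoir` and its interface are my own, for illustration. A prefix of length n has n(n+1)/2 such pairs, of which n end at the newest character, so a reservoir-style rule that switches to a random new pair with probability 2/(n+1) keeps the sample uniform using O(1) space. The harder version, uniform over distinct substrings, is not addressed by this.

```python
import random


class SubstringReservoir:
    """Keeps one uniformly random (left, right) endpoint pair for the stream so far.

    Assumption: a substring is an occurrence, identified by inclusive
    0-based endpoints; equal substrings at different positions count separately.
    """

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()
        self.n = 0          # number of characters seen so far
        self.sample = None  # (left, right), or None before any input

    def add(self, _char):
        # The character itself is not stored; only its position matters.
        self.n += 1
        # A prefix of length n has n(n+1)/2 substrings, n of which end at the
        # newest character, so switch to one of those with probability 2/(n+1).
        if self.rng.random() < 2.0 / (self.n + 1):
            self.sample = (self.rng.randrange(self.n), self.n - 1)


reservoir = SubstringReservoir()
for ch in "streaming":
    reservoir.add(ch)
left, right = reservoir.sample  # valid endpoints for the 9 characters seen
```

Reading `sample` after any call to `add` gives a pair that is uniform over the prefix seen at that point, which is the "good for string seen thus far" requirement.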


Algorithms in the Field (8F)

NSF announces a new funding program for Algorithms in the Field. Deadline is Feb 9, 2015. One of the metrics in the Algorithms community is the ultimate use of our algorithms, "use" being broadly interpreted, and this often needs us to go more than halfway to meet other communities. This program is an opportunity to codify the process some. When one does meet the other communities, almost always it leads to new theories and algorithms, and more than pays for the journey. I hope you will respond.

Here is more info on the workshop we organized 2 years ago. The videos of the talks are here.


Wednesday, November 05, 2014

Cornell CS 50

I like the historical context for things, and in CS, Cornell is a center. Enjoyed reading about the Cornell CS 50th celebration.


Sunday, October 12, 2014

Streaming Extravaganza

ICDT/EDBT is a premier theory-centric database conference pair. In the 2015 version, we will have Invited/Keynote talks by Graham Cormode, Rasmus Pagh, and Nicole Schweikardt, all well known in the streaming/sublinear algorithms world.


Tuesday, October 07, 2014

Meetings of Interest

On the near horizon:
DIMACS and Sandia are running an invite-only Workshop on Streaming Graph Algorithms, Oct 23--24 at Sandia Labs. The program is not public yet but the version I know is exciting. I will be running an open problems session, trying to identify some broad trends and specific problems, within the context of streams of graph edges.

On a slightly longer horizon:
Indian Inst of Sc is running a Workshop on Learning, Algorithms and Complexity, Jan 5--9, at Bangalore. The speakers list is superb. If you are interested in participating, send an email to the organizers. 


Sunday, October 05, 2014

Sciences, Math and Arts

I am curating some art for a performance or display space, art that is hopefully in the intersection of Science/Math/Technology and Art. Please email or respond if you have a pointer. A triptych of videos below: Box (synthesis of real and digital space on moving surfaces), Sparked by Cirque (quadcopters in swirl), an animation about balance (an oldie, I watched it in early 90's).