On How Science Gets Written

Here is some recent research. The starting point is the observation that arXiv stores papers in LaTeX form! So we crawled them and analyzed what can be mined from the LaTeX form (the "source code" of scientific papers) that is difficult or impossible to get from the "finished product" such as PDF or PS (think "comments", for example :)). Beyond this amusement, there is a bigger point: there is a long pipeline of how science is
  • conceived (an apple hits Newton's head or Archimedes runs through the streets of Syracuse: the stuff of science novels),
  • executed (experiments, proofs in cafes: the stuff that movies show with a soundtrack),
  • communicated (written, published, presented) and
  • ultimately brought to bear on the world (citation analysis, analysis of the social networks of researchers: the stuff that holds academia together).
Traditionally, we have had little visibility into large parts of this pipeline; our argument is that now, with emerging large-scale data, we can make the pipeline itself an object of principled study.

We focus on how science is written, an area we call Scienceography. Our study identifies broad patterns and trends in two example areas, computer science and mathematics, and highlights key differences in the way that science is written in these fields. (Party Puzzle: Which math operator symbol is most commonly used in CS vs Math?) Comments and suggestions for other studies are very welcome (yes, this is only the beginning). Also, a serious caveat: this is a data analysis paper, so don't expect well-packaged principles or theorems.
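
To make the "source code" idea concrete, here is a minimal sketch of the kind of mining that LaTeX source allows and a PDF does not: pulling out author comments and tallying operator symbols in inline math. This is an illustration, not the method used in the paper. The function names, the sample text, and the short operator list are all assumptions made for the example, and the parsing is deliberately naive (it ignores verbatim environments, display math, and macros defined by the author).

```python
import re
from collections import Counter

# A '%' starts a comment unless it is escaped by an odd number of backslashes.
_COMMENT = re.compile(r"(?<!\\)(?:\\\\)*%(.*)")
# Inline math only: $...$ on a single line (display math is ignored here).
_INLINE_MATH = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$")
# Hypothetical, incomplete operator list chosen for this example.
_OPERATOR = re.compile(r"\\(?:times|cdot|leq|geq|in)\b|[+\-=<>]")


def split_comment(line):
    """Return (code, comment) for one line; comment is None if absent."""
    m = _COMMENT.search(line)
    if m is None:
        return line, None
    # The '%' sits just before the captured comment text.
    return line[: m.start(1) - 1], m.group(1).strip()


def extract_comments(tex):
    """List the non-empty comments found in a LaTeX source string."""
    comments = (split_comment(line)[1] for line in tex.splitlines())
    return [c for c in comments if c]


def count_operators(tex):
    """Count operator tokens inside inline math, with comments removed."""
    body = "\n".join(split_comment(line)[0] for line in tex.splitlines())
    counts = Counter()
    for span in _INLINE_MATH.findall(body):
        counts.update(_OPERATOR.findall(span))
    return counts


sample = r"""
% TODO: tighten this bound
Let $a + b = c$ and $n \leq m + 1$. % is this right?
We save 50\% of the cost.
"""

print(extract_comments(sample))
print(count_operators(sample).most_common())
```

On the sample, the escaped `\%` in the last line is correctly left alone, while the two real comments are recovered. Running the same counters over a whole crawl, split by subject area, is one way to start on the party puzzle.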



