Thursday, May 24, 2007

Data analysis

Data analysis presents many problems in practice. One first has to collect all the relevant data, then clean it, because most data sets have many problems (missing values, mismatched items, nulls, fictional entries, and so on), and then explore the distributions before one can say anything meaningful. We need more research on putting some principles behind this process of data cleaning and on addressing data quality problems.

After the (hunting and) gathering and the cleaning, one gets to the fun part. Vincent Matossian recently sent me an analysis of the coauthor relationship in CiteSeer, in particular the degree distribution of coauthors. He also sent me a picture of my coauthors as an example (I am sure there are other examples that are more fun).
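For readers who want to try something similar, here is a minimal sketch in Python of how a coauthor degree distribution could be computed. It is not Vincent's actual code. The input format (one author string per paper, with names joined by " and " as in BibTeX), the function names, and the toy records are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def coauthor_degrees(author_fields):
    """Map each author name to the number of distinct coauthors.

    Assumes each entry is a BibTeX-style author string in which
    names are separated by the keyword " and ".
    """
    neighbors = {}
    for field in author_fields:
        # Split on " and " and drop empty fragments.
        names = {n.strip() for n in field.split(" and ") if n.strip()}
        for name in names:
            neighbors.setdefault(name, set())
        # Every pair of names on the same paper is a coauthor link.
        for a, b in combinations(sorted(names), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {name: len(others) for name, others in neighbors.items()}


def degree_distribution(degrees):
    """Count how many authors have each degree."""
    return Counter(degrees.values())


# Toy records, invented for illustration only (not CiteSeer data).
papers = [
    "A. Author and B. Builder and C. Coder",
    "A. Author and D. Designer",
    "E. Solo",
]

degrees = coauthor_degrees(papers)
distribution = degree_distribution(degrees)
```

Note that this sketch compares names as raw strings. Two spellings of the same person would count as two separate authors, and two different people who share a name would be merged into one, so the results are only as good as the name cleaning that comes before.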


Anonymous said...

If you look at the data, the high end of the distribution is somewhat suspect. J. Dongarra and Jack Dongarra appear separately (both with over 300 co-authors). Presumably, he showed up twice in other people's lists as well. Moreover, several of the top collaborators could easily be the combination of different people given the relatively common names.

10:09 PM  
vincent said...

Thanks for linking to the data! Since I'm the one who assembled it, I should preface it by stating that this data is clearly ambiguous: the names of the authors were extracted from bib files provided by CiteSeer and are not disambiguated by institution. I also took the names just as they appeared in the authors field and used the "and" keyword as the separator. So 'Y. Zhang' certainly does not point to one individual but probably to dozens. For less common names the data can still reveal interesting evidence. Feel free to email me at vastinnocentaims at gmail.

8:47 AM  
