Sunday, 8 June 2014


Scholarly Big Data: Information Extraction and Data Mining

Pennsylvania State University
Information Sciences and Technology




Overview: Collections of scholarly documents are usually not thought of as big data. However, large collections of scholarly documents often contain many millions of publications, authors, citations, equations, figures, etc., along with large-scale related data and structures such as social networks, slides, data sets, etc. We discuss scholarly big data challenges, insights, methodologies and applications. We illustrate scholarly big data issues with examples of specialized search engines and recommendation systems based on the SeerSuite software. Using information extraction and data mining, we illustrate applications in such diverse areas as computer science, chemistry, archaeology, acknowledgements, citation recommendation, collaboration recommendation, and others.

READINGS:
    Khabsa, M., & Giles, C. L. (2014). The Number of Scholarly Documents on the Public Web. PLOS ONE. doi:10.1371/journal.pone.0093949
    Caragea, C., Wu, J., Ciobanu, A., Williams, K., Fernández-Ramírez, J., Chen, H. H., ... & Giles, L. (2014). CiteSeerX: A Scholarly Big Dataset. In Advances in Information Retrieval (pp. 311-322). Springer International Publishing.
    Flake, G. W., Lawrence, S., Giles, C. L., & Coetzee, F. M. (2002). Self-organization and identification of web communities. Computer, 35(3), 66-70.

20 comments:

  1. Giles says that certain name ethnicities are more or less inclined to collaborate with people whose names belong to the same ethnicity as their own. It reminds me of when Mr. Lariviere discussed trends in international collaboration. It would be interesting to compare Giles' data with that of Lariviere and see whether the name ethnicities that prefer interethnic collaboration correlate with the countries and institutions that tend to collaborate internationally (Lariviere mentioned Belgium, Netherlands, Luxembourg).

    Replies
    1. This is a great idea. Keep in mind we only look at 12 name ethnicities. However, the method can be extended to others given additional name-ethnicity data.

  2. CiteSeerX: Why just PDFs? http://citeseerx.ist.psu.edu/

    Replies
    1. Other formats can be used, but most scholarly documents today are in PDF format.

  3. Name disambiguation: Why MeSH rather than most frequent non-stoplist words?

  4. Dr. Lee Giles, I am interested in learning more about the experimental collaborator recommendation system. Is there a web site or e-mail address for it? Thank you.

    Replies
    1. We have several papers on this; nearly all are available from my homepage at http://clgiles.ist.psu.edu

      The collaboration webpage is
      http://collabseer.ist.psu.edu

  5. I have a number of questions for Professor Giles. First, compared to other kinds of big data (in, say, astronomy, etc.), exactly how ‘big’ is scholarly big data? Professor Giles also mentioned that some disciplines have a higher proportion of open access papers than others. Why do you think papers in computer science (about 50% of which are open access) are more widely available than papers from other domains of research? I also would like to know what you consider the main challenges or obstacles to crawling, mining and archiving this magnitude of data.

    Replies
    1. The least open access field on that list was agriculture. I imagine that's at the bottom of the list because new knowledge in this field is easily applicable to industry to make a lot of money. I imagine computer science and physics researchers need to collaborate more to advance their field and that the findings are not immediately applicable to earn a profit.

    2. I agree with you that profitability is probably a major point here, Robert. I also think the differences between these fields have a lot to do with the kind of data gathering and information processing required by each field. One could not practice, say, astronomy, without sharing (huge amounts of) data today. The same probably isn’t (yet) true for agricultural science.

    3. Good questions.

      Scholarly big data can be measured in several ways: by number of entities and by size. Our estimate is that there are at least 100 million scholarly documents, which should be about 100 TB in size; text would be half or less. Each of these documents contains several authors, many citations and other useful metadata such as tables, figures, methodologies, etc. The number of citations should be in the billions. If data related to the document is stored, the size increases significantly. For example, link a paper in astronomy to the astronomy data that was processed and used. It would be easy to see such data comprising a petabyte.
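
As a rough check on the figures above, the arithmetic can be sketched in a few lines. Only the 100 million documents, the 100 TB total and the "half or less" text share come from the reply; the decimal unit definitions (1 TB = 10^12 bytes) and the variable names are assumptions made for illustration.

```python
# Back-of-the-envelope check of the corpus size estimate quoted in the reply.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TB = 10**12
MB = 10**6

num_documents = 100_000_000   # "at least 100 million scholarly documents"
corpus_bytes = 100 * TB       # "about 100 TB in size"

# Average size of one document implied by the two figures.
avg_doc_mb = corpus_bytes / num_documents / MB

# "Text would be half or less": treat one half as an upper bound.
max_text_tb = corpus_bytes / 2 / TB

print(f"average document size: {avg_doc_mb:.1f} MB")
print(f"upper bound on text: {max_text_tb:.0f} TB")
```

Under these assumptions the estimate implies roughly 1 MB per document on average, with at most about 50 TB of that being text.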

      It's important to note that computer scientists and physicists have a long tradition of putting their papers online: physicists in the arXiv and computer scientists on their homepages. Also note that much of computer science has commercial applications as well, so profitability alone does not explain the difference. It could be a research culture issue. It seems that any field computer science touches also ends up making its papers more accessible.

  6. I have read that regular NLP engines use Wikipedia to create ontologies. Because your data processors are in specialized fields like chemistry and computer science, what do you use to develop your knowledge structures? It seems to me that the challenge in scholarly big data lies in accuracy in sentiment analysis, NLP, NER, page scraping, etc. rather than actually analyzing the data.

  7. It is a pity that so few articles are freely accessible. Isn't knowledge something public? Where do we draw the line between private and public when it comes to knowledge?

    Replies
    1. What the creator wants to sell is sold. What the creator wants to give away is free. Researchers are not looking to sell their articles...

  8. Somewhere in this talk, Professor Giles mentioned the Jeopardy! game played by IBM's Watson engine. I'm wondering whether CiteSeer incorporates some of Watson's abilities (search methods, ranking methods, etc.).

    Replies
    1. In a way, we already have in terms of infrastructure. Watson was using a Solr/Lucene indexer and so do we.

  9. Dear Lee, thank you for this nice presentation! I have one question about structuring and applying data mining: can you explain to me why epistemic logic is not used more to formalize extraction rules or categories of results in scholarly big data? I would like to think that epistemic logic offers great services for thinking about knowledge and its organisation in academia.

  10. I am somewhat dissatisfied with some of the assumptions made in the paper "The Number of Scholarly Documents on the Public Web". First, when estimating the number of publications available on the web, the authors assumed that each search engine is a random sample of articles. We know this is not true in general, since search engines use crawlers and other algorithms to index the web methodically (as the authors mention in the context of automated requests), unless scholarly papers are somehow an exception. I have no solution to this problem, though. Second, with respect to the field-level analysis, I think it is a stretch to assume that a paper and its references come from the same field of study. A major theme in web science and at this conference is interdisciplinary research, and this assumption runs contrary to it. Wouldn't it have been more accurate to use the journal in which the paper was published to determine its field of study?
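
The random-sampling objection above concerns the independence assumption behind capture/recapture estimates of the kind used in the paper. A minimal sketch, assuming the simple two-source Lincoln-Petersen estimator (not the authors' code), shows how the estimate moves when two engines overlap more than independence would predict. The function name and all counts are made up for illustration.

```python
def lincoln_petersen(n1: int, n2: int, overlap: int) -> float:
    """Two-source capture/recapture estimate of a population size.

    n1, n2  -- number of items found by source 1 and by source 2
    overlap -- number of items found by both sources
    Assumes the two sources sample the population independently.
    """
    if overlap <= 0:
        raise ValueError("overlap must be positive to form an estimate")
    return n1 * n2 / overlap


# Hypothetical population of 1200 documents; engines index 600 and 400.
# If the engines were independent, the expected overlap would be
# 600 * 400 / 1200 = 200, and the estimator recovers the true size.
independent_estimate = lincoln_petersen(600, 400, 200)

# If both engines favour the same easy-to-crawl documents, the overlap
# is larger (say 300) and the same formula underestimates the total.
correlated_estimate = lincoln_petersen(600, 400, 300)

print(independent_estimate)  # 1200.0
print(correlated_estimate)   # 800.0
```

In this toy setting, positively correlated coverage makes the estimate too low, which is the direction of bias the comment worries about.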

  11. Thanks for sharing this nice post with us. Nowadays web scraping is important in every sector for collecting data and information, because collecting data manually can consume precious time. The best way to collect the data is with the help of webcontentextractor, which can save you time.
