Web Science and the Mind

Sunday, 8 June 2014

Scholarly Big Data: Information Extraction and Data Mining

LEE GILES

Pennsylvania State University
Information Sciences and Technology

Overview: Collections of scholarly documents are usually not thought of as big data. However, large collections of scholarly documents often have many millions of publications, authors, citations, equations, figures, etc., and large scale related data and structures such as social networks, slides, data sets, etc. We discuss scholarly big data challenges, insights, methodologies and applications. We illustrate scholarly big data issues with examples of specialized search engines and recommendation systems based on the SeerSuite software. Using information extraction and data mining, we illustrate applications in such diverse areas as computer science, chemistry, archaeology, acknowledgements, citation recommendation, collaboration recommendation, and others.

READINGS:

Khabsa, M & Giles, C.L. (2014) The Number of Scholarly Documents on the Web. PLOS ONE 10.1371/journal.pone.0093949
Caragea, C., Wu, J., Ciobanu, A., Williams, K., Fernandez-Ramrez, J., Chen, H. H., ... & Giles, L. (2014). CiteSeer x: A Scholarly Big Dataset. In Advances in Information Retrieval (pp. 311-322). Springer International Publishing.
Flake, G. W., Lawrence, S., Giles, C. L., & Coetzee, F. M. (2002). Self-organization and identification of web communities. Computer, 35(3), 66-70.

20 comments:

Rachel9 July 2014 at 12:29
Giles says that certain name ethnicities are more and less inclined to collaborate with people whose names belong to the same ethnicity as their own. It reminds me of when Mr. Lariviere discussed trends in international collaboration. It would be interesting to compare Giles' data with that of Lariviere and see if the name ethnicities that prefer interethnic collaboration correlate with the countries/ institutions who like to collaborate internationally (Lariviere mentioned Belgium, Netherlands, Luxembourg).
ReplyDelete
Replies
Instructor9 July 2014 at 12:36
Citeseerx: Why just PDFs? http://citeseerx.ist.psu.edu/
ReplyDelete
Replies
Louis Chartrand9 July 2014 at 12:44
Name disambiguation: Why MeSH rather than most frequent non-stoplist words?
ReplyDelete
Replies
Unknown9 July 2014 at 12:49
Dr. LEE GILES. I am interested to know more details about the experimental collaborator recommendation system. Do you have a web site or school e-mail address about it? Thank you.
ReplyDelete
Replies
Maxwell J. D. Ramstead9 July 2014 at 12:52
I have a number of questions for Professor Giles. First, compared to other kinds of big data (in, say, astronomy, etc.), exactly how ‘big’ is scholarly big data? Professor Giles also mentioned that some disciplines have a higher proportion of open access papers than others. Why do you think papers in computer science (about 50% of which are open access) are more widely available than papers from other domains of research? I also would like to know what you consider the main challenges or obstacles to crawling, mining and archiving this magnitude of data.
ReplyDelete
Replies
Rachel9 July 2014 at 13:00
I have read that regular NLP engines use Wikipedia to create ontologies. Because your data processors are in specialized fields like chemistry and computer science, what do you use to develop your knowledge structures? It seems to me that the challenge in scholarly big data lies in accuracy in sentiment analysis, NLP, NER, page scraping, etc. rather than actually analyzing the data.
ReplyDelete
Replies
Unknown9 July 2014 at 13:02
C’est dommage que si peu d’articles soient accessibles gratuitement. La connaissance n’est-elle pas quelque chose de public? Ou place-t-on la frontière entre le privé et le public au sujet de la connaissance?
ReplyDelete
Replies
Eltaani Redha9 July 2014 at 13:03
Somewhere in this talk, Professor Giles mentioned a jeopardy game with Watson IBM engine, I'm wondering if in citeseer they incorporate some of the Watsons ability (searching methods, ranking methods..) !!
ReplyDelete
Replies
Unknown16 July 2014 at 19:46
Dear Lee, Thank you for this nice presentation ! I have one question: To structure and apply data mining, can you explain me why epistemic logic is not more used to formalize rules of extraction or categories of results in scholarly big data? I would like to think that epistemic logic offer great services to think about knowledges and their organisation in academics.
ReplyDelete
Replies
Unknown10 August 2014 at 10:28
I am somewhat dissatisfied with some of the assumptions made in "The Number of Scholarly Documents on the Public Web" paper. First, when estimating the number of publications available on the web, the authors assumed that each search engine was a random sampling of articles, but we know this is not true since they utilize crawlers and other algorithms to methodically index the web as mentioned by the authors in the context of automated requests, unless this is not the case for scholarly papers. Although, I have no solution for how to remedy this problem. Secondly, with respect to the field level analysis, I think it is a stretch to assume that a paper and its references come from the same field of study. A major theme in web science and this conference is inter-disciplinary research and this assumption runs contrary to that. Wouldn't it have been more accurate to use the journal that the paper was published in to determine field of study?
ReplyDelete
Replies
Unknown6 May 2015 at 03:51
Thanks for sharing this nice post with us. Now-a-days web scraping is very important in every sector to collect the data and information. Because while manually collecting the data can consume the precious time. So best way to collection the data and all necessary things with the help of webcontentextractor which can save your time.
ReplyDelete
Replies

Add comment