Sunday 8 June 2014


Open Science and the Web

TONY HEY
Microsoft Research Connections

VIDEO


OVERVIEW: Turing Award winner Jim Gray envisioned a world where all research literature and all research data were online and interoperable. He believed that such a distributed, global digital library could significantly increase the research "information velocity" and improve the scientific productivity of researchers. The last decade has seen significant progress in the move towards open access to scholarly research publications and the removal of barriers to access and re-use. But barrier-free access to the literature alone only scratches the surface of what the revolution of data-intensive science promises. Recently, in the US, the White House has called for federal agencies to make all research outputs (publications and data) openly available. But to make this effort effective, researchers need better tools to capture and curate their data, and Jim Gray called for 'letting 100 flowers bloom' when it came to research data tools. Universities have the opportunity and the obligation to cultivate the next generation of professional data scientists who can help define, build, manage, and preserve the necessary data infrastructure. This talk will cover some of the recent progress made in open access and open data, and will discuss some of the opportunities ahead.

READINGS:
    Fox, G., Hey, T., & Trefethen, A. (2013). Where does all the data come from? In Data-Intensive Science (p. 115).
    Hey, T. (2010). The next scientific revolution. Harvard Business Review, 88(11), 56-63.
    Hey, T., Tansley, S., & Tolle, K. (Eds.) (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research.

http://research.microsoft.com/en-us/collaboration/fourthparadigm/default.aspx 
http://eprints.rclis.org/9202/1/heyhey_final_web.pdf




33 comments:

  1. With the increasing popularity of machine learning in scientific research, are we moving away from research constrained by a priori hypotheses? What are the implications of favouring data-driven over hypothesis-driven research?

    Replies
    1. The two are complementary! One does not rule out or supersede the other.

    2. I think that with regard to data-driven research, it becomes important that those conducting the research are experts well versed in the scientific problems, rather than just generalized "data experts", as Tony Hey mentioned. Machine learning can then be a useful tool for researchers who are already immersed in a field but are limited by their ability to process large volumes of data.

  2. Tony Hey: "The data will be the next battle ground". Dame Wendy Hall mentioned yesterday that the software used for analysing the data should also be published together with the research results, data, etc.

    I think all the macros, scripts, software packages, etc. should be made available in order for the research to be really reproducible. It could work just like open-source software development: you clone a repository, you compile the source, you run a script, and that reproduces the whole analysis.

    There are two problems with that. First, you cannot publish commercial software (e.g. from MS :)). The second is psychological: making the whole research procedure accessible increases the probability that others will find mistakes in the research...
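
    A minimal sketch of what such a one-command reproduction entry point could look like, in Python (the URL, file names, column name and checksum below are hypothetical placeholders, not any real project's):

      # reproduce.py -- toy sketch of a "clone, run, reproduce" entry point.
      # One script fetches the exact archived dataset, checks its integrity
      # against a checksum published with the paper, and reruns the analysis.
      import csv
      import hashlib
      import pathlib
      import urllib.request

      DATA_URL = "http://example.org/study/raw.csv"  # hypothetical data archive
      EXPECTED_SHA256 = "<checksum published with the paper>"

      def fetch_data(path="raw.csv"):
          """Download the archived dataset and verify it is byte-identical."""
          if not pathlib.Path(path).exists():
              urllib.request.urlretrieve(DATA_URL, path)
          digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
          assert digest == EXPECTED_SHA256, "data differ from the published version"
          return path

      def analyse(path):
          """Rerun the (toy) analysis: here, just the mean of one column."""
          with open(path, newline="") as f:
              values = [float(row["measurement"]) for row in csv.DictReader(f)]
          return sum(values) / len(values)

      if __name__ == "__main__":
          print("reproduced result:", analyse(fetch_data()))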

    Replies
    1. To make all software packages publicly available, the government or some other entity would need to fund the developers. Academics are funded and can afford to publish open-access data. Non-academic software developers, in contrast, need to charge to make a living.

    2. A partial solution would be not to publish the software packages, but to use open formats for the data: http://en.wikipedia.org/wiki/Open_format. Something like Schema.org or Linked Data.
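
      As a toy illustration, describing a dataset in an open, machine-readable format (here JSON-LD with Schema.org vocabulary) needs nothing proprietary to write or read; the field values below are invented:

        # open_format.py -- write and read a dataset description in an open
        # format using only the Python standard library. Values are made up.
        import json

        record = {
            "@context": "http://schema.org",
            "@type": "Dataset",
            "name": "Example survey responses",
            "description": "Raw responses collected for a hypothetical study.",
            "url": "http://example.org/study/raw.csv",
        }

        # Any JSON-aware tool can consume this; no proprietary package is needed.
        with open("dataset.jsonld", "w") as f:
            json.dump(record, f, indent=2)

        with open("dataset.jsonld") as f:
            print(json.load(f)["name"])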

  3. Is there a central Open Access repository that stands out and is becoming the standard for depositing scientific articles at universities?

    Replies
    1. No need for a central repository: the authors' institutional repositories suffice, and their contents are then harvested by Google Scholar, SCOPUS and eventually WoS. One does not deposit directly into Google either...

  4. I find the idea of a “fourth paradigm” in scientific research quite fascinating. Tony Hey has mentioned the appearance of specialized “data analysts.” My question for Tony Hey is whether or not he thinks that this “fourth paradigm” might lead to a more extensive division of labor in research, such that some members of the research community might specialize in analyzing publicly available data sets, rather than generating new data themselves.

  5. Are there any harms in making scientific articles completely public as Open Access?

    Even articles that have not been peer-reviewed could be put in an Open Access repository. How do we verify the validity of these articles? Is there a level of trust? Do the administrators of Open Access repositories evaluate these articles against some criteria?

    Replies
    1. Institutional repositories have metadata tags indicating whether the text is refereed, published, etc.

  6. Do we need, in future, an organisation like the W3C to manage open data access, in order to harmonise metadata and also the data itself?

  7. About the idea that scientists and programmers take so much time to get working together: how do we speed that up? When you're dealing with teams of students, six months over and over is a lot of time!

  8. How do researchers comply with the new requirement that NIH-funded research must be published open access? Don't most academic journals reserve the rights to the publication? Does this force researchers to publish only in certain journals, or to acknowledge NIH funding on only some of their work?

    The journals that do give the option to publish open access charge upwards of $2,000 to do so. Does the researcher have to front this money?

    Finally, if all researchers begin publishing in repositories such as arXiv, what will we use as 'academic' currency? Currently the impact factor of journals seems to matter the most. Will we take a positive step and start evaluating each paper individually? Perhaps through the number of citations to the specific paper rather than the journal?

    Replies
    1. The solution for access during publisher open-access embargoes is to deposit immediately and let users rely on the repository's "Almost-OA" Button: http://j.mp/CopyRequestButton

      Arxiv contains both pre-refereeing preprints and refereed postprints. Peer review is still essential, and metadata tags indicate whether or not a paper is refereed.

    2. Interesting system! Glad it exists.

  9. Thanks, Tony Hey. Interesting comments about Watson on Jeopardy. I agree, it is really a simulation of just one part of intelligence.

    Replies
    1. Every time something is achieved in the field of artificial intelligence, people say that it was not 'real intelligence'. I hope that someday this line of thinking will reach the point where we will have to admit that 'real intelligence' does not exist at all and that we are all philosophical zombies :).

    2. I am sure that I am conscious and that I can feel (based on Descartes' proposition). Therefore, I cannot be a zombie.

      Whether we want to call feeling 'real intelligence' is another issue. I think we should say that AI is nearing 'human capacity', or nearing 'the ability to feel as we humans do', rather than using the term 'real intelligence'.

    3. Philosophical zombies? http://users.ecs.soton.ac.uk/harnad/Papers/Harnad/harnad95.zombies.html

    4. It's not that feeling is real intelligence. It is that real intelligence is felt.

  10. In the very near future, all data management and processing will be conducted by machine agents such as Watson (who 'knows' nothing about 'anything'). Already today, Watson supports medical diagnostic decision-making based on huge databases of medical data. Another example is a machine AI agent becoming a board member of a venture capital company: http://www.huffingtonpost.co.uk/2014/05/15/artificial-intelligence-board-directors_n_5329370.html. The point is that raw data will become opaque to human agents and will be mostly, if not entirely, mediated by machines with increasing processing and cognitive competences.

    Replies
    1. Isn't this perspective a little bit radical, and even scary? Are "machines with increasing processing and cognitive competences" capable of the judgement, the creativity to come up with new ideas or questions, and the flexibility in the face of new challenges that a human being is capable of?
      Who supports medical diagnostic decisions based on machine analysis of medical data? I can't help but disagree. Medicine is one of the disciplines in which I believe a machine will never replace the judgement and flexibility of a skilled doctor. It is a discipline based on repeated observation, not only on information analysis (leaving aside the fact that part of the "medical data" is biased and affected by many private interests and by the variety of methods). Every patient is a unique human being, and algorithms may not apply to each one of them. Patients from different countries or contexts often need different approaches. Is a machine sensitive enough to perceive those subtleties and deal with them? I doubt it.
      Medicine is just an example, but even though I think the emergence of these new technologies and analysis possibilities is great, we should be very careful not to conclude prematurely that "machine agents conducting data management and processing" will replace human understanding of the phenomena around us.

    2. Or did I misunderstand your comment? I do think these tools are useful, but I consider them only that: tools. Never a replacement for what a human mind can do.

    3. Human doctors, together with machine intelligence that can compare individual cases to huge databases of historical cases, will offer diagnostic and prognostic capabilities beyond anything available today. I do not think that we are at the stage of comparing the capacities of machine agents to those of humans; that might take a long while. Still, the complementarity of human and AI agents is a huge advantage. This, of course, cannot correct the unfortunate situation of corruption and private interests, but then that is not a problem of machines but of humans. The practices of human doctors are also often biased or seriously distorted by personal interests, pressures, etc., so we cannot blame the machines for that...

  11. The dataset visualisation tools are very interesting artifacts (especially the maps). Can those tools link different datasets directly on the map, in an interactive way?

  12. I believe that universal open science is a wonderful goal to pursue, despite the conflicts that may arise from the economic interests of some private parties. I also believe the obstacles vary according to the scientific discipline.
    In medicine, for example, the existence of free platforms that gather recent advances in medical knowledge (e.g., Medscape, Epocrates) has changed the clinical practice of their users for good. Having immediate access to trustworthy medical information is definitely beneficial, not only for clinicians but also for their patients. However, given the private interests of the pharmaceutical industry, I believe universal access to the original data behind some of those publications is still a long way off. Also, some people believe that the universalization of medical knowledge may lead to self-medication and misunderstanding of the information by patients, increasing the risk of worsening disease and of side effects of certain drugs.
    How can we counter these arguments and spread the idea of open science to as many knowledge areas as possible?

  13. I also have a question on Open Data: depending on where and how a research project is pursued, differences in methodology and data collection may arise. Is there any kind of regulation software or tool through which we can assess the quality and homogeneity of the uploaded data, so that the analysis is valid everywhere?
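
    To illustrate the kind of automated check I mean, here is a toy sketch in Python (the required columns and the completeness threshold are invented for the example):

      # quality_check.py -- toy sketch of a homogeneity check that a repository
      # could run on uploaded tabular data before accepting it.
      import csv
      import sys

      REQUIRED_COLUMNS = {"subject_id", "measurement", "unit", "collected_on"}
      MIN_COMPLETENESS = 0.95  # at least 95% of cells must be non-empty

      def check(path):
          with open(path, newline="") as f:
              reader = csv.DictReader(f)
              missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
              if missing:
                  return False, "missing columns: " + ", ".join(sorted(missing))
              rows = list(reader)
          cells = [v for row in rows for v in row.values()]
          filled = sum(1 for v in cells if isinstance(v, str) and v.strip())
          completeness = filled / max(len(cells), 1)
          if completeness < MIN_COMPLETENESS:
              return False, "only %.0f%% of cells are filled" % (100 * completeness)
          return True, "ok"

      if __name__ == "__main__":
          valid, reason = check(sys.argv[1])
          print("accepted" if valid else "rejected:", reason)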

    Replies
    1. Not yet. Far from it! Especially across all possible fields!

  14. I have been very interested in the idea of an open-source ontology, an ambitious project. Could anyone give me some references on it?

  15. I’m very interested in the ideas of accessibility and research "information velocity". I don’t think accessibility and connectivity between researchers and documents are sufficient to benefit from all of the content of a scientific document. For my PhD project, I’m working on the notion of readability and the transmission of complex ideas, and I’m studying how Wikipedia can help a reader gain a better understanding of a scientific article. How possible, and how urgent, do you think it is to improve the infrastructure and connectivity between documents, to improve the ease and velocity of transmitting innovations and complex scientific models?

  16. The fourth paradigm sounds like a digitized and automated form of grounded theory (http://www.groundedtheoryonline.com/what-is-grounded-theory).
