Natural Language Processing on the Web
Overview: Even with the advent of the semantic web, most of the content available on the web is still in natural language, more than half of it in English and a growing share in other languages. We will present some of the links (pun intended) between natural language processing (NLP) and the web: how NLP helps in processing information on the web, and how web technologies help in the development of NLP technologies.
READINGS
Lapalme, G. (2013) XML: Looking at the Forest Instead of the Trees
Lapalme, G., P. Langlais, and F. Gotti (2012) The Bilingual Concordancer TransSearch, NAACL 2012
Gotti, F., P. Langlais, and G. Lapalme (2014) Designing a Machine Translation System for Canadian Weather Warnings: a Case Study, Natural Language Engineering 20(3): 399-433
A recent fad in sentiment analysis is to create software that can detect sarcasm and false positives (spawned by the bid put out by the US Secret Service). Do you think it is possible to create NLP software that does this? It seems to me that a person detects sarcasm because she knows the context. If a celebrity wears a really ugly dress and someone tweets, "That celebrity's dress is the prettiest thing I have ever seen," only people who have seen the dress and can evaluate its prettiness can detect the tweet's sarcasm. Do you have any ideas for how a machine could overcome this?
I do not think this is a problem in the NLP domain. If you say the same thing to a human, he/she will not understand it without seeing the dress. So you have to show the dress to a machine and make it able to make an aesthetic judgement, which is something socially constructed, imho. So I would guess that what you are referring to is some sort of AI engine more than NLP software...
I categorize it in the NLP domain because it is a facet of analyzing human language to facilitate communication between man and machine. Perhaps sarcasm is a part of human language that requires contextual knowledge, so AI is necessary for the evolution of NLP. But I would not separate the two fields.
Perhaps the sarcasm detection software could analyze related tweets and deduce the intent behind the ugly dress comment. We can learn a lot about online content by looking at the content it is linked to.
I do not think that NLP will ever be able to detect the sarcasm in one specific tweet or message. The power of the web comes from the number of different e-mails, which makes it possible to detect tendencies and main ideas. This is the law of large numbers at work. I do not think that anything, or in fact any human, can appreciate sarcasm from a single message: you need to take the context into account. I would also be very surprised if the US Secret Service followed one specific message; they are interested in general trends, not specifics.
Dear Guy, thank you very much for this presentation! I have a bunch of questions. It's fascinating that robots are more present than humans on the web! So, will the global brain be made by or for robots first, rather than for humans? We should give more precise statistics about this interpretation. In this thread, what is the main role of NLP, and what is the main challenge for the future in building thinking machines? From the point of view of philosophy of mind, can NLP model intentionality?
Although there are a lot of robots on the Web, they do what humans programmed them to do. Would a community that shares all their knowledge through a library be more book than human? Is the collaboration of humans and libraries more for the books or for the humans?
I think the Global Brain is humans and the tools they use; made for humans and their desires. (That is of course if a Global Brain even exists).
I was just giving the information about the omnipresence of robots to motivate the need for NLP in organizing data on the web, so that it can be read by machines while being written by humans in the context of the semantic web. You should not extend this statement too far into philosophy or into the architecture of the human mind.
Natural language is deeply anchored in the world around us. With all the advancements in NLP, isn't it correct to say that computers will never fully understand natural language without knowledge of the world? This knowledge, however, is not fully encoded in language. Therefore, isn't it correct to say that computers must acquire wide cognitive competences in order to be able to interact in the physical dimension and, as a consequence, to evolve language capabilities?
In fact, humans also "will never fully understand natural language", so you should not ask too much from the machines.
You said that identical subjects, properties and objects must have the same URI. But sometimes one thing has two different names, as is the case for Venus and the evening star. What could be done in this kind of case?
There is a provision in all Semantic Web formalisms to assert that two URIs are equivalent, so that they are treated specially during inference. But correctly identifying an entity and linking it with (one of) the correct URI(s) is still a complex problem. You surely do not want to give a different URI to each occurrence of the same entity.
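To illustrate the equivalence provision mentioned in the reply above (in OWL it is written `owl:sameAs`), here is a minimal sketch. It is not how any particular triple store implements it: it simply groups URIs declared equivalent with a union-find structure and rewrites triples to one representative URI per group. The `example.org` URIs, the `UriAliases` class and `canonical_triples` are hypothetical names chosen for illustration.

```python
class UriAliases:
    """Groups URIs that have been asserted to denote the same entity."""

    def __init__(self):
        self._parent = {}  # URI -> parent URI; a root is its own parent

    def find(self, uri):
        """Return the representative URI of the group containing `uri`."""
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every URI on the path directly at the root.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def same_as(self, a, b):
        """Record that URIs `a` and `b` are equivalent."""
        self._parent[self.find(b)] = self.find(a)


def canonical_triples(triples, aliases):
    """Rewrite subjects and objects to their representative URI."""
    return {(aliases.find(s), p, aliases.find(o)) for s, p, o in triples}


# Hypothetical URIs, for illustration only.
EX = "http://example.org/"
aliases = UriAliases()
aliases.same_as(EX + "Venus", EX + "EveningStar")
aliases.same_as(EX + "Venus", EX + "MorningStar")

triples = {
    (EX + "EveningStar", EX + "orbits", EX + "Sun"),
    (EX + "MorningStar", EX + "orbits", EX + "Sun"),
}
# Both statements collapse into a single triple about one entity.
merged = canonical_triples(triples, aliases)
```

This only covers the bookkeeping once equivalences are asserted; deciding that two names in a text refer to the same entity is the hard part the reply points to.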
This was a very helpful presentation. Professor Lapalme mentioned that NLP is everywhere on the Web already. I have a naïve question. What would the Web look like without NLP? What are the main functionalities that we take for granted, which would no longer be available without NLP?
Without NLP, you would not have any search engine, nor Google Translate... so you would have to use keywords combined with logical operators to search for information on the web.
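As a rough picture of what "keywords combined with logical operators" means in practice, here is a minimal sketch of Boolean search over an inverted index. The three documents and the helper names `build_index` and `lookup` are hypothetical, and matching is on exact lowercased words only.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


def lookup(index, word):
    """Documents containing exactly this word (no stemming, no synonyms)."""
    return index.get(word.lower(), set())


# Hypothetical mini-collection, for illustration only.
docs = {
    1: "weather warning for Montreal",
    2: "weather forecast for Quebec",
    3: "storm warnings issued",
}
index = build_index(docs)

both = lookup(index, "weather") & lookup(index, "warning")     # AND
either = lookup(index, "weather") | lookup(index, "storm")     # OR
without = lookup(index, "weather") - lookup(index, "warning")  # AND NOT
```

Note that `warning` does not match document 3, which contains `warnings`. Closing that kind of gap is where even basic language processing, such as stemming, comes in.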
I would like to know how "Big Data" is related to NLP. Are they part of the same application area, or are they two distinct areas whose experts help each other?
Also, I hear many professionals say that they work in IR and others that they work in NLP. Are these two distinct application areas? Or is NLP simply a generalization of IR?
Strictly speaking, Big Data is independent of NLP, because one can imagine situations where there are large quantities of numerical data to process: e.g. results of scientific experiments, meteorological data, information coming from medical measuring devices, etc.
That said, the quantity of text available on the web now makes natural language a big data topic. We are in fact taking part in a project on "Big Text Data" funded by NSERC (CRSNG).
What is the difference between the field of IE and the field of IR?
IR (information retrieval) aims to find documents that deal with a given topic, and it is up to the reader to find the answer.
IE (information extraction) aims to give the answer directly, possibly by drawing information from one or more documents.
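The contrast between the two can be sketched on a toy collection. This is only an illustration under strong assumptions: the documents are invented, "retrieval" is reduced to substring matching, and "extraction" to a single hand-written regular expression, whereas real systems use ranking models and trained extractors. The names `retrieve` and `extract_warnings` are hypothetical.

```python
import re

# Hypothetical documents, for illustration only.
DOCS = {
    "bulletin-1": "Environment Canada issued a snowfall warning for Montreal on Monday.",
    "bulletin-2": "A heat warning for Toronto was issued on Friday.",
    "recipe-1": "Warm the soup before serving.",
}


def retrieve(topic, docs):
    """IR: return the ids of documents mentioning the topic.

    The reader still has to open each document to find the answer.
    """
    return sorted(d for d, text in docs.items() if topic.lower() in text.lower())


# One hand-written pattern: "<kind> warning for <place>".
WARNING = re.compile(r"(\w+) warning for (\w+)")


def extract_warnings(docs):
    """IE: return (kind, place) facts pulled directly out of the text."""
    facts = []
    for doc_id in sorted(docs):
        match = WARNING.search(docs[doc_id])
        if match:
            facts.append((match.group(1).lower(), match.group(2)))
    return facts


documents = retrieve("warning", DOCS)  # a list of documents to read
facts = extract_warnings(DOCS)         # the answers themselves
```

Here `retrieve` hands back two bulletins to read, while `extract_warnings` returns the pairs `("snowfall", "Montreal")` and `("heat", "Toronto")` directly.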
Are there any evolutionary methods in NLP, i.e. the use of genetic algorithms to breed better language processors? How is this field impacted by new methods of AI such as deep learning?
Thanks for a very clear talk!
I am not aware of any specific work (or success) in the area of genetic algorithms for NLP. For the moment, genetic algorithms and deep learning are more in the area of "perception" (basic letter or speech recognition) than in higher-level understanding.
Is there currently an implementation of a complete formal grammar of a natural language such as French or English? Is this an area that interests many researchers, or are we at a dead end? What do you think?
A natural language is not a formal language, because it evolves as people use it. There are nevertheless reasonably good parsers for English and French (Talismane for French, the Stanford Parser for English) that can provide an analysis for nearly 80% of common sentences. Since natural language is often ambiguous, their success rate suffers, but this is often sufficient for interesting applications: detection of spelling and syntax errors.
Very interesting talk and work, thank you. My question is about WEB-NLP-WEB: you said Google should be careful about using its own translations to improve its system, but we can see that Google allows people to contribute to these translations (and there are a lot of errors, even from people). Is this a good way to improve the power of NLP, or should we leave this task to linguists only?
You speak of the web as something important for the development of NLP. Some people go further and see a scientific revolution in it. E.g. Steadman (2013):
« The big data approach to intelligence gathering allows an analyst to get the full resolution on worldwide affairs. Nothing is lost from looking too closely at one particular section of data; nothing is lost from trying to get too wide a perspective on a situation that the fine detail is lost. The algorithms find the patterns and the hypothesis follows from the data. The analyst doesn't even have to bother proposing a hypothesis any more. »
It seems to me that Zodiac would be an example of this kind of thing: as you say, it is NLP without NLP, or one could say that it is NLP without a theory of NLP. Does that match your experience?
Zodiac is an example of an "automatic" process that we carry out without thinking much about it. A French speaker often manages to place accents correctly without much thought, and that is what Zodiac manages to do for now.
Moreover, Zodiac cannot decide when there is ambiguity. For example, in "je vais ou je viens", it is hard to judge whether it should be "où" or "ou".
But there remain several higher-level processes in NLP that cannot be handled so simply.
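To make the accent example concrete, here is a minimal sketch of context-free accent restoration. It assumes a lookup from unaccented forms to the accented forms seen in a training text, and always picks the most frequent one. This is not a description of how Zodiac itself is implemented; `train`, `restore`, `is_ambiguous` and the one-sentence corpus are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_accents(word):
    """Remove diacritics: 'été' -> 'ete', 'où' -> 'ou'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def train(corpus_words):
    """Count, for each unaccented form, the accented forms seen in the corpus."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        counts[strip_accents(word)][word] += 1
    return counts


def restore(text, counts):
    """Replace each word by its most frequent accented form, if one is known."""
    restored = []
    for word in text.split():
        candidates = counts.get(word)
        restored.append(candidates.most_common(1)[0][0] if candidates else word)
    return " ".join(restored)


def is_ambiguous(word, counts):
    """True when the corpus contains several accented forms of the same word."""
    return len(counts.get(word, ())) > 1


# Hypothetical training text, for illustration only.
corpus = "il a été au café où il va souvent ou il reste chez lui ou il dort".split()
counts = train(corpus)

fixed = restore("il a ete au cafe", counts)
unsure = is_ambiguous("ou", counts)
```

With this corpus, "il a ete au cafe" becomes "il a été au café", but "ou" is flagged as ambiguous: a frequency lookup always outputs the majority form and cannot tell "ou" from "où" without looking at the surrounding words.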
Given that this field is progressing, would you speculate on a future date range when speech recognition will be on par with, or exceed, the NLP of words? As Rachel referenced, there must be contextual information in the tone of speech and in the visual representation in the S-O-P that would support deeper analysis.
On Weitas comment - I think there is a huge proportion of content-driven articles and blogs that not only have the immediate context of the particular web page but are also embedded in a larger social discourse. For these reasons, I think it is logical to see NLP as the foundation on which to extrapolate this research. If I am wrong, then a response should still address the need to classify this contextual information.
Context, as you point out, is a difficult problem even for humans! Given the fact that the best NLP systems have trouble with sentences that even two-year-old children handle correctly, we still have plenty of progress to make.
Of course, for things that rely on memory or on rote learning, machines are much better than humans (as for computing with numbers), but there is still plenty of work to be done by people like you.
Nice talk! I really liked the highlight that the web is still full of human language. I found this a good complement to all the talks we heard on the semantic web and data mining. I also liked the fact that NLP tries to build a link between our language and that of the semantic web.
I have two (probably very naive) questions:
1. What problems can NLP address that escape other strategies of data mining on the semantic web?
2. What about a tool that puts both strategies together? (semantic web data mining + natural language processing) Does it exist? What does it do and what results has it produced so far?
Most data mining works with numerical values (at best with ordinal scales). NLP tries to grasp the "meaning" of the information on the web, but then the question remains how to combine this information with other data to create useful summaries that can be combined with other big data.
As for the second question, I do not have a good successful example at hand, even though you will hear some salesmen brag that they can do it. But I have yet to see a convincing example. It will be your job to do it, as you are just starting in this area.
You mentioned that NLP helps the web, and also that the web helps NLP. Another way in which the web helps NLP is Google's "did you mean ________" feature. With this feature, they are able to create a catalog of common spelling mistakes associated with each English word.
This is a clear case of the law of large numbers. Most people type the correct spelling on the web, so when a given bad spelling occurs, it will be detected because it does not correspond to a good word, and the system will suggest the most common spelling that is "close enough" to it. "Close enough" is computed by determining the number of editing steps that would be needed to go from the bad spelling to a good one.
This is indeed yet another good example of the fact that the web can help NLP: it is mostly mechanical, but it works and is quite useful.
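The "close enough" computation described above can be sketched with the classic Levenshtein edit distance (insertions, deletions and substitutions of single characters). This is a minimal sketch under stated assumptions, not a description of Google's actual system: the word frequencies are invented, and `suggest` and the threshold `max_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits from a to b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def suggest(word, frequencies, max_distance=2):
    """Return the most frequent known word within max_distance edits, or None."""
    if word in frequencies:
        return word  # already a known spelling
    close = [w for w in frequencies if edit_distance(word, w) <= max_distance]
    if not close:
        return None
    return max(close, key=lambda w: frequencies[w])


# Invented counts standing in for how often each spelling is seen on the web.
frequencies = {"language": 950, "luggage": 120, "natural": 800, "processing": 700}

suggestion = suggest("langage", frequencies)
```

Here "langage" is one edit away from "language" and two away from "luggage"; both are close enough, and the more frequent "language" is suggested.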
Would it be possible to use NLP to convert natural language into a computer programming language? For example, I could describe a web page and the computer would write the HTML/CSS for me. Or perhaps you could describe to the computer what you wanted a program to do, and it would write the code for you.