Sunday, 8 June 2014


Mining Patterns from Linked Data







OVERVIEW: The Web of Data (WoD) can be seen as global database made of multiple datasets. These datasets are published separately — by using new or reusing existing schemas on the Web — yet get interlinked through either direct references between data items or indirect ones, i.e., identity links between items representing the same entity. The technology underlying the WoD, called Linked Data (LD) allows for the construction of a global data graph in which data items are vertices related by edges of different nature. Entities, aka resources, as well as their links, aka properties, are globally identified through URLs. Beside this inherent graph structure, parts of the WoD can behave as a traditional, i.e., relational, database. 
    After substantial efforts on the standards for publishing and querying of LD on the Web, and lately the interlinking and cleansing of sets of LD, the next big issue is properly extracting new knowledge from the WoD. Data Mining (DM) discipline is about finding chunks of useful knowledge hidden in the data. DM methods are roughly divided into predictive ones, where past experience is analyzed in order to guess what the outcome of an unfolding situation, and descriptive ones whose aim is to provide insights into the regularities in the data without a specific goal. Mining LD is both useful and challenging for many reasons, not the least among them being the rich and complex graph structure induced by a large variety of link types, the availability of domain knowledge expressed as schemas, and even fully-blown ontologies, the heterogeneity in the modelling goals behind individual datasets, etc.
    In this talk we discuss the implications of LD for a specific branch of descriptive DM, called pattern mining. We present two different mining methods for that are complementary in many respects. The first one targets usage regularities: It analyses the consumption of resources from the WoD by the users of a specific semantic application and summarizes it as behavioural patterns. The second one mines purely descriptive patterns from a dataset of multiple resource types, which are expressed in a WoD-compliant language and therefore supports ontology design.

READINGS:
    M Rouane-Hacene, M Huchard, A Napoli, P Valtchev, Relational concept analysis: mining concept lattices from multi-relational data Annals of Mathematics and Artificial Intelligence 67 (1), 81-108, 2013
    MH Rouane, M Huchard, A Napoli, P Valtchev, A proposal for combining formal concept analysis and description logics for mining relational data Formal Concept Analysis (vol. of LNCS), 51-65, Springer, 2007
    M Adda, P Valtchev, R Missaoui, C Djeraba, A framework for mining meaningful usage patterns within a semantically enhanced web portal Proc. of the Third C* Conf. on Computer Science and Software,138-147, ACM, 2010
    M Adda, P Valtchev, R Missaoui, C Djeraba, Toward recommendation based on ontology-powered web-usage mining IEEE Internet Computing 11 (4), 45-52, 2007

29 comments:

  1. Dans le «Matching Function» qui tente de trouver la classe correspondant à un terme (Ex. Montréal -> Ville), comment fait-on pour faire la correspondance? Est-ce qu'il faut faire cette correspondance manuellement? Ou bien on doit effectuer une requête SPARQL sur LD et ainsi voir que Montréal est une Ville?

    Si tel est le cas, qu'est-ce qui se passe si nous avons une ville X ne se trouvant pas dans LD?

    ReplyDelete
    Replies
    1. L'algorithme d'appariement entre un motif (pattern) et une séquence de données LD est fondé sur l'hypothèse que le schéma pertinent est récupéré au préalable. Lorsqu'on développe une application au-dessus d'un jeux de données LD il semble raisonnable de s'assurer que leur schéma est disponible localement.

      Pour vérifier qu'un type de resource, comme Ville par exemple, est compatible avec une resource individuelle, soit Montréal, on peut soit passer par une API de manipulation de schéma RDFS (façon plus ancienne), soit par un end-point SPARQL (plus moderne).

      Dans les deux cas, un algorithme de fouille de données (data mining) qui est très soucieux de ses performances en temps, utilisera une version pré-compilé du schéma. Ceci lui permet de recevoir toutes les réponses à des requêtes de type de resource, de sous-classe, de sous-propriété, de domaine et de co-domaine de propriété, etc. en temps quai-constant. Ainsi, avant même de commencer la fouille, l'algorithme s'occupe de créer cette version pré-compilée en émettant toutes les requêtes imaginables sur le schéma et en enregistrant les réponses dans une structure de données à accès rapide.

      Pour la dernière question, si je la comprends bien, la situation ne devrait pas se produire. En réalité, l'application exploitant le jeux de LD est sensée exposer à travers son interface usager uniquement des resources qui font partie de ce jeux. Les "requêtes utilisateur" sont donc des simples clicks sur des pages Web générées à la volé depuis les données LD.

      Delete
  2. My question for Professor Valtchev is whether it is necessary for the search elements to be queried in a specific order for the data mining algorithms to function correctly, or whether the algorithms can identify and match patterns to queries regardless of the actual sequence of queries.

    ReplyDelete
    Replies
    1. I would think the algorithm could deduce a group from a sub-group. For example, if someone looks up the city Montreal, I believe these types of algorithms can assign Canada as the country.

      Delete
    2. If I got the question right, it is about the order of the requests in a user session, i.e., if the original sequence looks like {Canada, Montreal, ...} then should the algorithm first look for types of the Canada resource or it can start at Montreal?

      If this was the question, then the answer is yes, the order matters. From computational point of view, the matching problem is akin to sub-graph isomorphism (computationally expensive). Therefore, the temporal order between requests is exploited by the matching algorithm to avoid unnecessary compatibility checks. Roughly speaking, the order provide a dimension in the matching in which there is a starting point (can be both the beginning and the end of the sequence) and a end point. The matching algorithm task is then to "move" gradually with individual compatibilities (remember the blue arrows) until it either finds a complete matching or establishes that no such matching can be found (failure).
      With plain graphs, there is no such dimension :-(
      BTW, this is the reason for speaking about sequence patterns in the papers.

      Two final remarks: the moves I refer to above are done by the matching algorithm within a single sequence. They amount to moving the focus of matching from one resource (request) to the next one in the sequence. Such moves should not be mixed with the traversal of the pattern space where the overall mining method also moves, but this time from a more general to a more specific pattern (such a move is an application of one refinement operator).

      Independently, that order in the sequence may or may not be important in the application: Thus, there might be situations where two types of resources appear in arbitrary order. For instance, in my example (Museum, Hotel) might be just as frequent as (Hotel, Museum). In such a case, it would be more appropriate to look for sets of the flavour {Museum, Hotel}, i.e., disregard the order. Yet this brings us back to the complexity argument since the resulting patterns would be unconstrained graphs.

      Delete
    3. @Robert Thibault

      Not sure to see how the issue of moving from a sub-group to the group relates to the order in the sequence. But I might be missing the intended meaning of both terms.

      As to the automated discovery of links that are justified yet missing in a particular LD dataset, this is a separate issue from discovering patterns. In fact, the algorithm I sketched during the talk relied only on existing links.

      In one of the papers nevertheless, we did mention that the discovered patterns may be analyzed in order to find pairs of constantly co-occurring classes that are NOT related by a property in the ontology. We tend to see such a situation as a hint that there may be a property missing from the ontology and connecting those classes.

      Of course, suggesting a semantic relationship between classes based on frequent co-occurrences is not the same thing as discovering that there must be a link of an existing type between two individual resources.

      The only realistic way I can imagine for automating the link discovery is through the linguistic analysis of a corpus where the fact corresponding to the link (e.g., (foo:Montreal foo:locatedIn foo:Canada)) is clearly stated and therefore can be retrieved.

      The other, less realistic manner would be to have a fully blown OWL ontology over the travel data in which the property foo:locatedIn is described as transitive one, i.e., having the owl:TransitiveProperty type. Then from (foo:Montreal foo:locatedIn foo:Quebec) and (foo:Quebec foo:locatedIn foo:Canada) the ontology reasoning engine could deduce (foo:Montreal foo:locatedIn foo:Canada).

      Delete
    4. Thank you for the clarification! Very interesting!

      Delete
  3. Dear Petko, Thanks for your nice presentation! Don't you think recommender systems is the ideal paradigm to follow for hybrid cognitive computing applications with machine learning and ontologies modeling? A few arguments: a recommender system can be an incentive social machine for personally adapted learning. Ontologies provide a clear conceptual logical structure to provide computational model with concepts for cognitive science. Machine Learning provide induction and validation.

    ReplyDelete
    Replies
    1. Broad and deep question, indeed, and very topical one, on top of that!

      There is an easy answer: YES! First, the recommender systems are the type of systems that are addressing the core problem of the Web, i.e., the explosion of information. Then, recommendation on a large scale is necessarily using machine learning (statistical). Purely symbolic analytics stuff does not easily scale. But, on the other hand, purely text analytics has limits. You can beat the NLP pitfall with statistics only to a point. Putting LD and some amount of ontological knowledge should help improve the topicality of the recommendation.

      Delete
  4. The ontologies RDF is the only way to the pattern mining? Do you know others? I am interest to know more about your approach. Excelent presentation. Thank you Professor PETKO VALTCHEV.

    ReplyDelete
    Replies
    1. There is a rich literature on pattern mining, from itemsets to labeled graphs. Ontologies (OWL) and schemas (RDFS) are not the standard way to describe data to be mined. And there is a small number of recent studies on how to mine RDF data while using the available domain knowledge (from and ontology/schema). Yet, it turns out that the "ontologies" in these studies are most of the time pure taxonomies of classes without any property, hence there are no links in the data.

      And I'll be glad to send you a list of references to those if you think that'll help. As to the classical pattern mining, there are many data mining textbooks you could use. And you are welcome in the INF7710 course as well (next edition, in W15).

      Delete
  5. Thanks Petko, for this wonderful talk, use the class hierarchy and relation could be really the good way to predict pattern evolution and matching.
    My question: what we have to do if the sequence is more distant than the general pattern that we seek!!! (We miss three or four components from expected sequence) ?

    ReplyDelete
    Replies
    1. In a realistic situation, every pattern will miss some of the elements from the sequence. And it makes sense since in many, if not all, sequence there will be spurious requests that the user made while exploring locally (e.g., which hotel to pick among a set of alternatives - the user my chose to request a subset of them, so these will appear in the sequence). The goal of abstracting a pattern is to keep the essential part of the matched sequences while leaving out details (noise). The underlying assumption is that the noise will be filtered out since different in the different sequences. Conversely, if local explorations tend to repeat themselves, i.e., become sort of regularity, there will be no way to distinguish it from the essential part of the sequences so it will, and rightly so, enter into the pattern structure. After all, we are looking for regularities...

      Delete
    2. Thank you, very clear and complete answer.

      Delete
  6. La FCA a la réputation d'être assez memory-intensive. Est-ce que ça peut être un facteur limitant pour des applications dans de grosses bases de données?

    ReplyDelete
    Replies
    1. Absolument!

      Il y a des problèmes avec le volume du résultat et, dans la version de FCA avec des constructeurs des logiques de description, aussi avec le temps de calcul. La solution semble être une fouille plus focalisée (ne pas explorer tout l'espace de motifs) doublée de métriques d'intérêt plus sélectives.

      La "bonne nouvelle" est que lorsque on peut définir un seuil de support pour ce type de motos (patterns en anglais), on pourra l'appliquer à travers tous les treillis qu'on construit de façon simultanée (par exemple, sur Usager et Destination). Il permet au moins de couper la partie inférieure de chaque treillis où les motifs ne sont pas suffisamment significatifs.

      Après, il y a toujours l'artillerie lourde, soit les architectures de calcul massif de type Map/Reduce.

      Delete
    2. À lire "ce type de motifs" et non "de moto" ci-dessus.

      Delete
  7. You said that we usually use only one type of links between data, but in the example that you presented with links as attractions and hostels is not two different links? Is it possible to structure a system with many types of links? As there is many possible links between data could we consider the ontology project as having the purpose to build them all? Is it possible to realize it if we take into account that links between items are also subjective?

    ReplyDelete
    Replies
    1. 1. I probably messed the explanation about unicity of link types. In fact, my intended point was to contrast current practices in mining graph data where only one type of links is assumed (e.g., PageRank works on hyperlinks between web pages) and sets of LD on the web where many types of links co-exist. What possibly blurred that point was that I only used a single type of links ("visited") in my second set of examples. In fact in both methods, usually several types of links will be admitted.

      2. The answer to your next question is yes. In fact, if LD is to be used in the system, it will be hard to limit the schema to a single type of links.

      3. Every ontology project usually has a well defined scope that limits the set of link types to be represented. This set is determined by the goal of the project.
      Now if we speak about all links from that set of types, then the answer is yes. All links that are known to correspond to established facts should be represented in the LD dataset. Here by facts I mean relationships between the entities modelled by the LD resources: e.g., it is known that Montreal is in Canada yet in a particular dataset, the corresponding resources foo:Montreal and foo:Canada might have no foo:locatedIn link.
      In fact, completeness of links is an ambitious goal is rarely achieved from the onset. That is why the Web of Data allows for easy extension: one --in fact anyone-- can add the triple (foo:Montreal foo:locatedIn foo:Canada) even after the encompassing LD dataset has been published.

      3. Very good question. It touches on a deep property of the web of documents and also of web of data. In fact, as I stated above, anyone can publish an absurd link between two existing LD resources e.g., (foo:Montreal foo:locatedIn foo:Russia), just as one can publish a web page with outrageously unjustified claims. After some amount of time, such a link will end up in any LD dataset that comprises those resources.
      Now if you have an ontology describing the dataset and if its class and properties are well specified (e.g., using the OWL language constructs), you might notice the absurdity of the link. Indeed, chances are that the link would violate one or more of the constraints from the ontology. But if you only have a schema for the data (like the ones I showed during my talk), then the link may remain unnoticed since schema do not provide constraints.
      So yes, there will be (and probably are) links on the web of data that are not consensual. Therefore, the issue of trust (in a source of LD) is important. By retrieving data from trusted source, one should be able to avoid the above situation (or at least this is the official version thereof ;-)

      Delete
    2. Thank you so much for your answer!

      Delete
    3. An add-on to the above answer.

      You probably noticed that Prof. Han presented at least one graph mining method which mixes several types of links, e.g., paper author, scientific venue, subject. And that was what I meant by saying that very recently, the data mining community has started recognizing the multiplicity of the link types.

      On the other hand, what probably went under the radar here is that the multiple links are pre-processed by hand (e.g., in path selection for the similarity) to avoid exploring the entire set of graph paths.

      This is to be contrasted to the method I presented which is completely generic w.r.t. the ontology: it will work the same way with a genetics ontology or an ontology of MOOCS.

      Delete
  8. To what extent could we make an analogy between the semantic web and symbol grounding in humans? In humans, symbol grounding is the ability to pick out a referent in the real world. With the semantic web, we're giving machines the ability to pick out informational entities on the web (and deduce relations between these entities), by giving these entities semantic meaning which is "understandable" by machines. For example, humans have the ability to pick out, say, a museum in the real world, and conversely, thanks to linked data, a machine will be able to identify a museum on the web of data. The machine still isn't able to identify a museum in the real world, but if we transposed our real world into a world of data, then suddenly the machine is able to identify a museum within this data world.

    ReplyDelete
    Replies
    1. First, let me apologize for the late answer, I wanted to take time to think about the right way to answer this otherwise very interesting question.

      So the immediate answer is yes, to a significant extent, the mechanism of typing LD resources on the Web is analogous to using conceptual labels by humans in the extra-Web world. And even without a clear indication of a type, e.g., a resource representing a real museum could miss the type indication (remember the Web of Data is inherently incomplete), the machine could still be able to infer that such a type label holds. There are two different ways to do this, both amounting to analyzing the LD graph in the vicinity of that resource node. Indeed, you may either find a most plausible set of types by data mining (inductive), or deduce it from the knowledge provided in the schema (deductive).

      Where the analogy stops is the scope of the semantics we put into a LD type. In fact, the goal, at least so far, is not to provide the entire amount of semantics a human would put into a concept (to say). My understanding is that currently we are after a more modest objective which is to make the machines process the information on the Web in a more sensible -- to avoid the controversial "intelligent" -- manner. For instance, to tell apart homonymous entities in the user search queries on the Web (like Queen the rock band from the other meanings of the term). And for this, only a limited amount of semantics is enough.

      Delete
  9. I found the talk pretty informative; I was finally able to understand clearly the concepts of URIs, data and pattern mining. If we did an analogy between words and uris, is there a way to make uri's flexible in mening equivalence and recognize synonyms (entities that mean exactly the same thing and may be linked to the same data eventhough they are different itemset)? does that exist in data mining or can that be codified?

    ReplyDelete
    Replies
    1. A spot-on question for a talk that was intended to bridge between data mining and the Web of Data (WoD)! In fact, there are two separate questions.

      First, the URIs have been made immutable in the sense that when you give a URI to a resource you can no more change it. Yet there is a way to express synonymy on the WoD, and it is a quite fundamental one. Actually, there is no way to prevent people to create and name -- with a URI -- a new resource that represent the same entity another resource represents already. Indeed, this is a direct consequence of the distributed nature of the web: no central authority is there to tell to a LD producer that a particular resource she just created refers to an entity that has already been represented as a resource on the WoD. This is quite of a headache :-(

      The current way out is not to remove the newer resource but rather to add a triple to the WoD saying that both resources are equivalent (the exact primitive is owl:sameAs). In fact, there is a whole branch of the research on WoD that is busy designing methods for the discovery of plausible owl:sameAs links between resources of two independently created LD datasets.

      A final remark on this point is that while synonymy is inherent to the WoD, there is no such thing as a homonymy in the realm of URIs. According to the basic principles of the WoD, resolving a resource URI (e.g., in a web browser window) leads to the retrieval of useful information about the resource, there is no way that two different people could put descriptions of two different resources on the same URI. By chance, it might happen that somebody reinvents an URI that already exist (although it is advised to check an URI before starting the publishing process). Yet there is no way to replace or even double the existing resource: the only effect of publishing triples with this URI having in mind the "new" resource would be to add these triples to the previously existing ones that involve the same URI.
      As a good practice, one is advised to only use URIs for which one has access to the underlying server (named in the middle part of the URI, between 'http://' and the next '/') in order to be able to store the 'useful information' I mentioned before. That information can be machine-oriented (RDF triples) or human-oriented (HTML document).

      Delete
    2. Part TWO

      Now patterns are a different story. While they could be published on the WoD under their own URIs, usually this is not the case. Patterns have a more off-line usage, even if we might discover them from web data. Therefore, there is no point in naming them.

      On the other hand, patterns have their own 'semantics', which is roughly approximated by the set of data records from the input dataset that match the pattern. In fact, a more rigorous approach would require for all possible data records -- even those not in the input datasets -- matching the pattern to be included in its 'semantic'. By staying with the approximate semantics, it is very common for two patterns that are structurally different to have the same sets of matching data records. So you are perfectly right in asking that question.

      How is that codified? The typical pattern miners would not care, but there is a distinct class of methods that target what is called closed pattern and they are the answer to your question. In short, in the pattern space that I showed in the beginning of my talk, whenever you consider the above 'semantics' of the patterns, you get to a partition of the set of all patterns into disjoint groups (aka equivalence classes). Within a group, all the patterns share the same 'semantic' and hence are, in a sense, equivalent. Here equivalence is _relative to the input dataset_ that was used to produce the patterns (but is invariant to the frequency threshold!). For itemsets and for a row of more structured patterns, in each such group there is a maximal pattern w.r.t. structural complexity (e.g., larger itemset) and it is called the _closed_ pattern. For graph patterns, it is a bit trickier, since such classes may have more than one maximal patterns.

      Since there is no additional information in non-closed patterns w.r.t. their equivalent closed pattern, some mining algorithms will only compute the closed patterns and thus reduce the size of the output without losing information. Reducing the output of a mining task is a big issue in pattern mining since getting thousands of patterns with no clear criterion for filtering is not particularly helpful.

      BTW, FCA that I only scratched during the talk, is all about closed patterns. For instance, on p.18, the concept 2 has the pattern {LVR, NGR} (in its intent) whose 'semantics' is given by the extent {Ann, Ben}. If you look at the data, you'll see that another pattern, {NGR} has the same semantics. {NGR} is hence equivalent to {LVR, NGR}, yet a strictly smaller one. Since it is not closed (only maximas in a class are), it is not in the concept lattice. Another example of non closed is {LSX, NGR} whose closed pattern is {LSX, LVR, NGR} (both of semantic {Ben}).

      Delete
    3. This question maybe misguided, but combining the last two comments, with regard to synonymity. If it were the case that someone created a second resource that referred to the same entity, are you meaning to suggest that two, say, copies of the same picture of Barack Obama exist because someone has created a second resource of something that was already there (i.e., the first picture of barack obama)? and that the LD is slightly different? My question gets at the symbol grounding question, in that, how do you know that the two resources actually refer to the exact same picture of Barack Obama? How do you know they aren't slightly different?

      Delete
  10. This comment has been removed by the author.

    ReplyDelete
    Replies
    1. This comment has been removed by the author.

      Delete