Sunday, 8 June 2014

Knowledge Mining in Heterogeneous Information Networks

University of Illinois at Urbana-Champaign
Department of Computer Science 


OVERVIEW: People and informational objects are interconnected, forming gigantic, integrated information networks.  By structuring these data objects into multiple types, such networks become semi-structured heterogeneous information networks.  Most real-world applications that handle big data, including interconnected social media and social networks, medical information systems, online e-commerce systems, and database systems, can be structured into typed, semi-structured, heterogeneous information networks.  For example, in a medical care network, objects of multiple types (such as patients, doctors, diseases, and medications) and links (such as visits, diagnoses, and treatments) are intertwined, providing rich information and forming heterogeneous information networks.  Effective construction, exploration and analysis of large-scale heterogeneous information networks pose an interesting but critical challenge.
In this talk, we present principles, methodologies and algorithms for mining heterogeneous social and information networks and show that mining typed, heterogeneous networks is a promising research frontier in data mining.  Departing from many existing network models that view data as homogeneous graphs or networks, the semi-structured heterogeneous information network model leverages the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data.  This heterogeneous network modeling will lead to the discovery of a set of new principles and methodologies for mining and exploring interconnected data, such as rank-based clustering and classification, meta path-based similarity search, and meta path-based link/relationship prediction.  We will also discuss our recent progress on the construction of quality semi-structured heterogeneous information networks from unstructured data and point out some promising research directions.

    Yizhou Sun and Jiawei Han (2012) Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers
    Chi Wang, Marina Danilevsky, Jialu Liu, Nihit Desai, Heng Ji, and Jiawei Han, Constructing Topical Hierarchies in Heterogeneous Information Networks, Proc. 2013 IEEE Int. Conf. on Data Mining (ICDM'13), Dallas, TX, Dec. 2013
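
To make "meta path-based similarity search" more concrete, here is a minimal sketch of the PathSim measure from Sun and Han's work, assuming a symmetric Author-Venue-Author meta path. PathSim scores two objects x and y as 2 * (number of path instances between x and y) / (path instances from x to itself + path instances from y to itself). The author names, venue names, paper counts, and function names below are all invented for illustration; this is not the speaker's implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: how many papers each author published at each venue.
# Every name and count here is invented for illustration.
PAPER_COUNTS: Dict[str, Dict[str, int]] = {
    "prolific_author": {"VenueA": 50, "VenueB": 20},
    "newcomer": {"VenueA": 2, "VenueB": 1},
    "peer": {"VenueA": 3, "VenueB": 1},
}


def path_count(counts: Dict[str, Dict[str, int]], x: str, y: str) -> int:
    """Count Author-Venue-Author path instances linking authors x and y."""
    return sum(n * counts[y].get(venue, 0) for venue, n in counts[x].items())


def path_sim(counts: Dict[str, Dict[str, int]], x: str, y: str) -> float:
    """PathSim: path instances between x and y, normalized by each
    author's path instances back to themselves."""
    denominator = path_count(counts, x, x) + path_count(counts, y, y)
    if denominator == 0:
        return 0.0  # neither author has any venue links
    return 2 * path_count(counts, x, y) / denominator


def most_similar(
    counts: Dict[str, Dict[str, int]], x: str
) -> List[Tuple[str, float]]:
    """Rank every other author by PathSim to x, highest score first."""
    scores = [(y, path_sim(counts, x, y)) for y in counts if y != x]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With this toy data, the raw path count from "newcomer" to "prolific_author" (120) dwarfs the count to "peer" (7), yet PathSim ranks "peer" first (about 0.93 versus about 0.08) because the normalization penalizes a mismatch in overall visibility. That is the sense in which the measure favors similar peers over merely popular objects.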


  1. Jim Hendler mentioned he wished that Google wouldn't just give him the most popular sites on a particular topic. Professor Han's PathSim works exceptionally well for long-tail networks: this algorithm would show similar small pages, not just the most popular ones about a subject. Perhaps Google could offer a PathSim search in addition to their regular one (which seems more similar to P-PageRank)...

  2. How do you tell the system to balance coverage and completeness? By the coverage criterion, the two-word phrase scored better than the three-word phrase, but by the completeness criterion the three-word phrase scored better than the two-word phrase. As a human, I know that "vector machine" almost always refers to SVM, but how do you tell the system to make that decision without training or human input?

  3. Thank you for the talk!
    I am impressed by the predictive power of Professor Han’s techniques, especially their capacity to reliably predict future co-authors. My question for Professor Han is whether he thinks these techniques might have use for networking among scholars. What I have in mind is, for instance, using predictive analysis to see which authors I will probably collaborate with in the future, and then proactively contacting those authors and seeking collaboration opportunities.

  4. To predict what paper someone is going to write in the next year, you need to presuppose that there is a logical structure or link between topics. So you can predict what is usual, but it seems to me that it’s impossible to predict something new. Often what is new consists in linking things that don’t usually go together.

    1. In my opinion, it’s not impossible to predict something new by using the links between topics or the historical record. It is still prediction; the only thing that changes is the prediction's power.

  5. Très intéressant. Lorsqu'on utilise des statistiques pour résoudre des problèmes, on peut obtenir des résultats relativement facilement. Le problème, quand on n'a pas les résultats attendus, c'est beaucoup de travail à comprendre pourquoi nos résultats ne sont pas bons; surtout quand on travaille sur de très larges corpus en NLP.

    Translation :
    Very interesting. When we use statistics to solve information retrieval problems, we can get results relatively easily. The problem is that when we don’t get the expected results, it takes a lot of work to understand why they are not good, especially when working with very large corpora in NLP.

  6. Dear Jiawei, thank you very much for your great presentation! You presented a lot of successful studies! What resists data mining? :) I have a question: you cited the conditional random fields model, which is frequently cited as a state-of-the-art model for knowledge extraction. What are its limits? What can we not expect from it? Is it the perfect model to combine with ontology-based data mining?

  7. Could you give more details about your comparison between ranking and the Markov model? Thank you, Jiawei Han.

  8. I was thinking about how the presented methods and technologies could help the Semantic Web. One example is label propagation, where information from a small portion of labelled entities can be propagated to other entities via their connections in the network. It seems that this could be used in constructing the Semantic Web. Jim Hendler said yesterday that around 20% of web sites are already labelled with RDF metadata. Maybe this existing metadata can be propagated to unlabelled sites using these algorithms.

  9. Thanks for a fascinating and very interesting talk. These techniques seem to open many future possibilities for making sense of our complex world. What is particularly impressive is the unsupervised learning being demonstrated. Could you recommend a reading list for someone who knows almost nothing and wants to get educated in the discipline you presented? What is the connection between the presented techniques and the rising discipline of deep learning?

  10. Coming back to KERT: the keywords obtained were evidently most interesting with the full KERT method, but that's just our eyes. Is there any measure that can confirm it? Has it been tested on other corpora?

    Thanks a lot, that was very impressive.

  11. Thanks for the intriguing talk Dr. Han.

    Is online shopping an example of a heterogeneous network? If I am looking at an item that John also looked at, the webpage may suggest I look at another item that John looked at. This network goes from Shopper (me)-->Item-->Shopper (John) and then guesses what other items I will browse based on John’s behaviour, and provides a link to facilitate this action.

    In this case, the output from analyzing the heterogeneous network influences the development of the network itself. This is reminiscent of PageRank and search engine optimization.

  12. I asked my last question wrong – I presupposed you knew what I was thinking about in the heat of action.

    What I meant by the extended cognition question was rather something of this sort: what kind of interaction do you have with your data in the discovery phase? Do you come up with hypotheses right from the start, or do you stumble on those hypotheses as you tinker with datasets?

  13. Measures of similarity for information networks and semantic clustering are reminiscent of a mental process (i.e., categorization). Both take heterogeneous, multi-dimensional information and collapse it into a lower-dimensional subspace so that a threshold can distinguish between objects inside and outside the category.