Sunday 8 June 2014

Applying Data Mining to Real-Life Crime Investigation

BENJAMIN FUNG 
McGill University
School of Information Studies

Overview: Data mining has demonstrated considerable success in many domains, from direct marketing to bioinformatics. Yet limited research has been conducted to leverage the power of data mining in real-life crime investigation. In this presentation, I will discuss two data mining methods for crime investigation, with a live software demonstration. The first method aims at identifying the true author of an anonymous e-mail. The second method is a subject-based search engine that can help investigators retrieve criminal information from a large collection of textual documents.
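
As a purely illustrative aside, the sketch below shows the general idea of subject-based retrieval over a document collection using a plain bag-of-words cosine similarity. It is not the method from the Dagher & Fung (2013) paper listed under the readings; the document strings and the subject query are made-up examples.

```python
# Minimal sketch of subject-based retrieval: rank documents by cosine
# similarity between their bag-of-words vectors and a subject query.
# Illustration only, not the method from Dagher & Fung (2013).
import math
import re
from collections import Counter

def vectorize(text):
    # Bag-of-words term counts; a real system would add TF-IDF weighting,
    # stemming, and a richer concept representation.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_subject(documents, subject):
    """Return the documents sorted by decreasing similarity to the subject."""
    query = vectorize(subject)
    return sorted(documents, key=lambda d: cosine(vectorize(d), query), reverse=True)

# Made-up example collection and subject description.
docs = [
    "meeting about the shipment on friday",
    "happy birthday, see you at the party",
    "the shipment arrives at the port on friday night",
]
print(rank_by_subject(docs, "shipment port friday"))
```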

READINGS:
    Non-technical readings and audio clip:
     New York Times: Decoding Your E-Mail Personality
     CJAD: The Aaron Rand Show (Radio) on Cybercrime Investigation

    Technical readings:
     Iqbal, F., H. Binsalleeh, B. C. M. Fung, and M. Debbabi (2013) A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences (INS): Special Issue on Data Mining for Information Security, 231: 98-112.
     Dagher, G. G. and B. C. M. Fung (2013) Subject-based semantic document clustering for digital forensic investigations. Data & Knowledge Engineering (DKE), 86: 224-241.

29 comments:

  1. Can you always assume that you will have N subjects from which to pick?

    And is it not the case that, for any N subjects, there will always be a most likely one, even though that does not mean any of them is guilty...

    Replies
    1. In our problem definition, we assume that there are N candidates.

      In addition to identifying the most plausible author, our method also estimates the confidence level. One of the factors in calculating the confidence level is the difference between the score of the most plausible author and the score of the runner-up. If the difference is small, then the confidence level is low. This serves as a warning to the user; it is up to the user to decide whether to trust the result.
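
      As a rough illustration of the score-gap idea described above, here is a minimal sketch; the actual AuthorMiner scoring and confidence formulas are not reproduced here, and the normalisation used below is an assumption for the example.

```python
# Hypothetical sketch: derive a confidence value from the gap between the
# top two candidate scores. Not the actual AuthorMiner formula.

def rank_with_confidence(scores):
    """Return (most plausible author, confidence) from an {author: score} dict."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_author, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    # A small gap between the winner and the runner-up means low confidence.
    confidence = (top_score - runner_up_score) / max(abs(top_score), 1e-9)
    return top_author, confidence

# Example: Alice barely beats Bob, so the confidence is low.
print(rank_with_confidence({"alice": 0.81, "bob": 0.79, "carol": 0.55}))
```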

  2. Dear Benjamin, thank you very much for your presentation and your humour! I have one question: for your application AuthorMiner 3.0, why don't you use the conditional random field model? It has been used as a state-of-the-art technique with high performance for extraction, labelling, classification... Is it not suited to this task?

  3. 80-89% seems too uncertain for use in court.

    A criminal could also easily fake a writeprint. If they fake a somewhat typical writing style, someone innocent may be judged guilty. The criminal could switch writeprints for each crime and again for his personal life. Are author mining tools strong enough to see beyond someone faking a writeprint? Then again, some criminals surely won't consider this technology.

    Replies
    1. I agree with you, Robert: criminals can find a trick to escape if they know how it works… It's not as strong as a biological fingerprint, but it can be used to clarify some criminal cases.

    2. Given two candidates, our method can achieve up to 98% accuracy. Still, that is not good enough for criminal cases. Our method may be useful in the following scenarios:

      - Civil cases (which have a lower evidentiary requirement)
      - The intelligence-gathering stage. The result of our method may lead the investigator to look further in a certain direction during intelligence gathering; the result need not be used as presentable evidence.

  4. Thank you for the interesting talk. These techniques seem very powerful for identifying authorship. My worry is that such techniques, if they become widely employed by crime-fighting units and eventually the larger public, might be used by cybercriminals to frame innocent people. My question for Professor Fung is whether or not this problem has been considered, and what possible recourse could be taken in such cases?

    Replies
    1. This is an interesting question, but honestly I cannot do much. It is similar to the situation where a company has created a very useful Swiss Army knife; the company cannot really control how people will use it. However, your question raises another interesting one: can AuthorMiner differentiate the true author from an impersonator? We haven't tried.

  5. How does AuthorMiner differ from a support vector machine?

    Also, how does prediction accuracy correlate with number of emails from each author?

    Replies
    1. One of the slides answered my second question to an extent.
      However, the graphs ranged from 50 to 100 pieces of writing analyzed.
      What happens to accuracy when analyzing between 1 and 50 pieces of writing?

    2. The accuracy will drop as the number of samples decreases. Yet, in practice, it is not too difficult to collect >50 e-mails from a person, especially if the investigator has obtained a search warrant.

  6. In the context of cybercrime, it is easy to use a tool that formats an e-mail to make it generic (for example, Antidote can remove unnecessary spaces, apply spelling and grammar correction, etc.). The criminal can write his e-mails in his usual way, but when he commits a crime, he can simply run them through the formatter. Even with such tools, would your detector still recognize that the e-mails come from the same author?

  7. I was surprised to learn that writing can differ so much from one person to another and that it is possible to extract so many stylometric features.
    Is it possible to use your application to imitate someone's stylometric features in order to incriminate them?

    Replies
    1. In some cases, the courts and law enforcement agencies do hire linguistic (writing style) experts to serve as expert witnesses. We do not claim that our method can replace these experts, but it can improve their efficiency. In any case, the results generated by our tool should be verified by a human expert, so the user is still the one making the final decision.

      In addition to identifying the most plausible author, our tool can also estimate the confidence of the result. If the confidence level is low, then the user may not want to use the result.

  8. I agree with Robby's comment that writeprints may be easy to duplicate (especially if someone is intentionally concealing his or her identity). Derek Ruths and Jennifer Golbeck both talked about how someone's online profiles tell much more about a person than he or she knows. In the same way that Professor Ruths looked at "happy birthday" tweets to determine a user's age, perhaps there are creative ways to find predictive qualities and extract information.

    The inference engine looks similar to the talks we've had about the semantic web.

    Replies
    1. Perhaps we could determine the writeprint of emails the criminal receives to learn about them and their social or criminal network.

    2. These could be nice research directions for my future students.

  9. Impressive piece of software. Is it available somewhere for download (:D)? In general, I think that broader availability of these tools to the larger public (not just the NSA and police) may solve at least some of the related ethical problems.

    Replies
    1. We are preparing a journal paper. The full algorithm will be given in the paper.

      I agree with many attendees. This tool is a double-edged sword. It really depends on how one uses it.

      In the recent past, I have received numerous requests for authorship analysis. Most of them come from victims of persistent harassment via anonymous e-mails. This tool helps them identify (with a certain confidence level) the most plausible author among the candidates in the victim's mind.

  10. This is the first time I have heard about the writeprint concept. It's a great concept.
    I would like to know more about your methods for extracting rules, extracting communities, and extracting patterns. Do you think that would be possible? Thank you, Benjamin Fung.

    Replies
    1. AuthorMiner 1.0 is described in this journal article:

      F. Iqbal, H. Binsalleeh, B. C. M. Fung, and M. Debbabi. A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences (INS): Special Issue on Data Mining for Information Security, 231:98-112, May 2013. Elsevier.


      We are preparing another article for AuthorMiner 3.0, which is much more accurate than 1.0. It is under revision. Stay tuned!

  11. Maybe just another way of putting my question: in cognitive science and similar fields, you have things like the "brain effect" or the "math effect" whereby credibility of an article is heightened by brain images or math. I feel a similar bias could come from using computer science tools in investigation or court contexts. Do you think it would be a good idea to package tools like AuthorMiner with explanations about the epistemological value of the data? (what measures mean, what they reflect about the probability of someone being the author of something, etc)

    Replies
    1. Sorry, I don't quite get what "credibility of an article is heightened by brain images or math" means. Would you please elaborate? Thank you.

    2. People have done it better than I could, so I'll post this link instead:
      http://scienceblogs.com/cognitivedaily/2008/06/04/whats-more-convincing-than-tal/

      Basically, if there's a brain image in an article, people simply lend it more credibility.

      Some argue there's a similar effect when including math equations in some fields of social science.

  12. I wonder whether authorship would be easier or harder to detect in different types of writing. For example, across e-mails, academic journal articles, essays, and novels, which are the easiest types of works for detecting authorship?

    Replies
    1. Our focus so far is on e-mails. However, we believe that authorship analysis on academic journal articles, essays, and novels is easier than on e-mails.

  13. I guess the same principles that are encoded into this software could be used to create automatic or semi-automatic software tools to hide the identity of the writer. There is already a precedent: criminals using letters cut from newspapers and magazines to compose messages that cannot be traced back to handwriting or a single typewriter.

    Thanks for a fascinating exposition.

    Replies
    1. We have employed a very large number of stylometric features, from the number of question marks to vocabulary richness, from character n-grams to part-of-speech tags (see the sketch below). It is possible that an author could use our tool to first identify his own writeprint and then hide it. Yet another (probably less identifiable) writeprint will emerge, and then he can hide that one again. By repeating this process, he can probably hide his writing style. This is possible, but it is also very difficult.

      This is a nice research problem though.
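
      For readers unfamiliar with stylometric features, the sketch below computes a few of the simpler ones mentioned above (question-mark frequency, vocabulary richness, character n-grams). It is a toy illustration, not the AuthorMiner feature set, which is described in the Iqbal et al. (2013) paper.

```python
# Toy illustration of a few simple stylometric features ("writeprint"
# ingredients). The real AuthorMiner feature set is far richer.
import re
from collections import Counter

def stylometric_features(text):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return {
        # Punctuation habit: question marks per 100 characters.
        "question_marks_per_100_chars": 100 * text.count("?") / max(len(text), 1),
        # Vocabulary richness: type-token ratio.
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        # Most frequent character trigrams.
        "top_trigrams": Counter(text[i:i + 3] for i in range(len(text) - 2)).most_common(5),
    }

print(stylometric_features("Why would anyone write like this? Who knows?"))
```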

  14. Questions relating to potential misuses of this technology were raised in the commentary. If one can be identified by the statistics of stylometry, one can just as well be fraudulently implicated in a crime. The tool Anonymouth is designed to randomize the stylometric statistics of a text so as to hide the identity of the author. Similarly, an intelligent criminal could compute the statistics of someone else and translate his own writing into that other persona. I think the use of this technology should not be considered fool-proof. Secondly, I can also imagine a way to further disambiguate between potential authors: the topic of the writing should be consistent with the history of the individual's interests (i.e., if I'm writing about open access, I've likely Googled what others have written on it in the past) and also consistent with the known goals and intentions of the individual. For example, any self-incriminating information is clearly contrary to an individual's goals and intentions and should hence be treated as less certain.
