Sunday 8 June 2014

Applying Data Mining to Real-Life Crime Investigation

BENJAMIN FUNG 
McGill University
School of Information Studies

Overview: Data mining has demonstrated considerable success in many domains, from direct marketing to bioinformatics. Yet limited research has been conducted to leverage the power of data mining in real-life crime investigation. In this presentation, I will discuss two data mining methods for crime investigation, with a live software demonstration. The first method aims at identifying the true author of an anonymous e-mail. The second method is a subject-based search engine that can help investigators retrieve criminal information from a large collection of textual documents.
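
As a purely illustrative aside, the sketch below shows the general idea of subject-based retrieval over a document collection using a plain bag-of-words cosine similarity. It is not the method from the Dagher & Fung (2013) paper listed under the readings; the document strings and the subject query are made-up examples.

```python
# Minimal sketch of subject-based retrieval: rank documents by cosine
# similarity between their bag-of-words vectors and a subject query.
# Illustration only, not the method from Dagher & Fung (2013).
import math
import re
from collections import Counter

def vectorize(text):
    # Bag-of-words term counts; a real system would add TF-IDF weighting,
    # stemming, and a richer concept representation.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_subject(documents, subject):
    """Return the documents sorted by decreasing similarity to the subject."""
    query = vectorize(subject)
    return sorted(documents, key=lambda d: cosine(vectorize(d), query), reverse=True)

# Made-up example collection and subject description.
docs = [
    "meeting about the shipment on friday",
    "happy birthday, see you at the party",
    "the shipment arrives at the port on friday night",
]
print(rank_by_subject(docs, "shipment port friday"))
```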

READINGS:
    Non-technical readings and audio clip:
     New York Times: Decoding Your E-Mail Personality
     CJAD: The Aaron Rand Show (Radio) on Cybercrime Investigation

    Technical readings:
     Iqbal, F., H. Binsalleeh, B. C. M. Fung, and M. Debbabi (2013) A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences (INS): Special Issue on Data Mining for Information Security, 231: 98-112.
     Dagher, G. G. and B. C. M. Fung (2013) Subject-based semantic document clustering for digital forensic investigations. Data & Knowledge Engineering (DKE), 86: 224-241.

29 comments:

  1. Can you always assume that you will have N subjects from which to pick?

    And is it not the case that, for any N subjects, there will always be a most likely one, even though that does not mean any of them is guilty...

    Replies
    1. In our problem definition, we assume that there are N candidates.

      In addition to identifying the most plausible author, our method also estimates the confidence level. One of the factors in calculating the confidence level is the difference between the score of the most plausible author and the score of the runner-up. If the difference is small, then the confidence level is low. This serves as a warning to the user; it is up to the user to decide whether to trust the result.
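
      As a rough illustration of the score-gap idea described above, here is a minimal sketch; the actual AuthorMiner scoring and confidence formulas are not reproduced here, and the normalisation used below is an assumption for the example.

```python
# Hypothetical sketch: derive a confidence value from the gap between the
# top two candidate scores. Not the actual AuthorMiner formula.

def rank_with_confidence(scores):
    """Return (most plausible author, confidence) from an {author: score} dict."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_author, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    # A small gap between the winner and the runner-up means low confidence.
    confidence = (top_score - runner_up_score) / max(abs(top_score), 1e-9)
    return top_author, confidence

# Example: Alice barely beats Bob, so the confidence is low.
print(rank_with_confidence({"alice": 0.81, "bob": 0.79, "carol": 0.55}))
```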

  2. Dear Benjamin, thank you very much for your presentation and your humour! I have one question: for your application AuthorMiner 3.0, why don't you use the conditional random field model? It has been used as a state-of-the-art technique with high performance for extraction, labelling, classification... Is it not suited to this task?

  3. 80-89% seems too uncertain for use in court.

    A criminal could also easily fake a writeprint. If they fake a somewhat typical writing style, someone innocent may be judged guilty. The criminal could switch writeprints for each crime and again for his personal life. Are author mining tools strong enough to see beyond someone faking a writeprint? Then again, some criminals surely won't consider this technology.

    Replies
    1. I agree with you, Robert: criminals can find a trick to escape if they know how it works… It's not as strong as a biological fingerprint, but it can be used to clarify some criminal cases.

    2. Given two candidates, our method can achieve up to 98% accuracy. Still, that is not good enough for criminal cases. Our method may be useful in the following scenarios:

      - Civil cases (which have a lower evidentiary requirement)
      - The intelligence-gathering stage. The result of our method may lead the investigator to look further in a certain direction during intelligence gathering; the result need not be used as presentable evidence.

  4. Thank you for the interesting talk. These techniques seem very powerful for identifying authorship. My worry is that such techniques, if they become widely employed by crime-fighting units and eventually the larger public, might be used by cybercriminals to frame innocent people. My question for Professor Fung is whether or not this problem has been considered, and what possible recourse could be taken in such cases?

    Replies
    1. This is an interesting question, but honestly I cannot do much. It is similar to the situation where a company has created a very useful Swiss Army knife; the company cannot really control how people will use it. However, your question raises another interesting one: can AuthorMiner differentiate the true author from an impersonator? We haven't tried.

  5. How does AuthorMiner differ from a support vector machine?

    Also, how does prediction accuracy correlate with number of emails from each author?

    Replies
    1. One of the slides answered my second question to an extent.
      However, the graphs ranged from 50 to 100 pieces of writing analyzed.
      What happens to accuracy when analyzing between 1 and 50 pieces of writing?

    2. The accuracy will drop as the number of samples decreases. Yet, in practice, it is not too difficult to collect >50 e-mails from a person, especially if the investigator has obtained a search warrant.

  6. In the context of cybercrime, it is easy to use a tool that formats an e-mail to make it generic (for example, Antidote can remove unnecessary spaces, apply spelling and grammar correction, etc.). The criminal can write his e-mails in his usual way, but when he commits a crime, he can simply run them through the formatter. Even with such tools, would your detector still recognize that the e-mails come from the same author?

  7. I was surprised to learn that writing can differ so much from one person to another and that it is possible to extract so many stylometric features.
    Is it possible to use your application to imitate someone's stylometric features in order to incriminate them?

    Replies
    1. In some cases, the courts and law enforcement agencies do hire linguistic (writing style) experts to serve as expert witnesses. We do not claim that our method can replace these experts, but it can improve their efficiency. In any case, the results generated by our tool should be verified by a human expert, so the user is still the one making the final decision.

      In addition to identifying the most plausible author, our tool can also estimate the confidence of the result. If the confidence level is low, then the user may not want to use the result.

  8. I agree with Robby's comment that writeprints may be easy to duplicate (especially if someone is intentionally concealing his or her identity). Derek Ruths and Jennifer Golbeck both talked about how someone's online profiles tell much more about a person than he or she knows. In the same way that Professor Ruths looked at "happy birthday" tweets to determine a user's age, perhaps there are creative ways to find predictive qualities and extract information.

    The inference engine looks similar to the talks we've had about the semantic web.

    Replies
    1. Perhaps we could determine the writeprint of emails the criminal receives to learn about them and their social or criminal network.

    2. These could be nice research directions for my future students.

  9. Impressive piece of software. Is it available somewhere for download (:D)? In general, I think that broader availability of these tools to the larger public (not just the NSA and police) may solve at least some of the related ethical problems.

    Replies
    1. We are preparing a journal paper. The full algorithm will be given in the paper.

      I agree with many attendees. This tool is a double-edged sword. It really depends on how one uses it.

      In the recent past, I have received numerous requests for authorship analysis. Most of them come from victims of persistent harassment via anonymous e-mails. This tool helps them identify (with a certain confidence level) the most plausible author among the candidates in the victim's mind.

  10. This is the first time I have heard about the writeprint concept. It's a great concept.
    I would like to know more about your methods for extracting rules, extracting communities, and extracting patterns. Do you think that would be possible? Thank you, Benjamin Fung.

    Replies
    1. AuthorMiner 1.0 is described in this journal article:

      F. Iqbal, H. Binsalleeh, B. C. M. Fung, and M. Debbabi. A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences (INS): Special Issue on Data Mining for Information Security, 231:98-112, May 2013. Elsevier.


      We are preparing another article for AuthorMiner 3.0, which is much more accurate than 1.0. It is under revision. Stay tuned!

  11. Maybe just another way of putting my question: in cognitive science and similar fields, you have things like the "brain effect" or the "math effect" whereby credibility of an article is heightened by brain images or math. I feel a similar bias could come from using computer science tools in investigation or court contexts. Do you think it would be a good idea to package tools like AuthorMiner with explanations about the epistemological value of the data? (what measures mean, what they reflect about the probability of someone being the author of something, etc)

    Replies
    1. Sorry, I don't quite get what "credibility of an article is heightened by brain images or math" means. Would you please elaborate? Thank you.

    2. People have done it better than I could, so I'll post this link instead:
      http://scienceblogs.com/cognitivedaily/2008/06/04/whats-more-convincing-than-tal/

      Basically, if there's a brain image in an article, people simply lend it more credibility.

      Some argue there's a similar effect when including math equations in some fields of social science.

  12. I wonder whether authorship would be easier or harder to detect in different types of writing. For example, across e-mails, academic journal articles, essays, and novels, which are the easiest types of works for detecting authorship?

    Replies
    1. Our focus so far is on e-mails. However, we believe that authorship analysis on academic journal articles, essays, and novels is easier than on e-mails.

  13. I guess the same principles that are encoded into this software could be used to create automatic or semi-automatic software tools to hide the identity of the writer. There is already a precedent: criminals using letters cut from newspapers and magazines to compose messages that cannot be traced back to handwriting or a single typewriter.

    Thanks for a fascinating exposition.

    Replies
    1. We have employed a very large number of stylometric features, from the number of question marks to vocabulary richness, from character n-grams to part-of-speech tags (see the sketch below). It is possible that an author could use our tool to first identify his own writeprint and then hide it. Yet another (probably less identifiable) writeprint will emerge, and then he can hide that one again. By repeating this process, he can probably hide his writing style. This is possible, but it is also very difficult.

      This is a nice research problem though.
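
      For readers unfamiliar with stylometric features, the sketch below computes a few of the simpler ones mentioned above (question-mark frequency, vocabulary richness, character n-grams). It is a toy illustration, not the AuthorMiner feature set, which is described in the Iqbal et al. (2013) paper.

```python
# Toy illustration of a few simple stylometric features ("writeprint"
# ingredients). The real AuthorMiner feature set is far richer.
import re
from collections import Counter

def stylometric_features(text):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return {
        # Punctuation habit: question marks per 100 characters.
        "question_marks_per_100_chars": 100 * text.count("?") / max(len(text), 1),
        # Vocabulary richness: type-token ratio.
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        # Most frequent character trigrams.
        "top_trigrams": Counter(text[i:i + 3] for i in range(len(text) - 2)).most_common(5),
    }

print(stylometric_features("Why would anyone write like this? Who knows?"))
```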

  14. Questions relating to potential misuses of this technology were raised in the commentary. If one can be identified by the statistics of stylometry, one can just as well be fraudulently implicated in a crime. The tool Anonymouth is designed to randomize the stylometric statistics of a text so as to hide the identity of the author. Similarly, an intelligent criminal could compute the statistics of someone else and translate his own writing into that other persona. I think the use of this technology should not be considered fool-proof. Secondly, I can also imagine a way to further disambiguate between potential authors: the topic of the writing should be consistent with the history of the individual's interests (i.e., if I'm writing about open access, I've likely Googled what others have written on it in the past) and also consistent with the known goals and intentions of the individual. For example, any self-incriminating information is clearly contrary to an individual's goals and intentions and should hence be treated as less certain.
