DNA · ML and NLP for Investigation

INVESTIGATION • OPEN SOURCE INTELLIGENCE • EARLY WARNING

AI is important for the governance of communities, and intelligent systems (agents) can serve as crucial tools against crime. Monitoring, tracking and alerting about illegal activities require systems able to automatically discover and collect evidence from documents.

This allows automatically tracking suspected activities, searching for them in past archives, visualizing the aggregated information in meaningful forms and navigating this information ecosystem to support intelligent aggregation (knowledge) and analysis (decisions).

Technological Objective: Machine Learning driven document analysis and semantic search in the investigation domain
Reference User: DNA – Ministry of Internal Affairs
Project timeline: since 2014

Information Retrieval process

The automatic extraction of domain-specific information supports the creation of semantic metadata about concepts tied to investigation topics (e.g. events, locations and persons) and activities. This allows automatically tracking suspected activities, searching for them in past archives, visualizing the aggregated information in meaningful forms and navigating this information ecosystem to support intelligent aggregation (knowledge) and analysis (decisions).

Finding information of interest within the investigation domain corresponds to an Information Retrieval (IR) process that relies on Semantic Search tools operating over the information made available, implicitly or explicitly, by documents. This combines Natural Language Processing techniques with Machine Learning and Deep Learning paradigms.

EXAMPLES

Extracting the Implicit Knowledge

The domain under analysis is extracted from judicial acts and essentially describes the relations between “subjects” and “facts” observed in each act.

As an example, let us consider a legal domain where information about people (Physical Subjects) and events is implicitly reported in texts, such as the transcript of an interrogation in which a person declares information about facts or subjects of which they are aware:

“My name is Marroni Antonella. At that age I met a dance teacher in Rome, Arancioni Pino, now deceased, with whom I sold hash in the square in Aprilia selling it to boys my age.”

From the above example, it is possible to identify specific relationships between entities that are only implicit in the text. For example, it is possible to detect the connection between the sale of a drug, hashish, and two individuals, Marroni Antonella and Arancioni Pino. Moreover, both actors frequented a square in a specific place, the town of Aprilia.

In order to extract this implicit knowledge, a Relation Extraction engine is required to:

Detect Named Entities (NEs), i.e. mentions of specific kinds of entities relevant in the target domain, such as people, places or objects (here, drugs): for example “Marroni Antonella”, “Arancioni Pino”, “Rome”, “Aprilia”, “hash”.
Extract the relations holding between such entities that are useful in the target domain, such as the relation “Sell”, which holds between the entities “Arancioni Pino” and “hash”, or the relation “Knows”, which holds between Marroni Antonella and Arancioni Pino.
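
The two steps above can be illustrated with a minimal Python sketch of the structured output such an engine might produce for the example statement. Everything here is an assumption made for illustration: the `Mention` and `Relation` records, the type labels and the gazetteer lookup are hypothetical, and a dictionary lookup is only a toy stand-in for a learned entity recognizer. The relations are written by hand to mirror the ones discussed above.

```python
from dataclasses import dataclass

TEXT = (
    "My name is Marroni Antonella. At that age I met a dance teacher in Rome, "
    "Arancioni Pino, now deceased, with whom I sold hash in the square in "
    "Aprilia selling it to boys my age."
)

# Hypothetical gazetteer: surface form -> entity type (illustrative labels).
GAZETTEER = {
    "Marroni Antonella": "PhysicalSubject",
    "Arancioni Pino": "PhysicalSubject",
    "Rome": "Location",
    "Aprilia": "Location",
    "hash": "Drug",
}


@dataclass(frozen=True)
class Mention:
    text: str   # surface form found in the document
    type: str   # entity type assigned to the mention
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


@dataclass(frozen=True)
class Relation:
    type: str
    source: str
    target: str


def find_mentions(text, gazetteer):
    """Toy entity detection: exact substring lookup, sorted by position.

    Plain substring matching would also fire inside longer words; a real
    recognizer would use token boundaries and context.
    """
    mentions = []
    for surface, entity_type in gazetteer.items():
        start = text.find(surface)
        while start != -1:
            mentions.append(Mention(surface, entity_type, start, start + len(surface)))
            start = text.find(surface, start + 1)
    return sorted(mentions, key=lambda m: m.start)


mentions = find_mentions(TEXT, GAZETTEER)

# Relations written by hand here; a relation extractor would predict them.
relations = [
    Relation("Sell", "Arancioni Pino", "hash"),
    Relation("Knows", "Marroni Antonella", "Arancioni Pino"),
]
```

Keeping character offsets on each mention is a design choice that lets a downstream interface highlight the exact passage supporting an entity or relation.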

The availability of this semantic metadata gives structure to the content of documents, so that they can be retrieved through much more expressive mechanisms. For example, identifying within a document base all the mentions of an individual (a Physical Subject), all the other individuals who have had relations with him/her (of the “Knows” type) and who are in turn related (through the “Attend a place” type) to entities of the Location type allows expressing complex queries, such as:

“Find all the documents containing information about an individual, the people who have interacted with him/her in the last month and the places they have frequented”.

It also enables the definition of powerful navigation schemas across this (possibly huge) body of knowledge, such as a semantic graph, where entities are nodes and relations are edges between them. The earlier example can be summarized by the following graph, which provides all the information required in the investigation.
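
As a rough illustration of such a graph, the sketch below stores the relations from the interrogation example as labelled edges and walks them to answer a simplified form of the query above. The triples, the relation names (`Knows`, `Sell`, `AttendPlace`) and the helper functions are assumptions made for this example. The "last month" time filter is left out because the example carries no dates.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples mirroring the example.
TRIPLES = [
    ("Marroni Antonella", "Knows", "Arancioni Pino"),
    ("Marroni Antonella", "Sell", "hash"),
    ("Arancioni Pino", "Sell", "hash"),
    ("Marroni Antonella", "AttendPlace", "Aprilia"),
    ("Arancioni Pino", "AttendPlace", "Aprilia"),
]

# Relations treated as holding in both directions (an assumption).
SYMMETRIC = {"Knows"}


def build_graph(triples):
    """Return node -> relation -> set of neighbouring nodes."""
    graph = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in triples:
        graph[subject][relation].add(obj)
        if relation in SYMMETRIC:
            graph[obj][relation].add(subject)
    return graph


def contacts_and_places(graph, person):
    """For each person known by `person`, list the places they attended."""
    return {
        contact: sorted(graph[contact]["AttendPlace"])
        for contact in sorted(graph[person]["Knows"])
    }


graph = build_graph(TRIPLES)
result = contacts_and_places(graph, "Marroni Antonella")
```

Here `result` maps each contact of the chosen individual to the places that contact attended, which is the two-hop traversal (person, then contact, then place) that the query describes.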

Reveal Relation Extractor (RelExt)

In this project, such knowledge is extracted and distilled using the Reveal Relation Extractor (RelExt): it processes input texts in order to identify entities of interest to analysts together with the relationships existing among them.

The RelExt system implements Machine Learning approaches for text processing, based on methods such as Support Vector Machines and/or Deep Learning. Since the types of entities and relations of interest may change from one domain to another, a team of analysts annotated mentions of entities and relationships of interest within the client’s documents.

This material was then used to automatically derive the neural models that automate the semantic processing of the documents ingested by the system, and to define benchmarks for the quantitative measurement of the semantic quality of the processors. The labeling of fewer than one hundred texts allows the deployment of a system able to “read” a collection of several thousand documents. As a consequence, several hundred thousand mentions of entities and relations were automatically extracted and used to populate a Database and a Semantic Search Engine. These can then be queried through standard query languages, such as SQL or SPARQL. This enables the straightforward implementation of powerful navigation and analysis software, such as graphical dashboards, useful for navigating this huge amount of knowledge.
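
The text does not say which metrics the benchmarks use. One common way to measure extraction quality against analyst annotations is precision, recall and F1 over sets of relation triples, sketched below. The function name and the gold and predicted triples are illustrative assumptions, not project data.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted relation triples against gold annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations made by analysts for one document.
gold = {
    ("Sell", "Arancioni Pino", "hash"),
    ("Knows", "Marroni Antonella", "Arancioni Pino"),
}

# Hypothetical system output: one correct triple and one wrong one.
predicted = {
    ("Sell", "Arancioni Pino", "hash"),
    ("Knows", "Marroni Antonella", "Aprilia"),
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
```

With one correct prediction out of two, and one of two gold triples recovered, precision, recall and F1 all equal 0.5 in this toy case.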

Re4act, the Reveal Crime Tracking Browser

The Database and the Semantic Search Engine can be queried through standard query languages, such as SQL or SPARQL. This enables the straightforward implementation of powerful navigation and analysis software.
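
To give a sense of what such a query might look like, here is a minimal sketch using Python's built-in `sqlite3` module. The single-table schema, the column names, the relation labels and the document identifier `doc-001` are hypothetical and are not the project's actual database layout.

```python
import sqlite3

# Hypothetical rows: (doc_id, subject, relation type, object).
ROWS = [
    ("doc-001", "Marroni Antonella", "Knows", "Arancioni Pino"),
    ("doc-001", "Arancioni Pino", "Sell", "hash"),
    ("doc-001", "Arancioni Pino", "AttendPlace", "Aprilia"),
    ("doc-001", "Marroni Antonella", "AttendPlace", "Aprilia"),
]

# Self-join: follow a "Knows" edge from the person of interest, then an
# "AttendPlace" edge from each contact, keeping the supporting document.
QUERY = """
SELECT k.object AS contact, a.object AS place, a.doc_id AS doc_id
FROM relation AS k
JOIN relation AS a
  ON a.subject = k.object AND a.type = 'AttendPlace'
WHERE k.type = 'Knows' AND k.subject = ?
ORDER BY contact, place, doc_id
"""


def build_demo_db(rows):
    """Create an in-memory database holding the extracted relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relation ("
        "doc_id TEXT NOT NULL, subject TEXT NOT NULL, "
        "type TEXT NOT NULL, object TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = build_demo_db(ROWS)
contacts = conn.execute(QUERY, ("Marroni Antonella",)).fetchall()
```

The person's name is passed as a bound parameter rather than interpolated into the SQL string, which is the safer habit when a dashboard forwards user input to the database.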

Navigation is available through graphical dashboards that help explore this large body of knowledge. In the following dashboard, a community of individuals (all of whom know each other) can be seen at a glance (on the left), together with the places they attended or the criminal group they belong to. Moreover, it is possible to browse the specific paragraphs where these entities are mentioned, consistently with the graph (in the center), or the specific document to be read by the analyst (on the right), where all discovered entities are also made explicit (at the bottom right).