Publications and public deliverables

Here you can find our publications related to our activities in the COPKIT project and the public deliverables generated so far.

 

Publications


Finding tendencies in streaming data using Big Data frequent itemset mining

Carlos Fernandez-Basso, Abel J. Francisco-Agra, Maria J. Martin-Bautista, M. Dolores Ruiz

Abstract

The amount of information generated in social media channels or economical/business transactions exceeds the usual bounds of static databases and is continuously growing. In this work, we propose a frequent itemset mining method using sliding windows capable of extracting tendencies from continuous data flows. To that end, we develop this method using Big Data technologies, in particular the Spark Streaming framework, which enables distributing the computation across several clusters and thus improves the algorithm's speed. The experimentation carried out shows the capability of our proposal and its scalability when massive amounts of data coming from streams are taken into account.
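The sliding-window idea in this abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark Streaming implementation: the function names, the brute-force counting of itemsets up to size 2, and the toy stream are all assumptions made for illustration.

```python
from collections import Counter, deque
from itertools import combinations


def frequent_itemsets(window, min_support, max_size=2):
    """Return itemsets whose support within the window is at least min_support."""
    counts = Counter()
    for transaction in window:
        items = sorted(set(transaction))
        # Count every itemset of size 1..max_size contained in the transaction.
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    n = len(window)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}


def mine_stream(stream, window_size, min_support):
    """Yield the frequent itemsets of the most recent window_size transactions."""
    window = deque(maxlen=window_size)  # the oldest transaction drops out automatically
    for transaction in stream:
        window.append(transaction)
        if len(window) == window_size:
            yield frequent_itemsets(window, min_support)


# Hypothetical toy stream; each set is one transaction.
stream = [
    {"rain", "umbrella"},
    {"rain", "umbrella", "traffic"},
    {"sun", "beach"},
    {"sun", "beach", "traffic"},
]
results = list(mine_stream(stream, window_size=2, min_support=1.0))
```

Each element of `results` holds the itemsets that are frequent in one window position, so a change in tendency shows up as the frequent itemsets shifting from one window to the next.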

 


Using Word Embeddings and Deep Learning for Supervised Topic Detection in Social Networks

Karel Gutiérrez-Batista, J.R. Campaña, Maria-Amparo Vila, Maria J. Martin-Bautista

Abstract

In this paper we show how word embeddings can be used to semantically evaluate the topic detection process in social networks. We propose to create and train word embeddings with the word2vec model to be used in the text classification process. Then, once the documents are classified, we use pre-trained word embeddings and two similarity measures for the semantic evaluation of the classification process. In particular, we perform experiments with two Twitter datasets, using both bag-of-words with conventional classification algorithms and word embeddings with deep learning-based classification algorithms. Finally, we perform a benchmark and make some inferences about the results.

 


Generalized Association Rules for Sentiment Analysis in Twitter

J. Angel Diaz-Garcia, M. Dolores Ruiz, Maria J. Martin-Bautista

Abstract

Association rules have been widely applied in a variety of fields over the last few years, given their potential for descriptive problems. One of the areas where the association rules have been most prominent in recent years is social media mining. In this paper, we propose the use of association rules and a novel generalization of these based on emotions to analyze data from the social network Twitter. With this, it is possible to summarize a great set of tweets in rules based on 8 basic emotions. These rules can be used to categorize the feelings of the social network according to, for example, a specific character.
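As a rough illustration of generalizing association rules by emotion, the sketch below replaces emotion-bearing words with an emotion category and then measures the support and confidence of a rule. The lexicon, the example tweets and the function names are hypothetical and are not taken from the paper.

```python
# Hypothetical word-to-emotion lexicon (illustrative only, not from the paper).
EMOTION_OF = {"love": "joy", "great": "joy", "hate": "anger", "scared": "fear"}


def generalize(tokens):
    """Replace emotion-bearing words by their emotion category; keep other tokens."""
    return {EMOTION_OF.get(t, t) for t in tokens}


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical tokenized tweets.
tweets = [
    ["alice", "love", "speech"],
    ["alice", "great", "debate"],
    ["alice", "hate", "traffic"],
    ["bob", "scared", "storm"],
]
transactions = [generalize(t) for t in tweets]
support, confidence = rule_metrics(transactions, {"alice"}, {"joy"})
```

Because "love" and "great" both map to the same category, the generalized rule `{alice} -> {joy}` gathers evidence that would otherwise be split across two word-level rules.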

 


A Comparative Analysis of Tools for Visualizing Association Rules: A Proposal for Visualising Fuzzy Association Rules

Carlos Fernandez-Basso, M. Dolores Ruiz, Maria J. Martin-Bautista

Abstract

Discovering new trends and co-occurrences in massive data is a key step when analysing social media, data coming from sensors, etc. Although Data Mining is nowadays very useful and widely used in industry, business and government, the main problem in applying machine learning or data mining in other fields is the interpretability and complexity of the obtained results for users who are not experts in computer or data science. For this reason, one of the most important phases of the KDD process is interpretation and evaluation. In the case of association rules, it is essential that results are interpretable for experts. One of the most useful tools for this goal is visualization, because it clarifies the interpretation of results, making them easier to understand in order to take a decision or explain the behaviour of data.

 


Non-query based pattern mining and sentiment analysis for massive microblogging online texts

Diaz-Garcia, J.A.; Ruiz, M.D.; Martin-Bautista, M.J.

Abstract

Pattern mining has been widely studied in the last decade given its great interest for research and its numerous applications in the real world. In this paper the definition of query and non-query based systems is proposed, highlighting the needs of non-query based systems in the era of Big Data. For this, we propose a new approach of a non-query based system that combines association rules, generalized rules and sentiment analysis in order to catalogue and discover opinion patterns in the social network Twitter. Association rules have been previously applied for sentiment analysis, but in most cases, they are used once the process of sentiment analysis is finished to see which tokens appear commonly related to a certain sentiment. On the other hand, they have also been used to discover patterns between sentiments. Our work differs from these in that it proposes a non-query based system which combines both techniques, in a mixed proposal of sentiment analysis and association rules to discover patterns and sentiment patterns in microblogging texts. The obtained rules generalize and summarize the sentiments obtained from a group of tweets about any character, brand or product mentioned in them. To study the performance of the proposed system, an initial set of 1.7 million tweets have been employed to analyse the most salient sentiments during the American pre-election campaign. The analysis of the obtained results supports the capability of the system of obtaining association rules and patterns with great descriptive value in this use case. Parallelisms can be established in these patterns that match perfectly with real life events.

 


Mining Text Patterns over Fake and Real Tweets

Diaz-Garcia, J.A.; Fernandez-Basso, C.; Ruiz, M.D.; Martin-Bautista, M.J.

Abstract

With the exponential growth of users and user-generated content present on online social networks, fake news and its detection have become a major problem. Through these, smear campaigns can be generated, aimed for example at trying to change the political orientation of some people. Twitter has become one of the main spreaders of fake news in the network. Therefore, in this paper, we present a solution based on Text Mining that tries to find which text patterns are related to tweets that refer to fake news and which patterns in the tweets are related to true news. To test and validate the results, the system faces a pre-labelled dataset of fake and real tweets during the U.S. presidential election in 2016. In terms of results interesting patterns are obtained that relate the size and subtle changes of the real news to create fake news. Finally, different ways to visualize the results are provided.

 


Investigating 3D-STDenseNet for Explainable Spatial Temporal Crime Forecasting

Brian Maguire; Faisal Ghaffar

Abstract

Crime is a well-known social problem faced worldwide. With the availability of large city datasets, the scientific community for predictive policing has switched its focus from people-centric to place-centric, focusing on heterogeneous data points related to a particular geographic region in predicting crimes. Such data-driven techniques identify micro-level regions known as hotspots with high crime intensity. In this paper, we adapt the state-of-the-art spatial-temporal prediction model STDenseNetFus to predict crime in geographic regions in the presence of external factors such as a region’s demographics, seasonal events and weather. We demonstrate that STDenseNet maintains prediction performance compared to previous results on the same dataset despite significantly reduced parameter count. We further extend STDenseNetFus architecture from two-dimensional to three-dimensional convolutions and show that it further improves the prediction results. Finally, we investigate the use of the DeepShap model explanation method to provide insights into the important input features affecting the model forecasts.

 


COPKIT: Technology and Knowledge for Early Warning/Early Action-Led Policing in Fighting Organised Crime and Terrorism

Raquel Pastor, Franck Mignet, Tobias Mattes, Agata Gurzawska, Holger Nitsch, David Wright

Abstract

Intelligence-led policing methods and supporting analysis tools represent the state-of-the-art approach in analysing, investigating, mitigating, and preventing crime. This chapter examines the question of how such methods and tools can address the lack of interaction between long-term high-level strategic intelligence and operational intelligence in the context of the fight against organised crime and terrorism.

First, this study argues that increased complexity of intelligence work requires new approaches extending existing methods by increasing the capability to combine intelligence analysis performed at strategic and operational levels. The new approach realises the required cross-fertilisation by the fusion and exchange of information from different data sources and the incorporation of knowledge resulting from different analysis levels. Second, the capabilities and desirable characteristics of relevant supporting tools for the new “Early Warning, Early Action” (EW/EA) approach are presented. Finally, the chapter discusses corresponding legal, ethical, and societal implications of such tools.

The presentation of the EW/EA paradigm and the related supporting tools in this chapter are based on research, inter alia, undertaken in the context of the EU-funded COPKIT project. COPKIT addresses innovative means of fighting organised crime and criminal use of ICT. The project aims to improve the analysing capabilities of LEAs not only during investigation but also preparedness, mitigation, and prevention.

 


Machine Learning Techniques for the Classification of Product Descriptions from Darknet Marketplaces

Clemens Heistracher, Franck Mignet, Sven Schlarb

Abstract

Over the past decade, the darknet has created unprecedented opportunities for trafficking in illicit goods, such as weapons and drugs, and it has provided new ways to offer crime as a service. Natural language processing techniques can be applied to find the types of goods that are traded in these markets. In this paper we present the results of evaluating state-of-the-art machine learning methods for the classification of darknet market offers.

Several embeddings, such as GloVe embeddings, Fasttext, Tensor Flow Universal Sentence Encoder, Flair’s contextual string embedding and term-frequency inverse-document-frequency (TF-IDF), as well as our domain-specific darknet embedding have been evaluated with a series of machine learning models, such as Random Forest, SVM, Naïve Bayes and Multilayer Perceptron.

To find the best combination of feature set and machine learning model for this task, the performance was evaluated on a publicly available collection covering 13 darknet markets with more than 10 million product offers. After extracting unique advertisements from the corpus, the classifier was trained on a subset with those advertisements that contain strings related to weapons. The purpose was to determine how well the classifier can distinguish between different types of advertisements which seem all to be related to weapons according to the keywords they contain.

The best performance for this classification task was achieved using the Linear Support Vector Machine model with the Tensor Flow Universal Sentence Encoder for feature extraction, resulting in a micro-f1-score of 96%.

 


Information Extraction from Darknet Market Advertisements and Forums

Clemens Heistracher, Sven Schlarb, Faisal Ghaffar

Abstract

Over the past decade, the Darknet has created unprecedented opportunities for trafficking in illicit goods, such as weapons and drugs, and it has provided new ways to offer crime as a service. Along with the possibilities of concealing financial transactions with the help of cryptocurrencies, the Darknet offers sellers the possibility to operate covertly. This article presents research and development outcomes of the COPKIT project which are relevant to the SECURWARE 2020 conference topics of data mining and knowledge discovery from a security perspective. It gives an overview of the methods, technologies and approaches chosen in the COPKIT project for building information extraction components with a focus on Darknet Markets. It explains the methods used to gain structured information, in the form of named entities, the relations between them, and events, from unstructured text data contained in Darknet Market web pages.

 


Building a fuzzy sentiment dimension for multidimensional analysis in social networks

Karel Gutiérrez-Batista, M. Amparo Vila, María J. Martín-Bautista

Abstract

The continuous growth of social networks has made them one of the main information sources for researchers and companies, but at the same time, their pre-processing and analysis represent a great challenge. In this work, we create a Fuzzy Sentiment Dimension from textual data from social networks to allow sentiment analysis jointly with standard dimensions in a multidimensional model, and to make sentiment analysis in social networks easier and more flexible. In particular, we use technologies such as fuzzy logic and multidimensional analysis, together with unsupervised tools for sentiment analysis. The use of unsupervised tools for sentiment analysis also allows both previously labeled and unlabeled documents to be analyzed. The performance results of the experimentation demonstrate the feasibility of the proposal.

 

Public deliverables


D4.4 Annotation Tool for the Law Enforcement Domain

This document is a report about a software deliverable that is prepared as part of task T4.5 “Expert annotation and integration of extracted information”. The purpose of this deliverable is to provide an environment where LEAs can evaluate an environment for creating annotations which are required for the supervised learning methods used in Tasks T4.3 and T4.4.

 


D8.1 Dissemination and Exploitation Plan v1

This deliverable outlines COPKIT’s Dissemination and Exploitation Plan, designing the dissemination and exploitation processes for COPKIT and the project outcomes.

 


D8.2 Communications Plan

This deliverable report defines the Communication Strategy for COPKIT, which complements the Dissemination and Exploitation plan outlined in D8.1 and focuses on promoting and informing different audiences about the project and its results.

 


D8.3 Project Website

This document has been prepared in order to formalise the submission of the deliverable D8.3 COPKIT Project Website in the Participant Portal, although the deliverable is not a report, but the website itself.

 


D8.5 Standardization Opportunities and Action Plan

This report describes the potential for standardization, certification, or inclusion in community best practices of the various (expected) outcomes of the project (methods, processes, software…). Based on this analysis, an action plan is proposed.

 


For more information and updates, follow us on Twitter, LinkedIn and Facebook, and feel free to contact our team at copkit@copkit.eu.