Journées internationales d'Analyse statistique des Données Textuelles
7-10 June 2016, Nice (France)
Features Extraction To Improve Comparable Tweet corpora Building
Malek Hajjem 1, *, Chiraz Latiri 1
1 : LIPAH, Université Tunis El Manar
* : Corresponding author

This paper addresses the construction of comparable corpora from Twitter, focusing on the task of
evaluating tweet relevance. Because the Twitter microblogging service is very popular, tweets can
be considered a new data source for comparable corpora. One way to build comparable
corpora from Twitter is to extract tweets in two selected languages that share a specific topic, in
order to construct a multilingual corpus. However, mining relevant tweets poses
a real challenge: how can only the tweets most relevant to a specific topic be extracted from the
huge number of collected tweets? To this end, we propose an unsupervised machine
learning approach that improves the quality of the collected textual data by identifying which
messages, i.e., tweets, address the specific topic. Several tweet representations are used to filter the
extracted messages. The main goal of this relevance estimation process is to improve the degree of
comparability between the extracted bilingual tweet corpora.
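
The abstract does not say which tweet representations or which learning algorithm are used. Purely as an illustration of unsupervised relevance filtering, the sketch below assumes a TF-IDF bag-of-words representation and scores each tweet by its cosine similarity to the centroid of the collection, keeping tweets above a threshold. All function names, the threshold, and the sample tweets are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric tokens (assumption: ASCII text).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(tweets):
    # One sparse TF-IDF vector (dict term -> weight) per tweet.
    docs = [Counter(tokenize(t)) for t in tweets]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n) / (1 + c)) + 1.0 for term, c in df.items()}
    return [{term: tf * idf[term] for term, tf in d.items()} for d in docs]


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance_scores(tweets):
    # Score each tweet by similarity to the centroid of the whole collection,
    # assuming the dominant topic of the collection is the target topic.
    vectors = tfidf_vectors(tweets)
    centroid = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] += weight
    return [cosine(vec, centroid) for vec in vectors]


def filter_tweets(tweets, threshold):
    # Keep only tweets whose relevance score reaches the threshold.
    scores = relevance_scores(tweets)
    return [t for t, s in zip(tweets, scores) if s >= threshold]


if __name__ == "__main__":
    sample = [
        "election results announced tonight",
        "election vote results",
        "vote in the election today",
        "my cat sleeps all day",
    ]
    for tweet, score in zip(sample, relevance_scores(sample)):
        print(f"{score:.3f}  {tweet}")
```

In a bilingual setting, such a filter would be applied to each language's collection separately before measuring comparability. The threshold would need tuning on real data.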
