JADT 2016 - Sciencesconf.org

Journées internationales d'Analyse statistique des Données Textuelles

7-10 June 2016, Nice (France)

sciencesconf.org:jadt2016:85893

This paper addresses the construction of comparable corpora from Twitter, focusing on the task of
evaluating tweet relevance. Because the Twitter microblogging service is very popular, tweets can
be considered a new data source for comparable corpora. One way to build comparable corpora from
Twitter is to extract tweets that are written in two selected languages and share a specific topic,
yielding a multilingual corpus. Mining relevant tweets, however, raises a real challenge: how can
only the tweets most relevant to a specific topic be extracted from the huge number of collected
tweets? We propose an unsupervised machine learning approach that improves the quality of the
collected textual data by identifying which messages, i.e., tweets, address the specific topic.
Several tweet representations are used to filter the extracted messages. The main goal of this
relevance estimation process is to improve the degree of comparability between the extracted
bilingual tweet corpora.
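
As an illustration only, the idea of filtering tweets by topical relevance without labelled data can be sketched as follows. This is not the approach described in the paper, whose representations and learning method are not detailed in this abstract. The sketch assumes a plain bag-of-words representation, a hand-chosen list of seed topic terms, and a cosine-similarity cut-off; the function names, the sample tweets, and the `threshold` value are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_relevant(tweets, topic_terms, threshold=0.2):
    """Keep (tweet, score) pairs whose similarity to the topic vector
    reaches the threshold. The threshold is an assumed, illustrative value."""
    topic_vec = Counter(tokenize(" ".join(topic_terms)))
    kept = []
    for tweet in tweets:
        score = cosine(Counter(tokenize(tweet)), topic_vec)
        if score >= threshold:
            kept.append((tweet, score))
    return kept


if __name__ == "__main__":
    # Hypothetical sample data, used only to exercise the functions.
    sample = [
        "Climate change talks open in Paris #COP21",
        "My cat sleeps all day",
        "New climate agreement reached at COP21",
    ]
    for tweet, score in filter_relevant(sample, ["climate", "cop21"]):
        print(f"{score:.2f}  {tweet}")
```

For a bilingual corpus, the same filter would be run once per language with topic terms in that language, so that both sides keep only on-topic tweets.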

Type	:	oral
Full-text language	:	English
Topics	:	Text classification
Topics	:	Co-occurrences
Topics	:	Data mining
Topics	:	Textual analysis
Topics	:	NLP
Keywords	:	Tweet Clustering ; Ambiguity Estimation ; Twitter mining ; Comparable corpora construction ; Comparability
