Journées internationales d'Analyse statistique des Données Textuelles
7-10 June 2016, Nice (France)
The effects of lemmatization on textual analysis conducted with IRaMuTeQ: results in comparison.
Mauro Sarrica 1,*, Isabella Mingo 1, Bruno Mazzara 1, Giovanna Leone 1
1 : Sapienza - University of Rome
* : Corresponding author

The software IRaMuTeQ is gaining ground in social and psychological research. It is free and easy to use, it provides quality outputs, and it fits theoretical perspectives concerned with communication and the social construction of knowledge. As in other forms of automated analysis of large textual corpora, its use involves pre-treatment and modification of the original text in order to reduce complexity. Lemmatization is a very delicate phase, which affects the whole strategy of analysis (from the selection of lemmas according to statistical or substantive criteria, to the extraction of organising factors). However, the lemmatization algorithms implemented in commercial or free software often run after grammatical tagging, rely on reference dictionaries, and operate automatically in a way that is not transparent to end users. Apart from anecdotal evidence, it is often difficult to evaluate the reliability of these automated procedures. The aim of this paper is to compare the outcomes of the procedures implemented in IRaMuTeQ with the output obtained with other well-established software. We used a large Italian-language corpus on the issue of the "fiscal compact", consisting of over one million occurrences, drawn from over 3000 newspaper articles published from 2012 to 2015. The same corpus was lemmatised using the procedures available in IRaMuTeQ (list based) and those implemented in Taltac, Tlab, and Tree Tagger. The proximity between the resulting lists of lemmas produced by each software is compared using intertextual distance. Finally, in order to examine the effects of the different procedures on the textual analysis, we took the two most distant lists of lemmas and applied a correspondence analysis to the two lemmas/newspapers matrices.
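
As an illustration of how two lists of lemmas can be compared, the following is a minimal sketch of a simplified Labbé-style intertextual distance. The abstract does not state which formulation was used, so the formula here is an assumption: the frequencies of the larger list are rescaled to the size of the smaller one, and the summed absolute differences are divided by twice the size of the smaller list, giving a value between 0 (identical profiles) and 1 (no lemma in common). The function name and the toy frequencies are hypothetical.

```python
from collections import Counter


def intertextual_distance(freq_a, freq_b):
    """Simplified Labbé-style distance between two lemma frequency lists.

    freq_a, freq_b: mappings lemma -> absolute frequency.
    Assumption: this is an illustrative formulation, not necessarily
    the one used by the authors.
    """
    n_a = sum(freq_a.values())
    n_b = sum(freq_b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("both frequency lists must be non-empty")

    # Make freq_a the smaller list so that freq_b is scaled down to its size.
    if n_a > n_b:
        freq_a, freq_b = freq_b, freq_a
        n_a, n_b = n_b, n_a

    scale = n_a / n_b
    vocabulary = set(freq_a) | set(freq_b)

    # Sum of absolute differences between observed and rescaled frequencies.
    total_diff = sum(
        abs(freq_a.get(lemma, 0) - freq_b.get(lemma, 0) * scale)
        for lemma in vocabulary
    )
    # Both lists now have size n_a, so the maximum possible difference is 2 * n_a.
    return total_diff / (2 * n_a)


# Toy example with made-up frequencies from two hypothetical lemmatizers.
list_one = Counter({"fiscal": 3, "compact": 1})
list_two = Counter({"fiscal": 2, "compact": 2})
print(intertextual_distance(list_one, list_two))
```

Computing this value for every pair of lemmatizers gives a small distance matrix, from which the two most distant lists can be picked out.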
