Journées internationales d'Analyse statistique des Données Textuelles
7-10 June 2016, Nice (France)
Cross-Linguistic Stylometric Features: A Preliminary Investigation
Patrick Juola 1, *, George Mikros 2, *
1: EVL Lab, Duquesne University
Pittsburgh, PA 15282 - United States
2: University of Athens, Greece (UNIV. ATHENS)
University of Athens, Greece - Greece
*: Corresponding author

Authorship attribution, at JADT and elsewhere, has become a well-studied field with a well-understood stylometric methodology: one gathers a set of known documents representative of and comparable to the questioned documents, extracts a suitable feature set from the known documents, and uses classification techniques to determine the author of the questioned documents. A well-known limitation of this method is the need for comparable documents, which are often difficult to find in realistic situations.

The primary reason for this is that, as recent scholarship has established, higher performance can generally be obtained with low-level, linguistically unsophisticated feature sets such as function (or common) words and/or character n-grams. These features perform well for both authorship judgments and profiling, but they are tied to the language of the text and do not provide direct insight into the author.

This paper proposes some potentially cross-linguistic features and investigates their stability within an individual but across languages. Using a custom corpus scraped from social media (Twitter), we identified fourteen individuals who posted in both English and Spanish. We evaluated their writings on a number of different language-independent measures, including participation in Twitter-specific social conventions (such as use of #hashtags and @mentions) as well as measures of vocabulary size and complexity. We were able to show a number of highly significant correlations, indicating a strong degree of cross-linguistic persistence in these style markers. In simpler terms, people who send long tweets in English also do so in Spanish, people who use big words in English also do so in Spanish, people who use a varied vocabulary in English also do so in Spanish, people who use lots of hashtags in English also do so in Spanish, and so forth. We also show (via cluster analysis) that the measurements are partially independent across languages, suggesting that a sufficient number of these could create a basis for vector-space-based classification.

In summary, this paper shows that certain basic stylistic regularities appear to be systematically persistent irrespective of the language of writing, and therefore that stylometric authorship attribution may be possible in a language-independent or multilingual way.
