Journées internationales d'Analyse statistique des Données Textuelles
7-10 June 2016, Nice (France)
Author name extraction in blog web pages: a machine learning approach
Lucie Dupin *, Jean-Yves Antoine 1, Nicolas Labroche, Agata Savary, Jean-Christophe Lavocat
1 : Laboratoire d'Informatique de l'Université de Tours (LI)
Polytech'Tours, Université François Rabelais - Tours : EA6300
64, Avenue Jean Portalis, 37200 Tours -  France
* : Corresponding author

Text information retrieval and extraction are receiving increasing attention from scientists and companies, as Big Data studies have started to integrate efficient text mining solutions for Web documents. In this context, NLP techniques aim not only at retrieving propositional content from electronic documents, but also at identifying the authors of these documents. Author identification is of particular interest to companies that track and survey individual behaviour in social networks. This paper presents research results that address this need: the automatic extraction of author names that are explicitly mentioned in blog web pages.

Author name extraction is close to Authorship Attribution (AA), whose aim is to determine whether a document was written by a candidate author whose identity is not revealed in the text. AA applies classification techniques to statistical stylistic features in order to identify this hidden author (Stamatatos 2009). Initially dedicated to literary studies and plagiarism detection, it is now also considered for social network monitoring and legal matters.

Author name extraction aims, by contrast, at identifying a proper name that designates the author within a document. It does not raise the sensitive ethical questions that Authorship Attribution does. Existing commercial extraction systems are limited to a few heuristics on HTML tags (“author”, for instance) while, to the best of our knowledge, little research has addressed the task from a statistical (machine learning) or NLP point of view. In this paper, we adapt the seminal work of Sahar Changuel (2011), conducted on web pages, to a new kind of document: blog pages.

This task is harder than it looks: although blog pages are very strictly structured, their structure varies strongly from one blog platform to another. In addition, distinguishing the author of a post from the people who left comments is not straightforward. It is also important to distinguish personal blogs from institutional and commercial ones.

Our approach roughly follows the proposal of (Changuel 2011). First, proper names are extracted with the Stanford Named Entity Recognizer. The named entities are then annotated with 11 different features. Most of them relate to the blog structure (XML tags), but an original aspect of this work is that it also considers linguistic clues, precisely in order to depend less heavily on blog structure. These features describe the local context around the entity (a date, markers such as “by” or “author”...). Named entities are also analysed syntactically (or morphosyntactically) in order to merge close mentions of the same person (Roosevelt, Theodore Roosevelt...). Finally, the author is selected among the detected named entities by a classification process implemented with the scikit-learn library. Our system uses an SVM with a linear kernel (linear SVC) to rank the different hypotheses.
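
As an illustration of two of the steps described above, here is a minimal Python sketch of local-context clues and of the merging of close mentions. It is not the system's implementation: the function names, the marker list (restricted to “by” and “author”), the date pattern, the window size and the merging rule (plain word containment rather than a morphosyntactic analysis) are all assumptions made for the example.

```python
import re

# Illustrative cue words only: the abstract mentions markers like "by" and "author".
AUTHOR_MARKERS = {"by", "author"}

# Assumed date shapes: 12/03/2015, 12-03-15, or a bare four-digit year.
DATE_PATTERN = re.compile(
    r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:19|20)\d{2}\b"
)


def context_features(tokens, start, end, window=3):
    """Return binary clues for the candidate name tokens[start:end].

    Looks at up to `window` tokens on each side of the candidate.
    """
    left = [t.lower().strip(".,:;") for t in tokens[max(0, start - window):start]]
    nearby = " ".join(tokens[max(0, start - window):end + window])
    return {
        "marker_before": any(t in AUTHOR_MARKERS for t in left),
        "date_nearby": bool(DATE_PATTERN.search(nearby)),
    }


def merge_mentions(mentions):
    """Group mentions whose words are all contained in a longer mention.

    Returns a dict mapping the longest form to the list of its mentions.
    """
    groups = []
    # Visit longer mentions first so that they become the canonical form.
    for mention in sorted(mentions, key=lambda m: -len(m.split())):
        words = set(mention.lower().split())
        for group in groups:
            if words <= group["words"]:
                group["mentions"].append(mention)
                break
        else:
            groups.append({"name": mention, "words": words, "mentions": [mention]})
    return {g["name"]: g["mentions"] for g in groups}


tokens = "Posted by Theodore Roosevelt on 12/03/2015".split()
features = context_features(tokens, start=2, end=4)
merged = merge_mentions(["Roosevelt", "Theodore Roosevelt", "Franklin"])
```

In a full pipeline, clues of this kind would be combined with the structural features into one vector per candidate before classification. Word containment is a deliberately crude merging rule: it would also merge two different people who share a surname.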

The system was trained and assessed on a corpus of English blog pages. The evaluation focused on the classification process: it was conducted on correctly identified named entities only. The results are encouraging (F1-measure: 0.93). Experiments show that the linguistic features improve performance only slightly. However, an increase in recall suggests that they are useful when the blog structure differs significantly from that of the training corpus.
