Journées internationales d'Analyse statistique des Données Textuelles
7-10 June 2016, Nice (France)
A cosine based validation measure for Document Clustering
Simona Balbi 1,*, Michelangelo Misuraca 2,*, Maria Spano 1,*
1 : University of Naples Federico II [Naples]
Corso Umberto I, 40, 80138 Napoli - Italy
2 : Università della Calabria [Arcavacata di Rende] (Unical)
Campus di Arcavacata via Pietro Bucci 87036 Arcavacata di Rende (CS) - Italy
* : Corresponding author

Document Clustering is the application of cluster analysis methods to very large document collections. It aims to organize a large number of unlabelled documents into a smaller number of meaningful, coherent clusters whose members are similar in content. One of the main unsolved problems in the clustering literature is the lack of a reliable methodology for evaluating results, although a wide variety of validation measures have been proposed. These measures are often unsatisfactory even on numerical data, and they perform worse still in Document Clustering. In this paper we propose a new validation measure. After introducing the most common approaches to Document Clustering, we focus on Spherical K-means because of its close connection with the Vector Space Model typical of Information Retrieval. Since Spherical K-means adopts a cosine-based similarity measure, we propose a validation measure based on the same criterion. The effectiveness of the new measure is shown through a comparative study involving 13 different corpora (commonly used in the literature to compare competing proposals) and 15 validation measures.
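
To make the cosine criterion concrete, the following is a minimal pure-Python sketch of Spherical K-means on unit-length document vectors, together with a simple cosine-based cohesion score (the average cosine between each document and its cluster centroid). This score is only an illustration of what "based on the same criterion" could look like; it is not the validation measure proposed in the paper, which the abstract does not define. The function names and the toy data are assumptions introduced here for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity of two unit-length vectors, i.e. their dot product."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(docs, init_centroids, max_iter=50):
    """Illustrative Spherical K-means: assign each document to the centroid
    with the highest cosine similarity, then re-normalize the cluster means."""
    docs = [normalize(d) for d in docs]
    centroids = [normalize(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(d, centroids[k]))
            for d in docs
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


def mean_cosine_cohesion(docs, labels, centroids):
    """Hypothetical cohesion score: average cosine between each document
    and the centroid of its cluster (higher means tighter clusters)."""
    docs = [normalize(d) for d in docs]
    return sum(cosine(d, centroids[lab]) for d, lab in zip(docs, labels)) / len(docs)


# Toy term-frequency vectors (rows: documents, columns: terms); assumed data.
toy_docs = [[2, 0, 0], [3, 1, 0], [0, 1, 4], [0, 0, 5]]
toy_labels, toy_centroids = spherical_kmeans(toy_docs, [[1, 0, 0], [0, 0, 1]])
toy_score = mean_cosine_cohesion(toy_docs, toy_labels, toy_centroids)
```

Because documents and centroids are normalized to unit length, the cosine reduces to a dot product, which is why the Vector Space Model and Spherical K-means fit together so naturally.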
