Journées internationales d'Analyse statistique des Données Textuelles
7-10 June 2016, Nice (France)
Lexical compactness across genres in works by Karel Čapek
Ján Mačutek 1, Michaela Koščová 1, Radek Čech 2
1 : Department of Applied Mathematics and Statistics, Comenius University, Bratislava
2 : Department of Czech Language, University of Ostrava

Mačutek and Wimmer (2014) introduced a simple measure of so-called lexical text compactness. Two sentences are considered linked if they share at least one content word lemma. The measure of lexical compactness is then defined as the ratio of the number of linked sentence pairs to the number of all pairs of sentences in a text. The method (including the possibility of statistical tests for differences between the compactness of two texts) was exemplified on two short Slovak journalistic texts.
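
As an illustration of the definition above, the following minimal Python sketch computes the compactness of a text that has already been split into sentences and reduced to sets of content word lemmas. The function name lexical_compactness and the toy lemma sets are assumptions made for this example only; lemmatization and the selection of content words are assumed to have been done beforehand, and the sketch is not the implementation used by Mačutek and Wimmer (2014).

```python
from itertools import combinations


def lexical_compactness(sentences):
    """Return the share of sentence pairs that share at least one lemma.

    `sentences` is a list of iterables, each holding the content word
    lemmas of one sentence (hypothetical input format).
    """
    lemma_sets = [set(s) for s in sentences]
    n = len(lemma_sets)
    if n < 2:
        raise ValueError("at least two sentences are needed")
    # A pair of sentences is linked if the two lemma sets intersect.
    linked = sum(1 for a, b in combinations(lemma_sets, 2) if a & b)
    all_pairs = n * (n - 1) // 2
    return linked / all_pairs


# Toy data, invented for illustration only.
toy_text = [
    {"dog", "bark"},
    {"dog", "run"},
    {"cat", "sleep"},
    {"cat", "run"},
]
compactness = lexical_compactness(toy_text)
```

In the toy text, three of the six sentence pairs share a lemma, so the compactness is 0.5.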

We apply this method to 59 Czech texts of 6 genres (fairytales, journalistic texts, private letters, travel books, scientific texts on aesthetics, and short stories) by Karel Čapek. The texts were taken from the Karel Čapek on-line project (see the references). Text length varies from 5 to 314 sentences. By choosing texts written by the same author, we try to eliminate (or at least reduce) the author's influence; the results should thus depend (mainly) on the genre and on the text length.

We show that lexical compactness tends to decrease with increasing text length, whereas the influence of genre is not very strong. In addition, we investigate some other properties of the links which connect sentences in the above-mentioned sense. First, we focus on the development of the number of links in a text, i.e., on the dependence of the number of links on the position of the sentence in the text. There seems to be no obvious trend; the number of links oscillates quite irregularly with respect to the sentence position. However, one can observe a much more regular behaviour if the lengths of links (defined as differences between the positions of the respective sentences in the text) are analyzed. Shorter links are preferred, while longer links occur relatively seldom.
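
The two link properties described above can be sketched in Python as follows. All function names and the toy data are assumptions made for this example. In particular, counting each link at both of its end sentences is only one possible reading of "the number of links at a sentence position" and may differ from the counting used in the study.

```python
from collections import Counter
from itertools import combinations


def find_links(sentences):
    """Return all linked pairs (i, j), i < j, with 1-based sentence positions."""
    lemma_sets = [set(s) for s in sentences]
    indexed = enumerate(lemma_sets, start=1)
    return [(i, j) for (i, a), (j, b) in combinations(indexed, 2) if a & b]


def links_per_position(links, n_sentences):
    """Count the links touching each sentence (assumption: both ends count)."""
    counts = Counter()
    for i, j in links:
        counts[i] += 1
        counts[j] += 1
    return [counts[k] for k in range(1, n_sentences + 1)]


def link_length_distribution(links):
    """Map each link length (difference of positions) to its frequency."""
    return Counter(j - i for i, j in links)


# Toy data, invented for illustration only.
toy_text = [
    {"dog", "bark"},
    {"dog", "run"},
    {"cat", "sleep"},
    {"cat", "run"},
]
links = find_links(toy_text)
per_position = links_per_position(links, len(toy_text))
lengths = link_length_distribution(links)
```

For the toy text, the links connect sentences 1 and 2, 2 and 4, and 3 and 4, so two links have length 1 and one link has length 2.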

Finally, we try to develop a text typology based on the investigated text characteristics.


Karel Čapek on-line. A common project of the Prague City Library, Institute of the Czech National Corpus (Faculty of Arts, Charles University in Prague), Společnost bratří Čapků, and Památník Karla Čapka.

Mačutek, J., Wimmer, G. (2014). A measure of lexical text compactness. In: Altmann, G., Čech, R., Mačutek, J., Uhlířová, L. (eds.), Empirical Approaches to Text and Language Analysis (pp. 132-139). Lüdenscheid: RAM-Verlag.

  • Poster
