Journées internationales d'Analyse statistique des Données Textuelles
7-10 June 2016, Nice (France)
Reproducible Identification of Pragmatic Universalia in CHILDES Transcripts
Hromada Daniel 1, 2, 3, *
1 : Universität der Künste (Medienhaus UdK)
Grunewaldstrasse 2, 10823 Berlin - Germany
2 : Cognitions Humaine et ARTificielle (CHART)
École Pratique des Hautes Études [EPHE], Université Paris VIII - Vincennes Saint-Denis : EA4004
Université Paris 8, 2 rue de la Liberté, 93526 Saint-Denis - France
3 : Slovak University of Technology (STUBA FEI URK)
Ilkovičova 3, 812 19 Bratislava 1 - Slovakia
* : Corresponding author

This article presents the method and results of multiple analyses of the largest publicly available corpus of language acquisition data: the Child Language Data Exchange System (CHILDES). Its methodological aim is to show how science can be done in a highly positivist, empirical and reproducible manner consistent with the precepts of the “Open Science” movement. To that end, it presents a handful of simple one-liners that pipeline standard GNU tools such as “grep” and “uniq” and that, when applied to the myriad transcripts contained in the corpus, can potentially pave a path towards the identification of statistically significant phenomena. Relative frequencies of occurrence are analyzed along the age and language axes in order to help identify certain concrete, pragmatic universalia marking different stages of linguistic ontogeny in human children. One can thus observe a significant culture-agnostic decrease of laughing, between the first and second year of age, in child-produced speech and in child-directed Indo-European “motherese”; a maternal increase in the production of the second person singular pronoun “you”; an increase in the use of the first person singular “I” in utterances produced by children around the third year of age; and a marked decrease of the same around six years of age. Other significant correlations, both intracultural (between English mothers and children) and intercultural, are pointed out, always accompanied by thorough descriptions of the methodology, immediately reproducible on an average computer.
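
As a rough illustration of the kind of one-liner the abstract describes, the following sketch computes the relative frequency of a word on one speaker's lines and tallies utterances per speaker. It is not the article's own code. It assumes transcripts in which every utterance line starts with a speaker code such as `*CHI:` (child) or `*MOT:` (mother), and the function names `rel_freq` and `speaker_counts` are hypothetical helpers introduced here for illustration only.

```shell
# Sketch only: relative frequency of a word among one speaker's utterances.
# Assumption: each utterance line begins with "*<SPEAKER>:", e.g. "*CHI:".
rel_freq() {
    speaker="$1"   # speaker code, e.g. CHI or MOT
    pattern="$2"   # word (extended regex) to look for, e.g. I or you
    shift 2        # remaining arguments are transcript files

    # Number of utterance lines produced by the speaker.
    total=$(cat "$@" | grep -c "^\*${speaker}:" || true)

    # Number of those lines containing the pattern as a whole word,
    # ignoring case.
    hits=$(cat "$@" | grep "^\*${speaker}:" | grep -c -i -w -E "$pattern" || true)

    # Print hits/total with four decimals, or NA if the speaker is absent.
    awk -v h="$hits" -v t="$total" \
        'BEGIN { if (t > 0) printf "%.4f\n", h / t; else print "NA" }'
}

# Sketch only: number of utterance lines per speaker code, most frequent first.
speaker_counts() {
    grep -h -o '^\*[A-Z]*:' "$@" | sort | uniq -c | sort -rn
}
```

Under these assumptions, a call such as `rel_freq CHI I transcripts/*.cha` (the path is hypothetical) would give the share of child utterances containing “I”. Running it separately on files grouped by age or by language would yield the relative frequencies along the age and language axes mentioned above.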

