Collection of a corpus of Dutch SMS

Oostdijk, N.; Treurniet, Maaske; Clercq, O. de; Heuvel, H. van den

Abstract

In this paper we present the first freely available corpus of Dutch text messages, containing data originating from the Netherlands and Flanders. The corpus was collected in the framework of the SoNaR project and constitutes a viable part of that 500-million-word corpus. About 53,000 text messages were collected on a large scale, based on voluntary donations. These messages will be distributed as such. We focus on the data collection processes involved and, after studying the effect of media coverage, show that free publicity in newspapers and on social media networks in particular results in more contributions. All SMS messages are provided with metadata. Looking at the composition of the corpus, it becomes clear that a small number of people contributed a large amount of data; in total, 272 people contributed to the corpus over three months. More women than men contributed to the corpus, but male contributors submitted larger amounts of data. This corpus will be of paramount importance for sociolinguistic research and normalisation studies.


Theme

General


Year	2012
Type
Language	Dutch
