The nlGLOBE Corpus: A balanced collection of contemporary written Dutch

xujiajin · 2022-12-27

The nlGLOBE Corpus

INTRODUCTION

The nlGLOBE Corpus is a balanced collection of contemporary written Dutch texts, totaling approximately one million words.

The text samples in the corpus were gathered and cleaned up by Jiachen Zhang, Xiaoxiao Lin, Xiao’ou Lei, Zhiyan Zheng and Yunjie Zhang at the School of European Languages and Cultures, Beijing Foreign Studies University (BFSU), China.

The online version of the nlGLOBE Corpus is available at the BFSU CQPweb Corpus Portal (http://114.251.154.212/cqp/). Both the user ID and the passcode are ‘test’.

KEY INFORMATION

Project leader: Jiachen Zhang of the School of European Languages and Cultures, BFSU

Text collectors: Xiaoxiao Lin, Jiachen Zhang, Xiao’ou Lei, Zhiyan Zheng and Yunjie Zhang of the School of European Languages and Cultures, and Mingchen Sun of the National Research Centre for Foreign Language Education, BFSU

Time of compilation: September 2021 – December 2022

Size: Approximately one million words

Language: Contemporary Dutch

Number of texts/samples: 500 samples of at least 2,000 words each. (Short texts are combined to form one 2,000-word sample, but are saved separately and marked with A, B, C, etc. in the filenames.)

Versions of the corpus: Three versions are available: raw texts, part-of-speech (POS) annotated texts, and lemmatised texts. The texts were POS-tagged and lemmatised with TreeTagger.

Period: The vast majority of the texts were published between 2012 and 2022. Only one text was published earlier, in 2009.

Released in: December 2022
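The relationship between the three versions listed above can be illustrated with a short sketch. When TreeTagger is run with its -token and -lemma options, it prints one token per line with three tab-separated fields (token, POS tag, lemma) and writes <unknown> where it cannot find a lemma. The sample lines, the tag labels and all function names below are illustrative assumptions, not material from the corpus, and the actual file layout of the nlGLOBE release may differ.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagged(text: str) -> list[Token]:
    """Parse token<TAB>tag<TAB>lemma lines into Token records."""
    tokens = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        word, pos, lemma = fields
        # TreeTagger writes "<unknown>" when it has no lemma for a token;
        # falling back to the surface form is a choice made for this sketch.
        if lemma == "<unknown>":
            lemma = word
        tokens.append(Token(word, pos, lemma))
    return tokens


def raw_version(tokens: list[Token]) -> str:
    """Rebuild a space-separated raw text from the tokens."""
    return " ".join(t.word for t in tokens)


def pos_version(tokens: list[Token]) -> str:
    """Render each token as word_TAG (one common annotation layout)."""
    return " ".join(f"{t.word}_{t.pos}" for t in tokens)


def lemma_version(tokens: list[Token]) -> str:
    """Replace every token by its lemma."""
    return " ".join(t.lemma for t in tokens)


# Hypothetical tagger output; the tag labels are placeholders.
SAMPLE = (
    "De\tdet__art\tde\n"
    "katten\tnounpl\tkat\n"
    "slapen\tverbprespl\tslapen\n"
    ".\t$.\t.\n"
)

tokens = parse_tagged(SAMPLE)
print(raw_version(tokens))
print(pos_version(tokens))
print(lemma_version(tokens))
```

Keeping the parsed tokens in one structure and deriving each view from it means the three versions stay aligned token by token, which is useful when a query on lemmas needs to be mapped back to the raw text.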

BACKGROUND

On 29 December 2021, Jiajin Xu launched the GLOBE (Global Languages Out of BFSU Expertise) Corpus project, an initiative which aims to collect present-day written texts in all 101 languages that are taught at BFSU. The project follows the sampling frame of the Brown Corpus so that the multilingual GLOBE corpus family is comparable to the Brown family of corpora. The immediate intended application of the GLOBE corpora is corpus-based dictionary compilation. The first batch of corpora covers about 30 languages.

Table 1. Text categories in the GLOBE Corpus.

(Adapted from https://varieng.helsinki.fi/CoRD/corpora/BROWN/basic.html)

	Genre group	Category	Content of category	No. of texts
I. Informative prose (374)	Press (88)	A	Reportage	44
		B	Editorial	27
		C	Review	17
	General prose (206)	D	Religion	17
		E	Skills, trades and hobbies	36
		F	Popular lore	48
		G	Belles lettres, biographies, essays	75
		H	Miscellaneous	30
	Learned (80)	J	Science	80
II. Imaginative prose (126)	Fiction (126)	K	General fiction	50
		L	Mystery and detective fiction	12
		M	Science fiction	12
		N	Adventure and Western	13
		P	Romance and love story	30
		R	Humour	9
Total				500
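The subtotals in Table 1 can be checked mechanically. The sketch below encodes the category counts from the table and confirms that they add up to the stated group totals, to 374 informative and 126 imaginative texts, and to 500 texts overall. The variable and function names are illustrative, and the final figure uses the nominal 2,000 words per sample, so it is a lower bound rather than the exact corpus size.

```python
# Category letter -> (genre group, content of category, number of texts),
# copied from Table 1.
CATEGORIES = {
    "A": ("Press", "Reportage", 44),
    "B": ("Press", "Editorial", 27),
    "C": ("Press", "Review", 17),
    "D": ("General prose", "Religion", 17),
    "E": ("General prose", "Skills, trades and hobbies", 36),
    "F": ("General prose", "Popular lore", 48),
    "G": ("General prose", "Belles lettres, biographies, essays", 75),
    "H": ("General prose", "Miscellaneous", 30),
    "J": ("Learned", "Science", 80),
    "K": ("Fiction", "General fiction", 50),
    "L": ("Fiction", "Mystery and detective fiction", 12),
    "M": ("Fiction", "Science fiction", 12),
    "N": ("Fiction", "Adventure and Western", 13),
    "P": ("Fiction", "Romance and love story", 30),
    "R": ("Fiction", "Humour", 9),
}

# Group totals as printed in the table.
STATED_GROUP_TOTALS = {
    "Press": 88,
    "General prose": 206,
    "Learned": 80,
    "Fiction": 126,
}


def totals_by_group(categories: dict) -> dict:
    """Sum the number of texts per genre group."""
    totals = {}
    for group, _content, n_texts in categories.values():
        totals[group] = totals.get(group, 0) + n_texts
    return totals


computed = totals_by_group(CATEGORIES)
assert computed == STATED_GROUP_TOTALS

informative = computed["Press"] + computed["General prose"] + computed["Learned"]
imaginative = computed["Fiction"]
total_texts = informative + imaginative

# 2,000 words is the nominal minimum sample length.
nominal_words = total_texts * 2000

print(informative, imaginative, total_texts, nominal_words)
# prints: 374 126 500 1000000
```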

The nlGLOBE Corpus is a sub-project of the BFSU-funded GLOBE Corpus projects (Ref. 2022SYLZD015 and 2022SYLPY004), whose principal investigator is Prof. Jiajin Xu at the National Research Centre for Foreign Language Education, BFSU.

Het nlGLOBE-Corpus

INLEIDING

Het nlGLOBE-Corpus is een uitgebalanceerde verzameling van hedendaagse in het Nederlands geschreven teksten, en telt in totaal één miljoen woorden.

De tekstvoorbeelden in het corpus zijn verzameld en opgeschoond door Jiachen Zhang, Xiaoxiao Lin, Xiao’ou Lei, Zhiyan Zheng en Yunjie Zhang van de School of European Languages and Cultures, Beijing Foreign Studies University (BFSU), China.

De onlineversie van het nlGLOBE-Corpus is beschikbaar op het BFSU CQPweb Corpus Portal (http://114.251.154.212/cqp/). Zowel de gebruikersnaam als het wachtwoord is ‘test’.

BELANGRIJKE INFORMATIE

Projectleider: Jiachen Zhang van de School of European Languages and Cultures, BFSU.

Tekstverzamelaars: Xiaoxiao Lin, Jiachen Zhang, Xiao’ou Lei, Zhiyan Zheng en Yunjie Zhang van de School of European Languages and Cultures en Mingchen Sun van het National Research Centre for Foreign Language Education, BFSU.

Periode van samenstelling: september 2021 - december 2022

Omvang: ongeveer één miljoen woorden

Taal: hedendaags Nederlands

Aantal teksten/voorbeelden: 500 voorbeelden van 2000+ woorden per voorbeeld (korte teksten zijn samengevoegd tot één tekst van 2000 woorden, maar apart opgeslagen en gemarkeerd met A, B, C, etc. in de bestandsnamen).

Versies van het corpus: er zijn drie versies beschikbaar: ruwe teksten, met part-of-speech geannoteerde teksten en gelemmatiseerde teksten. De teksten zijn POS-getagd en gelemmatiseerd met TreeTagger.

Periode: Het overgrote deel van de teksten is gepubliceerd tussen 2012 en 2022. Slechts één tekst is gepubliceerd in 2009.

Uitgebracht in: december 2022

ACHTERGROND

Op 29 december 2021 lanceerde Jiajin Xu het GLOBE (Global Languages Out of BFSU Expertise) Corpus project, een initiatief met als doel het verzamelen van hedendaagse teksten in alle 101 talen die aan BFSU worden onderwezen. Het steekproefkader van het Brown Corpus werd gebruikt opdat het meertalige GLOBE-corpus vergelijkbaar is met de Brown-corpora. Het is de bedoeling dat de primaire toepassing van GLOBE de samenstelling van woordenboeken op basis van een corpus zal zijn. De eerste reeks corpora omvat ongeveer 30 talen.

Tabel 1. Tekstcategorieën in het GLOBE-corpus.

(Adapted from https://varieng.helsinki.fi/CoRD/corpora/BROWN/basic.html)

	Genregroep	Categorie	Inhoud van categorie	Aantal teksten
I. Informatief proza (374)	Journalistiek (88)	A	Reportage	44
		B	Redactioneel	27
		C	Recensie	17
	Algemene teksten (206)	D	Religie	17
		E	Beroepen en hobby's	36
		F	Volksoverlevering	48
		G	Literatuur, biografieën, essays	75
		H	Gevarieerd	30
	Wetenschappelijk (80)	J	Wetenschap	80
II. Fantasierijke teksten (126)	Fictie (126)	K	Algemene fictie	50
		L	Mystery en misdaadroman	12
		M	Sci-fi	12
		N	Avontuur en western	13
		P	Romantiek	30
		R	Humor	9
Totaal				500

Het nlGLOBE-Corpus is een deelproject van de door de BFSU gefinancierde GLOBE Corpus-projecten (Ref. 2022SYLZD015 en 2022SYLPY004) met als hoofdonderzoeker Prof. Jiajin Xu van het National Research Centre for Foreign Language Education, BFSU.
