[Help] Two questions about BNC-Baby

chrisyang

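Regular member
Two questions about BNC-Baby:

1. The BNC-Baby design page (http://www.natcorp.ox.ac.uk/corpus/baby/baby-des.html) says the spoken component contains 30 texts, but only 29 can be found in the text listing (http://www.natcorp.ox.ac.uk/corpus/baby/thebib.html). What is the filename of the missing text?
Below are the filenames of those 29 texts (already formatted for use with WST4):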
D:\BNC WORLD\Texts\K\KB\KB5
D:\BNC WORLD\Texts\K\KB\KB7
D:\BNC WORLD\Texts\K\KB\KBC
D:\BNC WORLD\Texts\K\KB\KBD
D:\BNC WORLD\Texts\K\KB\KBH
D:\BNC WORLD\Texts\K\KB\KBJ
D:\BNC WORLD\Texts\K\KB\KBP
D:\BNC WORLD\Texts\K\KB\KBW
D:\BNC WORLD\Texts\K\KC\KCC
D:\BNC WORLD\Texts\K\KC\KCF
D:\BNC WORLD\Texts\K\KC\KCU
D:\BNC WORLD\Texts\K\KC\KCV
D:\BNC WORLD\Texts\K\KD\KD0
D:\BNC WORLD\Texts\K\KD\KD1
D:\BNC WORLD\Texts\K\KD\KD3
D:\BNC WORLD\Texts\K\KD\KD7
D:\BNC WORLD\Texts\K\KD\KD8
D:\BNC WORLD\Texts\K\KD\KDD
D:\BNC WORLD\Texts\K\KD\KDF
D:\BNC WORLD\Texts\K\KD\KDJ
D:\BNC WORLD\Texts\K\KE\KE2
D:\BNC WORLD\Texts\K\KE\KE4
D:\BNC WORLD\Texts\K\KN\KNR
D:\BNC WORLD\Texts\K\KP\KP2
D:\BNC WORLD\Texts\K\KP\KP5
D:\BNC WORLD\Texts\K\KP\KP7
D:\BNC WORLD\Texts\K\KP\KPU
D:\BNC WORLD\Texts\K\KP\KPX
D:\BNC WORLD\Texts\K\KS\KSN

2. The BNC-Baby web pages say the corpus contains 128 texts (see 7.2. BNC-baby delivery format at http://www.natcorp.ox.ac.uk/corpus/baby/cdifsmop.html), but according to the text information given at http://www.natcorp.ox.ac.uk/corpus/baby/thebib.html, the total should be 181: 29 spoken, 30 academic, 25 fiction, and 97 newspaper. So how many texts does BNC-Baby actually contain: 128, 181, or 182?

Also: could any fellow member who has the BNC Sampler to hand tell me which texts it contains? Could you upload the filenames?
 
Re: [Help] Two questions about BNC-Baby

The BNC-Baby has 184 textual files:
<file_index>
<file>A7V.xml</file>
<file>A87.xml</file>
<file>A8J.xml</file>
<file>A8W.xml</file>
<file>A95.xml</file>
<file>A9E.xml</file>
<file>A9M.xml</file>
<file>A9V.xml</file>
<file>AA4.xml</file>
<file>AAB.xml</file>
<file>AAK.xml</file>
<file>AAT.xml</file>
<file>AEA.xml</file>
<file>ALS.xml</file>
<file>AP6.xml</file>
<file>APJ.xml</file>
<file>B2E.xml</file>
<file>BMJ.xml</file>
<file>BP6.xml</file>
<file>C9C.xml</file>
<file>CAA.xml</file>
<file>CBB.xml</file>
<file>CCD.xml</file>
<file>CDH.xml</file>
<file>CEL.xml</file>
<file>CF5.xml</file>
<file>CF6.xml</file>
<file>CF7.xml</file>
<file>CF8.xml</file>
<file>CF9.xml</file>
<file>CHP.xml</file>
<file>CHR.xml</file>
<file>CL8.xml</file>
<file>CN4.xml</file>
<file>DCH.xml</file>
<file>EAP.xml</file>
<file>EBK.xml</file>
<file>EVR.xml</file>
<file>EVY.xml</file>
<file>EW4.xml</file>
<file>EX7.xml</file>
<file>F71.xml</file>
<file>F77.xml</file>
<file>F7G.xml</file>
<file>F7J.xml</file>
<file>F86.xml</file>
<file>F98.xml</file>
<file>F9M.xml</file>
<file>FA4.xml</file>
<file>FB4.xml</file>
<file>FCF.xml</file>
<file>FEJ.xml</file>
<file>FL6.xml</file>
<file>FLK.xml</file>
<file>FLS.xml</file>
<file>FLU.xml</file>
<file>FLY.xml</file>
<file>FM4.xml</file>
<file>FM7.xml</file>
<file>FMP.xml</file>
<file>FMS.xml</file>
<file>FR2.xml</file>
<file>FRY.xml</file>
<file>FSB.xml</file>
<file>FU0.xml</file>
<file>FU6.xml</file>
<file>FU7.xml</file>
<file>FU9.xml</file>
<file>FUG.xml</file>
<file>FUH.xml</file>
<file>FUT.xml</file>
<file>FUU.xml</file>
<file>FX5.xml</file>
<file>FX6.xml</file>
<file>FXR.xml</file>
<file>FY8.xml</file>
<file>FYJ.xml</file>
<file>G0A.xml</file>
<file>G0C.xml</file>
<file>G0K.xml</file>
<file>G11.xml</file>
<file>G1V.xml</file>
<file>G22.xml</file>
<file>G2R.xml</file>
<file>G3N.xml</file>
<file>G3U.xml</file>
<file>G4K.xml</file>
<file>G4N.xml</file>
<file>G5A.xml</file>
<file>G63.xml</file>
<file>GT9.xml</file>
<file>GUB.xml</file>
<file>GUL.xml</file>
<file>GV1.xml</file>
<file>GV9.xml</file>
<file>GW5.xml</file>
<file>GWA.xml</file>
<file>GX0.xml</file>
<file>GX4.xml</file>
<file>H0H.xml</file>
<file>H0S.xml</file>
<file>H13.xml</file>
<file>H47.xml</file>
<file>H4A.xml</file>
<file>H5D.xml</file>
<file>H7C.xml</file>
<file>H8W.xml</file>
<file>HDF.xml</file>
<file>HDG.xml</file>
<file>HDT.xml</file>
<file>HE3.xml</file>
<file>HE4.xml</file>
<file>HEM.xml</file>
<file>HLW.xml</file>
<file>HM4.xml</file>
<file>HXN.xml</file>
<file>HY1.xml</file>
<file>HYF.xml</file>
<file>J1L.xml</file>
<file>J1N.xml</file>
<file>J24.xml</file>
<file>J2G.xml</file>
<file>J2H.xml</file>
<file>J2J.xml</file>
<file>J3W.xml</file>
<file>J44.xml</file>
<file>J55.xml</file>
<file>J6W.xml</file>
<file>J8G.xml</file>
<file>J97.xml</file>
<file>JJA.xml</file>
<file>JJS.xml</file>
<file>JJV.xml</file>
<file>JJW.xml</file>
<file>JNG.xml</file>
<file>JNM.xml</file>
<file>JXL.xml</file>
<file>KB1.xml</file>
<file>KB2.xml</file>
<file>KB3.xml</file>
<file>KB8.xml</file>
<file>KB9.xml</file>
<file>KBF.xml</file>
<file>KBG.xml</file>
<file>KBK.xml</file>
<file>KBL.xml</file>
<file>KBU.xml</file>
<file>KBX.xml</file>
<file>KC0.xml</file>
<file>KC1.xml</file>
<file>KC2.xml</file>
<file>KC3.xml</file>
<file>KC4.xml</file>
<file>KC7.xml</file>
<file>KC8.xml</file>
<file>KCA.xml</file>
<file>KCB.xml</file>
<file>KCE.xml</file>
<file>KCG.xml</file>
<file>KCH.xml</file>
<file>KCL.xml</file>
<file>KCN.xml</file>
<file>KCT.xml</file>
<file>KCU.xml</file>
<file>KCV.xml</file>
<file>KCX.xml</file>
<file>KCY.xml</file>
<file>KD0.xml</file>
<file>KD1.xml</file>
<file>KD2.xml</file>
<file>KD3.xml</file>
<file>KD5.xml</file>
<file>KD6.xml</file>
<file>KD8.xml</file>
<file>KDH.xml</file>
<file>KDM.xml</file>
<file>KDN.xml</file>
<file>KDU.xml</file>
<file>KE3.xml</file>
<file>KP6.xml</file>
<file>KP8.xml</file>
<file>KPD.xml</file>
<file>KPG.xml</file>
<file>KST.xml</file>
</file_index>
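
A minimal sketch of how an index like the one above could be counted, assuming it is saved as well-formed XML with one `<file>` element per text. The function name `list_text_ids` and the three-entry sample are illustrative only; the sample reuses the first two entries and the last entry of the list above.

```python
import xml.etree.ElementTree as ET

# Three-entry sample in the same shape as the index above (illustrative only).
SAMPLE_INDEX = """<file_index>
<file>A7V.xml</file>
<file>A87.xml</file>
<file>KST.xml</file>
</file_index>"""


def list_text_ids(xml_text):
    """Return the text IDs (filenames without '.xml') named in a file index."""
    root = ET.fromstring(xml_text)
    ids = []
    for element in root.iter("file"):
        name = (element.text or "").strip()
        if name.endswith(".xml"):
            name = name[: -len(".xml")]
        ids.append(name)
    return ids


ids = list_text_ids(SAMPLE_INDEX)
print(len(ids), ids)
```

Run on the full index, `len(ids)` gives the number of files, and the ID list can then be compared against a published text listing.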
 
Re: [Help] Two questions about BNC-Baby

Here is a copy of BNC Sampler documentation:

Document SAMPLER.HTM
UCREL, Lancaster University, Lancaster LA1 4YT, UK
March 1998
The British National Corpus Sampler Corpus: Explanatory documentation

1. Introduction

The BNC Sampler Corpus is a subcorpus of the British National Corpus, consisting of approximately one-fiftieth of the whole corpus, viz. two million words. The Sampler Corpus is word-class tagged, using a more detailed tagset than has been used for the BNC as a whole. It also has the advantage that all the word-class tags assigned to words have been manually checked and, where necessary, corrected. Consequently, the number of errors in word-class tagging must be very small. The advantages of the Sampler Corpus, then, are the following:
  • It has been word-class-tagged using the C7 tagset, a more detailed tagset of 135 tags (plus 12 punctuation tags), instead of the C5 tagset with 61 tags (plus punctuation tags).
  • The word-class tags have been hand-checked and corrected, so that errors are minimal.
  • There is approximately a 50%-50% division of the Sampler Corpus into written and spoken materials. This is a better balance than the 90%-10% division of the whole BNC.
  • Although the Sampler Corpus lacks much of the detail and variety of the entire BNC, it does contain a wide and balanced sampling of texts from the BNC, so as to maintain the general text types and the proportions of general text types (apart from the unequal written/spoken division) of the BNC as a whole.
  • Experience suggests that for many research and application purposes, a small-scale BNC such as the Sampler provides is a more convenient corpus to use than the whole 100-million-word BNC.

2. The Constitution of the BNC Sampler Corpus

The Sampler Corpus consists of the following text categories. In Table 2:1, the number of words of each category is added.


Table 2:1 - BNC Sampler Corpus (2,001,394 words)

SPOKEN (990,704 words)
  Context-Governed (496,852)
    Leisure (136,606)
    Educational (80,463)
    Business (134,275)
    Public/Institutional (145,508)
  Demographic (493,852) [by socio-economic class]
    I (AB) (164,933)
    II (C1) (98,700)
    III (C2) (137,686)
    IV (DE) (92,533)

WRITTEN (1,010,690 words)
  Imaginative (231,663)
    Drama (23,786)
    Poetry (30,144)
    Prose Fiction (177,733)
  Informative (779,027)
    Pure science (32,974)
    Applied science (117,685)
    Social science (29,868)
    World affairs (277,128)
    Commerce & finance (92,057)
    Arts (51,645)
    Belief & thought (43,626)
    Leisure (134,044)
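
As a quick consistency check on Table 2:1, the sketch below sums the subcategory word counts and compares them with the stated totals. The figures are copied from the table; the dictionary layout and the names `TABLE`, `STATED` and `check_totals` are illustrative and not part of the documentation.

```python
# Word counts copied from Table 2:1 (layout and names are illustrative).
TABLE = {
    "Context-Governed": {
        "Leisure": 136_606, "Educational": 80_463,
        "Business": 134_275, "Public/Institutional": 145_508,
    },
    "Demographic": {
        "I (AB)": 164_933, "II (C1)": 98_700,
        "III (C2)": 137_686, "IV (DE)": 92_533,
    },
    "Imaginative": {
        "Drama": 23_786, "Poetry": 30_144, "Prose Fiction": 177_733,
    },
    "Informative": {
        "Pure science": 32_974, "Applied science": 117_685,
        "Social science": 29_868, "World affairs": 277_128,
        "Commerce & finance": 92_057, "Arts": 51_645,
        "Belief & thought": 43_626, "Leisure": 134_044,
    },
}

# Totals as stated in the table headings.
STATED = {
    "Context-Governed": 496_852, "Demographic": 493_852,
    "Imaginative": 231_663, "Informative": 779_027,
}


def check_totals(table, stated):
    """Map each category to True if its rows sum to the stated total."""
    return {name: sum(rows.values()) == stated[name] for name, rows in table.items()}


result = check_totals(TABLE, STATED)
spoken = STATED["Context-Governed"] + STATED["Demographic"]
written = STATED["Imaginative"] + STATED["Informative"]
print(result)
print(spoken, written, spoken + written)  # 990704 1010690 2001394
```

With these figures, every category sums to its stated total, and the spoken and written totals add up to the 2,001,394 words given in the table title.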



To maintain comparability with the whole BNC, as well as the integrity of the text samples already in the BNC, it was decided to avoid dividing these documents up into smaller extracts for the purposes of the Sampler Corpus. This meant that the selection of individual texts to form part of the Sampler had to be constrained by the size of text compared with the amount of 'room' in the Sampler for a particular text type. Another constraint was that the Sampler Corpus had to be compiled in 1994, at a time when the BNC was under development, and there was virtually no access to bibliographical information regarding the contents of the BNC. The consequence was that the choice of texts could not be determined by random sampling methods, but that a member of the research team had to select texts by human inspection. Within these constraints, the texts for the Sampler were chosen so as to copy both in size and in content the varieties and proportions of text types in the whole BNC.

2.1 List of texts in the Sampler Corpus

A. Spoken

A.1 Context-governed sampling

Leisure

FL6, FLK, FX5, FX6, FXR, FY8, FYJ, G4N, G5A, G63, HE3, HE4, HEM, HM4, J8G

Business

F7J, FLS, FUG, G3U, H47, H5D, HDF, HDG, HDT, HLW, HYF, J3W, J97

Educational/Informative

F71, F77, F7G, FLY, FM4, FM7, FUH, G4K, JJS

Public/Institutional

DCH, F86, FLU, FMP, FMS, FUT, FUU, H4A, J44, JJA, JJV, JJW, JNG, JNM

A.2 Demographic sampling

respondent's socio-economic class = A or B
KB8, KBU, KC4, KCV, KP6, KPG, KBK, KCB, KCH, KDU, KP8, KST, KB3, KC0, KC3, KC8

respondent's socio-economic class = C1
KBG, KCN, KD0, KB9, KD5, KBL, KD2, KDM, KE3

respondent's socio-economic class = C2
KC1, KCG, KD3, KD8, KBX, KCT, KCX, KD1, KDH, KBF, KCE, KCL, KCY, KPD

respondent's socio-economic class = D or E
KB1, KCA, KDN, KB2, KC2, KC7, KCU, KD6


B. Written

B.1 Imaginative

Drama FU6

Poetry CHX, F9M, G11, G1V

Prose Fiction AEA, ALS, CCD, CHR, FRY, FSB, G0A, GUL, GV9, GW5, GWA, J2G


B.2 Informative

Pure science
FU0, FU9, J2H, J2J

Applied science
CF5, CF7, CF8, CL8, EAP, F98, FA4, FR2, G0K, G3N, H0H, H0S

Belief and thought
CBB, EBK, EVR, GX0

Commerce and finance
AP6, CEL, EVY, FEJ, G0C, GX4, HY1, J24, J6W

Arts
CF6, CN4, J1L, J55

Community and Social science
APJ, EX7, FCF, H8W, JXL

World affairs
A7V, A87, A8J, A8W, A95, A9E, A9M, A9V, AA4, AAB, AAK, AAT, B2E, BMJ, CHP,
EW4, FB4, FU7, G2R, GT9, H7C, HXN

Leisure
BP6, C9C, CAA, CDH, CF9, G22, GUB, GV1, H13, J1N


3. Further information
Detailed information on the word-class tagging of the BNC Sampler Corpus is given in a separate document, CLAWS-C7.
 
Re: [Help] Two questions about BNC-Baby

The BNC Sampler and BNC-Baby are different products (subcorpora) drawn from the BNC.

The spoken component of BNC-Baby (demographically sampled) has 30 texts.
The filenames are listed below:
kb5.xml
kb7.xml
kbc.xml
kbd.xml
kbh.xml
kbj.xml
kbp.xml
kbw.xml
kcc.xml
kcf.xml
kcu.xml
kcv.xml
kd0.xml
kd1.xml
kd3.xml
kd7.xml
kd8.xml
kdd.xml
kdf.xml
kdj.xml
ke2.xml
ke4.xml
knr.xml
kp2.xml
kp5.xml
kp7.xml
kpu.xml
kpx.xml
ksn.xml
ksw.xml

The entire BNCbaby consists of 182 texts.

The BNC sampler has 184 files.
 
Re: [Help] Two questions about BNC-Baby

The BNC sampler has 184 files.

Plz find the filenames in the attached text files.
 

Attachments

  • Context Governed.TXT
    255 bytes · Views: 31
  • Demographic.TXT
    235 bytes · Views: 13
  • Imaginative.TXT
    80 bytes · Views: 21
  • Informative.TXT
    350 bytes · Views: 21
Re: [Help] Two questions about BNC-Baby

Many thanks to Dr. Xiao and Dr. Xu! But how come some of the 30 spoken filenames provided by Dr. Xu are not included in the list posted by Dr. Xiao? After reading Dr. Xu's attachments carefully, I think the 184 filenames mentioned by Richard should be the exact components of BNC Sampler instead of BNC-baby. Am I right? And the only filename not covered at http://www.natcorp.ox.ac.uk/corpus/baby/thebib.html should be "KSW".
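
A minimal sketch of that comparison, assuming the two spoken lists are exactly as posted in this thread (the 29 IDs from the first post and Dr. Xu's 30 filenames with the `.xml` extension dropped). The names `WEB_LIST`, `XU_LIST` and `missing_ids` are made up for illustration, and the academic, fiction and newspaper counts are the ones quoted in the first post.

```python
# IDs from the 29-entry list in the first post (taken from thebib.html).
WEB_LIST = """KB5 KB7 KBC KBD KBH KBJ KBP KBW KCC KCF KCU KCV KD0 KD1 KD3
KD7 KD8 KDD KDF KDJ KE2 KE4 KNR KP2 KP5 KP7 KPU KPX KSN""".split()

# IDs from Dr. Xu's 30-entry list, with the '.xml' extension dropped.
XU_LIST = """kb5 kb7 kbc kbd kbh kbj kbp kbw kcc kcf kcu kcv kd0 kd1 kd3
kd7 kd8 kdd kdf kdj ke2 ke4 knr kp2 kp5 kp7 kpu kpx ksn ksw""".split()


def missing_ids(reference, candidate):
    """IDs present in `reference` but absent from `candidate` (case-insensitive)."""
    return sorted({i.upper() for i in reference} - {i.upper() for i in candidate})


missing = missing_ids(XU_LIST, WEB_LIST)
print(missing)  # ['KSW']

# Per-genre counts as given in this thread, using the 30 spoken texts.
counts = {"spoken": len(XU_LIST), "academic": 30, "fiction": 25, "newspaper": 97}
total = sum(counts.values())
print(total)  # 182
```

With 30 spoken texts instead of 29, the total comes to 182, which matches the figure Dr. Xu gives for the whole of BNC-Baby.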
 
Re: [Help] Two questions about BNC-Baby

I think the 184 filenames mentioned by Richard should be the exact components of BNC Sampler instead of BNC-baby. Am I right?

I think you are right.

One thing is clear: the BNC Sampler consists of two folders/subcorpora, spoken (context and demo) and written (imagin and inform), totalling 2 million words, while BNC-Baby consists of four folders/subcorpora (aca, dem, fic, and news), totalling 4 million words.

http://www.natcorp.ox.ac.uk/corpus/baby/BNCintro.html

http://www.natcorp.ox.ac.uk/corpus/index.xml.ID=products#sampler
 
Re: [Help] Two questions about BNC-Baby

You are quite right. The 184 files were for the Sampler - I went to the wrong folder! Apologies.
 
Re: [Help] Two questions about BNC-Baby

I'm most appreciative of your timely response and patient explanation!
 