BNC Goes XML - introduction
The BNC contains duplicate texts.
A decade or more after its first appearance, the British National Corpus (BNC) is still the most widely-available general-purpose fully-annotated English language corpus and is still very widely used. Technology moves on, however, and the SGML format which we used in 1994, state of the art as it then was, is looking increasingly ancient. More significantly, SGML software is not so easy to find or deploy.
For that reason, we have long planned to re-issue the corpus in XML format. XML is close enough to SGML for the migration to be painless and automatic. Moreover, the range of software available for XML is increasing day by day -- very probably even the software you are using to read this message can handle it; certainly more and more NLP-related tools and resources are produced in it.
The BNC Baby sampler we produced last year was an experimental step in the direction of producing BNC-XML. We are now ready to make the big leap forward, by converting the whole corpus to XML. The plan is to complete this in the next few months and then start distribution of a third edition of the BNC.
Naturally, we would also like to take this opportunity to fix as many as possible of the known errors and identifiable glitches in the existing corpus. We don't have the resources to add more texts or to do a manual proofing and correction of the entire corpus, but we can (and will) fix known systematic markup errors, tidy up misclassifications, remove duplicate texts and so on. Our aim is to fix as many as possible of the errors which impair the usefulness of the BNC as a source for generalizations about the lexicon, for example where the input stream has been wrongly segmented. Because every sentence in the BNC has a unique identifier (the combination of text name and sentence number), we think that many such errors can be fixed without the need for manual intervention.
You can help us by providing us with information about errors you've already noticed. We'd also much appreciate any comments you have about overall ways of improving the BNC in its new XML guise. We have plans already in hand to address the most frequently voiced concerns (e.g. "how do I get rid of the tags?") and will be posting a list of the planned changes on the new website in due course.
If you want to send us notice of specific errors and typos, please send them by email to natcorp@oucs.ox.ac.uk, preferably in a consistent format. Something like the following (for example) would be an ideal way of pointing out that the apostrophe after "horse" in s-unit number 891 of text A6B is in the wrong place:
A6B 891
FOR <w DPS>his <w NN1>horse'<c PUN>
READ <w DPS>his <w NN1>horse<c PUQ>'<c PUN>.
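For anyone collecting several such reports, the three-line format above is simple enough to check by script before sending. The following is only a minimal sketch in Python, not a tool provided by the BNC project: the names ErrorReport, parse_report and apply_report are hypothetical, and it assumes that a report is exactly three non-empty lines (identifier, FOR, READ) and that the FOR string occurs exactly once in the s-unit being corrected.

```python
import re
from typing import NamedTuple


class ErrorReport(NamedTuple):
    """One correction in the three-line format (hypothetical structure)."""
    text_id: str  # text name, e.g. "A6B"
    s_unit: int   # sentence number within that text, e.g. 891
    wrong: str    # markup as it currently stands (the FOR line)
    right: str    # markup as it should read (the READ line)


def parse_report(report: str) -> ErrorReport:
    """Parse 'ID NUMBER / FOR ... / READ ...' into an ErrorReport."""
    lines = [ln.strip() for ln in report.splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError("expected three lines: identifier, FOR, READ")
    match = re.fullmatch(r"(\S+)\s+(\d+)", lines[0])
    if match is None:
        raise ValueError("first line must be '<text name> <sentence number>'")
    if not lines[1].startswith("FOR ") or not lines[2].startswith("READ "):
        raise ValueError("second and third lines must start with FOR and READ")
    return ErrorReport(
        text_id=match.group(1),
        s_unit=int(match.group(2)),
        wrong=lines[1][len("FOR "):],
        right=lines[2][len("READ "):],
    )


def apply_report(s_unit_markup: str, report: ErrorReport) -> str:
    """Apply the correction to the markup of one s-unit.

    Assumption: the FOR string must occur exactly once, so that the
    replacement is unambiguous; otherwise the report is rejected.
    """
    if s_unit_markup.count(report.wrong) != 1:
        raise ValueError("FOR string must occur exactly once in the s-unit")
    return s_unit_markup.replace(report.wrong, report.right)


if __name__ == "__main__":
    sample = """A6B 891
FOR <w DPS>his <w NN1>horse'<c PUN>
READ <w DPS>his <w NN1>horse<c PUQ>'<c PUN>."""
    print(parse_report(sample))
```

Because the text name and sentence number together identify the s-unit, a parsed report carries everything needed to locate the line to correct; how the s-unit markup itself is looked up is left open here.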
Reports of more general errors are also very welcome, of course, and should be sent to the same address.