Some news items from the BNC, hopefully of interest to this list (please
redistribute)
1. BNC Goes XML
2. New BNC website
3. New release of Xaira
4. A postscript on licensing
1. BNC Goes XML
A decade or more after its first appearance, the British National Corpus
(BNC) is still the most widely-available general-purpose fully-annotated
English language corpus and is still very widely used. Technology moves
on, however, and the SGML format which we used in 1994, state of the art
as it then was, is looking increasingly ancient. More significantly,
SGML software is not so easy to find or deploy.
For that reason, we have long planned to re-issue the corpus in XML
format. XML is close enough to SGML for the migration to be painless and
automatic. Moreover, the range of software available for XML is
increasing day by day -- very probably even the software you are using
to read this message can handle it; certainly more and more NLP-related
tools and resources are produced in it.
The BNC Baby sampler we produced last year was an experimental step in
the direction of producing BNC-XML. We are now ready to make the big
leap forward, by converting the whole corpus to XML. The plan is to
complete this in the next few months and to start distribution of a
third edition of the BNC early this summer.
Naturally, we would also like to take this opportunity to fix as many
known errors and identifiable glitches in the existing corpus as possible.
We don't have the resources to add more texts or to do a manual proofing
and correction of the entire corpus, but we can (and will) fix known
systematic markup errors, tidy up misclassifications, remove duplicate
texts and so on. Our aim is to fix as many as possible of the errors
which impair the usefulness of the BNC as a source for generalizations
about the lexicon, for example where the input stream has been wrongly
segmented. Because every sentence in the BNC has a unique identifier
(the combination of text name and sentence number), we think that many
such errors can be fixed without the need for manual intervention.
You can help us by providing us with information about errors you've
already noticed. We'd also much appreciate any comments you have about
overall ways of improving the BNC in its new XML guise. We have plans
already in hand to address the most frequently voiced concerns (e.g. "how
do I get rid of the tags?") and will be posting a list of the planned
changes on the new website in due course.
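In the meantime, for readers who just want plain text from the current SGML-style markup, here is a minimal sketch in Python. It is not the solution we have planned for the XML edition; it simply assumes that tags look like the <w DPS> and <c PUN> examples used in this message and deletes anything between angle brackets. The function name strip_tags is hypothetical, and character entities (such as &equo;) are left untouched.

```python
import re

# Assumption: tags are of the form <w DPS> or <c PUN>, with no ">" inside them.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(marked_up: str) -> str:
    """Return the text with every <...> tag removed; entities are kept as-is."""
    return TAG_PATTERN.sub("", marked_up)

if __name__ == "__main__":
    print(strip_tags("<w DPS>his <w NN1>horse<c PUN>."))
```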
If you want to send us notice of specific errors and typos, please send
them by email to natcorp@oucs.ox.ac.uk, preferably in a consistent
format. Something like the following (for example) would be an ideal way
of pointing out that the apostrophe after "horse" in s-unit number 891
of text A6B is in the wrong place:
A6B 891
FOR <w DPS>his <w NN1>horse'<c PUN>.
READ <w DPS>his <w NN1>horse<c PUQ>&equo;<c PUN>.
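If you are collecting many corrections, the three-line layout above is easy to generate by script. The following Python sketch is only an illustration under our own assumptions: the function name format_report and its parameters are hypothetical, and the layout is simply the one shown in the example (text name and s-unit number, then a FOR line and a READ line).

```python
def format_report(text_id: str, s_unit: int, found: str, corrected: str) -> str:
    """Build one error report: location line, FOR line, READ line."""
    return "\n".join([
        f"{text_id} {s_unit}",   # text name plus sentence number identifies the s-unit
        f"FOR {found}",          # the markup as it currently appears in the corpus
        f"READ {corrected}",     # the markup as it should appear
    ])

if __name__ == "__main__":
    report = format_report(
        "A6B",
        891,
        "<w DPS>his <w NN1>horse'<c PUN>.",
        "<w DPS>his <w NN1>horse<c PUQ>&equo;<c PUN>.",
    )
    print(report)
```

Several reports built this way can be separated by blank lines and pasted into a single email.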
Reports of more general errors are also very welcome, of course, and
should be sent to the same address.
Deadline for sending in reports of BNC errors and typos: 15 March 2006.
**** There will be a Prize draw for all those who contribute error ****
**** reports! Be first to get a (free!) copy of the new BNC!       ****
(Yes, if you have already reported a mistake in the past, you can send
it to us again to be entered for the prize draw!)
2. New BNC Website
We (Ylva mostly) have also been working hard on bringing the BNC website
up to date. This is now also managed in XML, which makes maintaining a
consistent design easier as well as simplifying the authoring task.
Please take a look at http://www-dev.natcorp.ox.ac.uk
and give us your feedback at natcorp@oucs.ox.ac.uk -- all being well, we
will switch the current address to point to this new site within a week
or two.
3. New release of Xaira
A new release of Xaira, the software which developed out of SARA into a
general purpose XML corpus query tool, is now available for download
from http://xaira.sf.net
This (1.17) is the version we will be using to index the BNC XML
edition, and which we will distribute with it. Xaira can be used to
index any XML corpus, not just the BNC; it has also been used for XML
corpora in Chinese, Sanskrit, Hungarian, and many other languages.
Xaira will work with any kind of XML markup, not just BNC style. It
also includes a number of new features which were not possible in Sara,
notably better facilities for collocation searching and subcorpus
manipulation.
Xaira will run standalone or networked on 32-bit versions of Windows
(W2K, XP). A range of interfaces is available for other platforms: the
server has been installed on various flavours of Unix, including Mac
OS X. Simple PHP and Java clients are included, demonstrating how Xaira
can be built into a web services architecture.
4. A Postscript on Licensing
* Xaira is open source software licensed under the GNU General Public
License (GPL).
* The BNC XML edition will be distributed under the same licensing
conditions and pricing structure as the current BNC World edition.
* If you took out a licence for the BNC World Edition within six months
of the date of release of the BNC XML Edition, you will receive a free
upgrade to the XML Edition and a new licence.
* We expect to maintain support for the BNC World Edition for six months
after the release date of the BNC XML Edition. Licences for the BNC
World edition will then start to expire.
Lou Burnard and Ylva Berglund
British National Corpus
Oxford University Computing Services
13 Banbury Rd
Oxford OX2 6NN
Email: natcorp@oucs.ox.ac.uk
Fax: +44 (0)1865 273 275