An overview of Broadcast News corpora


David Graff *
Linguistic Data Consortium, Suite 200, 3615 Market Street, Philadelphia, PA 19104-2608, USA
The LDC began its first Broadcast News (BN) speech collection in the spring of 1996, facing a host of challenges including IPR negotiations with broadcasters, establishment of new transcription conventions and tools, and a compressed schedule for creation and release of speech, transcripts and in-domain language model data. The amount of acoustic training data available for participants in the DARPA Hub4 English benchmark tests doubled from 50 h in 1996 to 100 h in 1997, and doubled again to 200 h in 1998. An additional 40 h has been made available as of the summer of 1999. The 1997 benchmark test also saw the addition of BN speech and transcripts in Spanish and Mandarin Chinese, though in lesser quantity, with 30 h of training data in each language. Supplements to the existing pronunciation lexicons in each language were also produced. More recently, the coordinated research project on topic detection and tracking (TDT) has called for a large collection of BN speech data, totaling about 1100 h in English and 300 h in Mandarin over two phases (TDT2 and TDT3), although the level of detail and quality in the TDT transcriptions is not comparable to that of the Hub4 collections.