[Repost] The ANC Second Release

tiger

Senior Member
The second release of the American National Corpus includes updated versions of all of the files in the first release plus an additional 10 million new words. However, the second release uses standoff annotations to a much greater extent than did the first release. All documents are now stored logically as annotation graphs with a node set and an edge set. The node set consists of a UTF-16 character stream with an implied node between each pair of characters and at the start and end of the stream. The edge set consists of one or more XML documents that describe the annotations.
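To make the node/edge idea concrete, here is a rough sketch in Python of how standoff annotations could be resolved against a character stream. The XML element and attribute names (`annotations`, `edge`, `from`, `to`, `type`, `pos`) and the helper functions are invented for illustration and are not the ANC's actual schema. Note also that Python strings index by code point, so for text containing characters outside the Basic Multilingual Plane the offsets would differ from UTF-16 code-unit offsets.

```python
import xml.etree.ElementTree as ET

# Primary data: the character stream. Node i sits just before character i,
# so a text of n characters has n + 1 implied nodes (0 .. n).
text = "The cat sat."

# Hypothetical standoff document: each edge links two nodes of the stream.
# These names are assumptions for illustration, not the ANC format.
standoff_xml = """
<annotations>
  <edge from="0" to="3" type="tok" pos="DT"/>
  <edge from="4" to="7" type="tok" pos="NN"/>
  <edge from="8" to="11" type="tok" pos="VBD"/>
  <edge from="0" to="12" type="s"/>
</annotations>
"""

def load_edges(xml_string):
    """Parse the standoff XML into (start_node, end_node, attributes) tuples."""
    root = ET.fromstring(xml_string)
    edges = []
    for el in root.iter("edge"):
        attrs = dict(el.attrib)
        start = int(attrs.pop("from"))
        end = int(attrs.pop("to"))
        edges.append((start, end, attrs))
    return edges

def span_text(stream, start, end):
    """Return the characters lying between two nodes of the stream."""
    if not 0 <= start <= end <= len(stream):
        raise ValueError(f"edge ({start}, {end}) falls outside the node set")
    return stream[start:end]

edges = load_edges(standoff_xml)
tokens = [span_text(text, s, e) for s, e, a in edges if a["type"] == "tok"]
sentences = [span_text(text, s, e) for s, e, a in edges if a["type"] == "s"]

print(tokens)
print(sentences)
```

Because the text itself is never modified, several annotation documents (tokens, sentences, part-of-speech tags) can point into the same stream independently, which is the main appeal of the standoff approach.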

See http://americannationalcorpus.org/2ndrelease.html#.
 
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35
There are some samples on the page above:
For examples of the various types of data in this corpus, please review the files listed below.
 