New Wikipedia corpus coming from Mark Davies

Posted by navs on 2014-11-20 in the "Corpora for Specific Purposes" discussion forum

  1. Exciting possibilities for specialised corpora:

    In about 5-6 weeks I'll be releasing a corpus that is based on the 2 billion words (4.5 million articles) in Wikipedia, which should do most of what you want. Via the web interface, you'll be able to quickly and easily create "virtual corpora" from the 4.5 million articles, based on titles, page links, and/or page content. Each of these virtual, personalized corpora can have up to 1,000 articles and 1.2 million words.
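
To make the idea of a "virtual corpus" concrete, here is a minimal Python sketch of selecting articles by title, page link, or page content while respecting the two stated caps (1,000 articles and 1.2 million words). It is not the actual web interface: the `Article` class, the `build_virtual_corpus` function, and the choice to combine the supplied criteria with AND are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Caps quoted in the announcement; everything else here is hypothetical.
MAX_ARTICLES = 1_000
MAX_WORDS = 1_200_000


@dataclass
class Article:
    title: str
    text: str
    links: list = field(default_factory=list)  # titles of pages this article links to

    @property
    def word_count(self) -> int:
        # Crude whitespace tokenisation, good enough for a size estimate.
        return len(self.text.split())


def build_virtual_corpus(articles, title_term=None, link_target=None,
                         content_term=None, max_articles=MAX_ARTICLES,
                         max_words=MAX_WORDS):
    """Return the articles matching every supplied criterion, within both caps."""
    selected = []
    total_words = 0
    for art in articles:
        if title_term and title_term.lower() not in art.title.lower():
            continue
        if link_target and link_target not in art.links:
            continue
        if content_term and content_term.lower() not in art.text.lower():
            continue
        if len(selected) >= max_articles:
            break  # article cap reached
        if total_words + art.word_count > max_words:
            continue  # this article would exceed the word cap; try the next one
        selected.append(art)
        total_words += art.word_count
    return selected
```

For example, `build_virtual_corpus(articles, title_term="corpus")` would keep only the articles whose title contains "corpus", stopping once either cap is reached.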

    And then you'll be able to search within these virtual corpora (strings, n-grams, collocates, collocations, concordances, etc.), or compare word and phrase frequencies across your virtual corpora, or find keywords (including multi-word expressions) in your corpora, all from within the web interface and all within just a few seconds.
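
As a rough illustration of what finding keywords by comparing two corpora can involve, the sketch below ranks words that are relatively more frequent in a target corpus than in a reference corpus. The announcement does not say which statistic the interface uses; log-likelihood is assumed here only because it is a common choice in corpus linguistics, and the function names `tokenize`, `log_likelihood`, and `keywords` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes (a deliberate simplification).
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, c, d):
    """Log-likelihood score for a word seen a times in a corpus of c tokens
    and b times in a corpus of d tokens."""
    expected_1 = c * (a + b) / (c + d)
    expected_2 = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_1)
    if b > 0:
        score += b * math.log(b / expected_2)
    return 2 * score


def keywords(target_tokens, reference_tokens, top_n=10):
    """Return (word, score) pairs for words overrepresented in the target corpus."""
    target_freq = Counter(target_tokens)
    reference_freq = Counter(reference_tokens)
    c = sum(target_freq.values())
    d = sum(reference_freq.values())
    if c == 0 or d == 0:
        return []
    scored = []
    for word, a in target_freq.items():
        b = reference_freq.get(word, 0)
        if a / c <= b / d:
            continue  # not relatively more frequent in the target corpus
        scored.append((word, log_likelihood(a, b, c, d)))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]
```

The same comparison would extend to multi-word expressions by counting n-grams instead of single tokens before scoring.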

    Anyway, the corpus (and interface) is essentially done now, but I'm just working on the help files, including some tutorials that I'll place on YouTube.