View Content #18985

Contentid: 18985
Content Type: 1
Title: The Wikipedia Corpus
Body

From http://corpus.byu.edu/

Brigham Young University has recently released the BYU Wikipedia Corpus, which contains 1.9 billion words in 4.4 million articles. With this new corpus, you can search Wikipedia in all of the ways that you can search the other BYU corpora: by word and phrase, part of speech, variable strings, synonyms, comparisons of words, collocates, and concordance lines.

Most importantly, however, with this interface you can quickly and easily create and then search personalized "virtual corpora" drawn from the 4,400,000 web pages. For example, in just a few seconds you could create a corpus of 500-1,000 pages (perhaps 500,000-1,000,000 words) related to any topic. You can also modify any of these corpora by adding, deleting, or moving texts, or by creating groups of corpora.

You can then limit your search to just that portion of Wikipedia, to see collocates or concordance lines from just that virtual corpus. You can also compare the frequency of words and phrases across these different virtual corpora. And perhaps best of all, you can quickly and easily create keyword lists for these corpora, including multi-word expressions.
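To illustrate what a keyword list for a virtual corpus involves, here is a minimal sketch in Python. It is not the method used by the BYU interface, which the announcement does not describe. It assumes a keyword is a word whose relative frequency in the topic texts is much higher than in a set of reference texts. The function names, the add-one smoothing, and the `min_count` threshold are all hypothetical choices made for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_list(topic_texts, reference_texts, top_n=10, min_count=2):
    """Rank words by how much more frequent they are in the topic texts.

    Hypothetical helper, not the corpus interface's own method: the score
    is a ratio of relative frequencies with add-one smoothing.
    """
    topic = Counter(t for text in topic_texts for t in tokenize(text))
    reference = Counter(t for text in reference_texts for t in tokenize(text))
    topic_total = sum(topic.values())
    reference_total = sum(reference.values())
    vocab_size = len(set(topic) | set(reference))

    scores = {}
    for word, count in topic.items():
        if count < min_count:
            continue  # skip words too rare in the topic texts to be reliable
        topic_rate = (count + 1) / (topic_total + vocab_size)
        reference_rate = (reference[word] + 1) / (reference_total + vocab_size)
        scores[word] = topic_rate / reference_rate
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Calling `keyword_list(topic_pages, other_pages)` with two lists of page texts returns the words most characteristic of the topic pages. A real keyword tool would more likely use a statistical measure such as log-likelihood and would also handle multi-word expressions, which this sketch does not.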

So rather than having to scour the Web to find pages for a corpus on a given topic, you can now just create a corpus from the relevant pages in Wikipedia. You can then use the data from the new Wikipedia corpus to focus on the words and phrases of that particular topic.

Access the corpus at http://corpus.byu.edu/wiki/

Source: Mark Davies, Brigham Young University
Inputdate: 2015-02-06 15:28:41
Lastmodifieddate: 2015-02-09 03:15:19
Expdate: Not set
Publishdate: 2015-02-09 02:15:01
Displaydate: 2015-02-09 00:00:00
Active: 1
Emailed: 1
Isarchived: 0