Friday, 10 September 2010

Corpus of Historical American English

This came round on a distribution list and I thought it might be handy for those of you who use corpora. (I see they have Spanish and Portuguese corpora on the site too.)

We are pleased to announce the release of the 400 million word Corpus of Historical American English (1810-2009). The corpus has been funded by a generous grant from the US National Endowment for the Humanities (NEH), and it is freely available. COHA is the largest structured corpus of historical English, and it contains more than 100,000 texts from fiction, popular magazines, newspapers, and non-fiction books, with the same genre balance decade by decade from the 1810s-2000s.

COHA is also related to other large corpora that we have created or modified, including the 410 million word Corpus of Contemporary American English (COCA), the 100 million word TIME Magazine Corpus (1920s-2000s), the 100 million word British National Corpus (our architecture and interface), the 100 million word NEH-funded Corpus del Español (1200s-1900s), and the 45 million word NEH-funded Corpus do Português (1300s-1900s). For information on these corpora, see

COHA allows you to quickly and easily search the 400 million words of text from the 1810s-2000s to see how words, phrases and grammatical constructions have increased or decreased in frequency, how words have changed meaning over time, and how stylistic changes have taken place in the language. Users can see the overall (normalized) frequency by decade and year, as well as the frequency of each matching string, by decade.
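To make "normalized frequency by decade" concrete, here is a minimal Python sketch. It is not COHA's implementation: the function names, the per-million scaling, and the toy sentences are all assumptions made for illustration. The idea is simply to divide a word's count in each decade by the total number of words in that decade, so that decades of different sizes can be compared.

```python
from collections import Counter


def decade_of(year: int) -> int:
    """Map a year to the first year of its decade, e.g. 1823 -> 1820."""
    return year - year % 10


def per_million_by_decade(docs, target):
    """Return {decade: occurrences of `target` per million words}.

    `docs` is an iterable of (year, list_of_tokens) pairs. This layout
    is a hypothetical stand-in, not the corpus's real data format.
    """
    hits = Counter()    # occurrences of the target word per decade
    totals = Counter()  # total tokens per decade
    target = target.lower()
    for year, tokens in docs:
        decade = decade_of(year)
        totals[decade] += len(tokens)
        hits[decade] += sum(1 for t in tokens if t.lower() == target)
    return {
        d: hits[d] * 1_000_000 / totals[d]
        for d in sorted(totals)
        if totals[d]
    }


# Toy data, invented for illustration only.
toy_docs = [
    (1823, "the lad pressed the letter to his bosom".split()),
    (1827, "her bosom friend was a quaint lad".split()),
    (1994, "there were a lot of skills to learn".split()),
]
freqs = per_million_by_decade(toy_docs, "bosom")
for decade, rate in freqs.items():
    print(f"{decade}s: {rate:.0f} per million words")
```

With only three toy sentences the rates are meaninglessly large; the point is the shape of the calculation, not the numbers.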

The following are just a small sample of an unlimited number of queries, but they should give some idea of what the corpus can do.

* Lexical change: the rise and fall of words and phrases like the following:
  - (decrease since the 1800s): bosom, folly, grieved, bestow*, quaint, beauteous, fellow, sublime, lad, many a time, of no little, for (conj)
  - (an increase and then decrease): mustn't, naughty, boyish, agog, toddle, far-out, famed, wangle, swell (adj), lousy
  - (an increase to the present time): a lot of, unleash, sexual, calm down, screw up, freak out, mommy, skills, frustrating
  - (words reflecting historical and cultural shifts): emancipation, steamship, telegraph, flapper*, fascis*, teenage*, communis*, global warming

* Stylistic change (which gives the flavor of a different time period). Examples from the 1800s, which have decreased since then, are: [so ADJ as to V] (so good as to show me), [PRON be but] (they are but the last examples), [have quite V-ed] (until she had quite finished), [NOUN be that of] (her dress was that of a beggar), or [a most ADJ NOUN] (a most helpful child).

* Morphological change: how word roots, prefixes, and suffixes have been used over time, including comparisons between different periods, such as -heart- (1800s noble-hearted, 1900s heart-stopping), home- (1800s homebred, 1900s homeowner), or -able adjectives (1800s placable, 1900s predictable).

* Syntactic change (since the corpus is tagged and lemmatized), like [end up V-ing], [going to V], [V PRON into V-ing] (talked them into going), phrasal verbs with [up] (make up, show up), post-verbal negation with [need] (needn't mention), the 'get' passive (get hired), sentence-initial 'hopefully', and semi-modals like [need to] and [have to].

* Semantic change: how the meaning or usage of words has changed over time, by looking at changes in collocates (co-occurring words), like [sexual, gay, chip, engine, or web]. This can also signal cultural changes over time, such as nouns used with [woman] in the 1930s-50s compared to the 1960s-80s (fabrics, hips // liberation, abortion), or nouns used with [problem] in the 1810s-1920s compared to the 1920s-2000s (railway, trust // drugs, pollution).

* Lexical change (again): users can also have the corpus generate a list of words that were used more in one period than another, even when they don't know what the specific words might be. For example, the corpus can generate lists of verbs in the 1970s-2000s compared to the 1930s-1960s (download, recycle // effectuate, redound), adjectives in the 1970s-2000s and the 1930s-1960s (online, affordable // leftist, communistic), or -ly adverbs in the 1900s and the 1800s (basically, reportedly // despondingly, sportively).
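The period comparison in the last bullet can be sketched roughly as follows. This is an assumption about one plausible approach (a ratio of relative frequencies with add-one smoothing), not a description of how COHA actually ranks words. The function name `distinctive_words`, its parameters, and the toy token lists are hypothetical.

```python
from collections import Counter


def distinctive_words(tokens_a, tokens_b, min_count=2, top=5):
    """Rank words that are relatively more frequent in period A than in B.

    Score = (relative frequency in A) / (smoothed relative frequency in B).
    Add-one smoothing keeps words that never occur in B from dividing by zero.
    """
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scored = []
    for word, n in count_a.items():
        if n < min_count:
            continue  # skip very rare words, whose ratios are unstable
        rate_a = n / size_a
        rate_b = (count_b[word] + 1) / (size_b + 1)
        scored.append((rate_a / rate_b, word))
    # Highest ratio first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:top]]


# Toy token lists, invented for illustration only.
later = "download the file then recycle the box and download again recycle".split()
earlier = "the rule will redound to the credit of the firm".split()

print(distinctive_words(later, earlier, top=2))
```

On real data one would also restrict the candidates by part of speech (verbs, adjectives, -ly adverbs), which is possible here because the corpus is tagged and lemmatized.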

As can be seen, the corpus allows research on a wide range of phenomena in 400 million words of text from the last two centuries of American English. The corpus is freely available, and we invite you to use it for your research and teaching.

1 comment:

Alston Gray said...

What a great work! Thanks to all who invested their time, money and talents to complete such a monumental work. There is no way any individual could research phrases to this extent without the help of technology and people willing to invest their time. Congratulations to all who participated in making this available to all students interested in learning about American English and its history.