Tuesday 22 November 2011

Google n-grams and translation

I recently came across Google N-grams, and found myself wondering whether they could be used for translation-related research.

As the user guide explains, Google N-grams is a tool which can be used to interrogate a corpus drawn from Google Books and spanning just over two centuries, from 1800 to 2008. The corpus is rather large: 5.2 million books, which we are told constitutes roughly 4% of all the books ever published.

There are several different language corpora: American English, British English, other English corpora, Chinese, French, German, Spanish, Hebrew and Russian. In another useful user guide to research using this tool, someone has christened the method 'culturomics' (not the loveliest word I have heard this year). (On the other hand, for a succession of completely lovely words, see this clip of Stephen Fry being delectable.)

But I digress. One could begin by examining synonyms whose frequency has changed over time. If we look up 'poetical' and 'poetic', for instance, the general English corpus (with a smoothing of 3) shows 1880 as the year when 'poetical' gives way in usage to 'poetic':



If we look up the same word pair in the British English corpus, we get a sense that 'poetical' hung on a bit longer in the UK, just about into the twentieth century, before being overtaken:
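
For anyone who would rather work from numbers than squint at a chart, the crossover year can be estimated with a few lines of Python. This is only a minimal sketch: it assumes you already have yearly relative frequencies for each word as plain lists, and that a smoothing of 3 means averaging each year with up to three years on either side. The function names and the sample figures are my own inventions for illustration, not real corpus data.

```python
def smooth(values, window=3):
    """Centred moving average: each point is averaged with up to
    `window` neighbours on each side (fewer at the edges)."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def crossover_year(years, older_word, newer_word, window=3):
    """Return the first year in which the smoothed frequency of
    `newer_word` rises above that of `older_word`, or None if it never does."""
    s_old = smooth(older_word, window)
    s_new = smooth(newer_word, window)
    for i in range(1, len(years)):
        if s_new[i - 1] <= s_old[i - 1] and s_new[i] > s_old[i]:
            return years[i]
    return None


# Invented placeholder figures, NOT real n-gram frequencies.
sample_years = [1, 2, 3, 4, 5]
sample_old = [5, 5, 5, 5, 5]
sample_new = [1, 2, 4, 8, 9]
print(crossover_year(sample_years, sample_old, sample_new, window=1))
```

Running the same function over the general English and British English series would give two crossover years to compare directly.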



If we consider neologisms and when they entered the language, we learn that 'sniper' took off with the First World War:




'Translatress', on the other hand, gave it the old college try, but never really took off at all (cf. the very tiny numbers on the vertical axis):



This is a tool which encourages competition. Comparing the three great names of medieval Italian literature (Dante, Petrarch and Boccaccio), the graph suggests that Petrarch was still Top Italian Poet at the beginning of the nineteenth century, but that some time after 1840 (which coincides with the general publication of Henry F. Cary's popularising translation The Vision in 1844), mentions of Dante become very much more frequent. His fame seems to peak around the turn of the twentieth century:



It would be interesting to track more carefully the appearance of the many translations of Dante and see how close a link there is with spikes on the chart.
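
One rough way to make that comparison less impressionistic might be to measure how much the frequency of a name changes in the years around each translation's publication date. The sketch below is purely illustrative: the function name, the five-year window and the sample numbers are all assumptions of mine, and real publication dates would have to come from a proper bibliography.

```python
def change_after(years, freqs, event_year, span=5):
    """Mean frequency in the `span` years after `event_year` minus the
    mean in the `span` years before it. Returns None if either side
    of the window has no data."""
    before = [f for y, f in zip(years, freqs)
              if event_year - span <= y < event_year]
    after = [f for y, f in zip(years, freqs)
             if event_year < y <= event_year + span]
    if not before or not after:
        return None
    return sum(after) / len(after) - sum(before) / len(before)


# Invented placeholder series, NOT real n-gram data.
sample_years = list(range(10))
sample_freqs = [1, 1, 1, 1, 1, 3, 3, 3, 3, 3]
print(change_after(sample_years, sample_freqs, event_year=4, span=2))
```

A positive value would be consistent with a rise after the translation appeared, though of course it would say nothing about what caused it.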

The results for the three great Ancient tragedians, Euripides, Sophocles and Aeschylus, are even more fun. I always thought they went together like three things that always go together, but the Google Books data suggests that Aeschylus was barely spoken of, for praise or blame, through the nineteenth century, despite intense interest in Ancient Greek literature. He begins to be mentioned more frequently from the turn of the twentieth century, until by 1940 the three dramatists are pretty much going hand in hand (though Euripides remains Top Tragedian).



Something to check against Peter France and Olive Classe, when time permits...

These are very crude readings, ignoring all sorts of important variables, but they seem to suggest that at the very least Google N-grams would provide some useful circumstantial data for translation history research, as well as language research more generally. I would be interested to hear from readers with suggestions of more fun searches to make in the corpus.

(c) Carol O'Sullivan, November 2011
