Culturomics: Hacking The Library of Babel

In his brilliant short story "The Library of Babel," Argentine writer Jorge Luis Borges imagined the universe as an infinite library containing all books, perhaps overseen by an elusive Librarian. Of course, the Library inspired both hope and despair:

When it was proclaimed that the Library contained all books, the first impression was one of extravagant happiness. All men felt themselves to be the masters of an intact and secret treasure. There was no personal or world problem whose eloquent solution did not exist in some hexagon. The universe was justified, the universe suddenly usurped the unlimited dimensions of hope….

As was natural, this inordinate hope was followed by an excessive depression. The certitude that some shelf in some hexagon held precious books and that these precious books were inaccessible, seemed almost intolerable. A blasphemous sect suggested that the searches should cease and that all men should juggle letters and symbols until they constructed, by an improbable gift of chance, these canonical books.

The folks at Google are constructing a digitized library of Babel and now, working with researchers from around the world, they have given the rest of us a tool to juggle letters and symbols as a way to probe the historical mysteries of this library. The details of what they are calling culturomics are published in an article, "Quantitative Analysis of Culture Using Millions of Digitized Books," in the current issue of Science. The abstract explains:

We constructed a corpus of digitized texts containing about 4% of all books ever printed [5,195,769 digitized books].  Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

So what is culturomics good for? The researchers found the number of English words is increasing:

…we estimated the number of words in the English lexicon as 544,000 in 1900, 597,000 in 1950, and 1,022,000 in 2000…

That inventions are adopted more rapidly:

We divided a list of 154 inventions into time-resolved cohorts based on the forty-year interval in which they were first invented (1800-1840, 1840-1880, and 1880-1920). We tracked the frequency of each invention in the nth year after it was invented as compared to its maximum value, and plotted the median of these rescaled trajectories for each cohort.

The inventions from the earliest cohort (1800-1840) took over 66 years from invention to widespread impact (frequency >25% of peak). Since then, the cultural adoption of technology has become more rapid: the 1840-1880 invention cohort was widely adopted within 50 years; the 1880-1920 cohort within 27.
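For readers who want to see the mechanics, the procedure quoted above can be sketched in a few lines of Python. This is only an illustration of the described steps (rescale each invention's yearly frequency by its own peak, take the median across a cohort, then find the first year the median exceeds 25% of peak). The function names and the toy numbers are made up for this example and are not taken from the paper or its data.

```python
from statistics import median


def rescale(series):
    """Divide each yearly frequency by the series' own peak value."""
    peak = max(series)
    return [f / peak for f in series]


def median_trajectory(cohort):
    """Median rescaled frequency across a cohort, for each year n since invention."""
    rescaled = [rescale(s) for s in cohort]
    length = min(len(s) for s in rescaled)  # compare only the years every series covers
    return [median(s[n] for s in rescaled) for n in range(length)]


def years_to_threshold(trajectory, threshold=0.25):
    """First year n in which the trajectory exceeds the threshold, or None."""
    for n, value in enumerate(trajectory):
        if value > threshold:
            return n
    return None


# Hypothetical toy cohort: three "inventions", five years of frequencies each.
toy_cohort = [
    [0.0, 1.0, 2.0, 4.0, 8.0],
    [0.0, 0.5, 3.0, 6.0, 6.0],
    [1.0, 1.0, 1.0, 2.0, 10.0],
]

trajectory = median_trajectory(toy_cohort)
lag = years_to_threshold(trajectory)
print(trajectory, lag)
```

With real data, each inner list would be one invention's yearly Ngram frequency starting from its invention year, and the lag would be the adoption figure reported for that cohort.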

That fame is faster and more fleeting:

Fame comes sooner and rises faster: between the early 19th century and the mid-20th century, the age of initial celebrity declined from 43 to 29 years, and the doubling time fell from 8.1 to 3.3 years. As a result, the most famous people alive today are more famous – in books – than their predecessors. Yet this fame is increasingly short-lived: the post-peak half-life dropped from 120 to 71 years during the nineteenth century.
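The "doubling time" and "half-life" in that passage are the usual exponential measures. The excerpt does not say how the researchers estimated them, so the following is only a rough sketch that assumes a name's frequency grows or decays exponentially between two observations. The helper names and sample values are hypothetical.

```python
import math


def doubling_time(f_start, f_end, years):
    """Years needed to double, assuming exponential growth from f_start to f_end."""
    return years * math.log(2) / math.log(f_end / f_start)


def half_life(f_start, f_end, years):
    """Years needed to halve, assuming exponential decay from f_start to f_end."""
    return years * math.log(2) / math.log(f_start / f_end)


# Hypothetical numbers: a frequency that quadruples in 10 years doubles every 5,
# and one that falls to an eighth in 30 years halves every 10.
rise = doubling_time(1.0, 4.0, 10)
fall = half_life(8.0, 1.0, 30)
print(rise, fall)
```

On this reading, a drop in doubling time from 8.1 to 3.3 years means a name's frequency in books climbs more than twice as fast during its rise.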

The library can be probed to uncover episodes of censorship:

We probed the impact of censorship on a person's cultural influence in Nazi Germany. Led by such figures as the librarian Wolfgang Hermann, the Nazis created lists of authors and artists whose "undesirable", "degenerate" work was banned from libraries and museums and publicly burned. We plotted median usage in German for five such lists: artists (100 names), as well as writers of Literature (147), Politics (117), History (53), and Philosophy (35). We also included a collection of Nazi party members [547 names]. The five suppressed groups exhibited a decline. This decline was modest for writers of history (9%) and literature (27%), but pronounced in politics (60%), philosophy (76%), and art (56%). The only group whose signal increased during the Third Reich was the Nazi party members [a 500% increase].

If you want to explore the new library of Babel, Google has made available its new data visualization tool, the Google Books Ngram Viewer. For example, see the Ngram for the frequency of the appearance of the word "libertarian" between 1920 and 2008:

Here is an Ngram for the phrase "global warming."