Culturomics: Hacking The Library of Babel

Hacking the library

In his brilliant short story "The Library of Babel," Argentine writer Jorge Luis Borges imagined the universe as an infinite library that contains all books and is perhaps overseen by an elusive Librarian. Of course, the Library inspired both hope and despair:

When it was proclaimed that the Library contained all books, the first impression was one of extravagant happiness. All men felt themselves to be the masters of an intact and secret treasure. There was no personal or world problem whose eloquent solution did not exist in some hexagon. The universe was justified, the universe suddenly usurped the unlimited dimensions of hope….

As was natural, this inordinate hope was followed by an excessive depression. The certitude that some shelf in some hexagon held precious books and that these precious books were inaccessible, seemed almost intolerable. A blasphemous sect suggested that the searches should cease and that all men should juggle letters and symbols until they constructed, by an improbable gift of chance, these canonical books.

The folks at Google are constructing a digitized library of Babel and now, working with researchers from around the world, they have given the rest of us a tool to juggle letters and symbols as a way to probe the historical mysteries of this library. The details of what they are calling culturomics are published in an article, "Quantitative Analysis of Culture Using Millions of Digitized Books," in the current issue of Science. The abstract explains:

We constructed a corpus of digitized texts containing about 4% of all books ever printed [5,195,769 digitized books].  Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

So what is culturomics good for? The researchers found the number of English words is increasing:

…we estimated the number of words in the English lexicon as 544,000 in 1900, 597,000 in 1950, and 1,022,000 in 2000…

That inventions are adopted more rapidly:

We divided a list of 154 inventions into time-resolved cohorts based on the forty-year interval in which they were first invented (1800-1840, 1840-1880, and 1880-1920). We tracked the frequency of each invention in the nth year after it was invented as compared to its maximum value, and plotted the median of these rescaled trajectories for each cohort.

The inventions from the earliest cohort (1800-1840) took over 66 years from invention to widespread impact (frequency >25% of peak). Since then, the cultural adoption of technology has become more rapid: the 1840-1880 invention cohort was widely adopted within 50 years; the 1880-1920 cohort within 27.
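The procedure in that passage (rescale each invention's frequency curve to its own peak, take the cohort median year by year, then find when the median first passes 25% of peak) can be sketched roughly as follows. This is only an illustration: the function names and the three short series are made up here, not taken from the paper or its data.

```python
from statistics import median


def rescale_to_peak(freqs):
    """Divide a yearly frequency series by its maximum, so the peak is 1.0."""
    peak = max(freqs)
    return [f / peak for f in freqs]


def cohort_median_trajectory(series_list):
    """Median of the rescaled series at each year n after invention."""
    rescaled = [rescale_to_peak(s) for s in series_list]
    length = min(len(s) for s in rescaled)
    return [median(s[n] for s in rescaled) for n in range(length)]


def years_to_threshold(trajectory, threshold=0.25):
    """First year n in which the trajectory exceeds the threshold, else None."""
    for n, value in enumerate(trajectory):
        if value > threshold:
            return n
    return None


# Hypothetical cohort: index 0 is the year of invention (made-up numbers).
cohort = [
    [0.0, 1.0, 2.0, 4.0, 8.0, 6.0],
    [0.0, 0.0, 1.0, 3.0, 5.0, 10.0],
    [1.0, 2.0, 2.0, 4.0, 4.0, 3.0],
]

median_traj = cohort_median_trajectory(cohort)
lag = years_to_threshold(median_traj)
print(median_traj, lag)
```

With real data, each series would hold one frequency value per year since the invention, and the lag would be read off in years.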

That fame is faster and more fleeting:

Fame comes sooner and rises faster: between the early 19th century and the mid-20th century, the age of initial celebrity declined from 43 to 29 years, and the doubling time fell from 8.1 to 3.3 years. As a result, the most famous people alive today are more famous – in books – than their predecessors. Yet this fame is increasingly short-lived: the post-peak half-life dropped from 120 to 71 years during the nineteenth century.
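For readers unfamiliar with the terms, a doubling time and a half-life can both be estimated from two frequency readings if one assumes steady exponential growth or decay between them. The sketch below shows that arithmetic only. The helper names and sample numbers are hypothetical, and the researchers may have fitted their curves differently.

```python
import math


def doubling_time(f_start, f_end, years):
    """Years needed for a frequency to double, assuming exponential growth."""
    return years / math.log2(f_end / f_start)


def half_life(f_start, f_end, years):
    """Years needed for a frequency to halve, assuming exponential decay."""
    return years / math.log2(f_start / f_end)


# Made-up readings: a name whose frequency quadruples in 10 years,
# and one whose frequency falls to an eighth of its peak in 213 years.
print(doubling_time(1.0, 4.0, 10))
print(half_life(8.0, 1.0, 213))
```

A shorter doubling time means a steeper rise to fame, and a shorter half-life means a faster fade after the peak.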

The library can be probed to uncover episodes of censorship:

We probed the impact of censorship on a person's cultural influence in Nazi Germany. Led by such figures as the librarian Wolfgang Hermann, the Nazis created lists of authors and artists whose "undesirable", "degenerate" work was banned from libraries and museums and publicly burned. We plotted median usage in German for five such lists: artists (100 names), as well as writers of Literature (147), Politics (117), History (53), and Philosophy (35). We also included a collection of Nazi party members [547 names]. The five suppressed groups exhibited a decline. This decline was modest for writers of history (9%) and literature (27%), but pronounced in politics (60%), philosophy (76%), and art (56%). The only group whose signal increased during the Third Reich was the Nazi party members [a 500% increase].

If you want to explore the new library of Babel, Google has made available its new data visualization tool, the Google Books Ngram Viewer. For example, see the Ngram for the frequency of the appearance of the word "libertarian" between 1920 and 2008:

The rise of Reason

Here is an Ngram for the phrase "global warming."

Faster than the temperature
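What such a graph plots is, roughly, how often a word or phrase appears in each year's books relative to everything else printed that year. A minimal sketch of that idea on a toy corpus follows. The corpus, the function names, and the choice to divide by the number of same-length n-grams are all assumptions made for illustration; the actual viewer's tokenization, normalization, and smoothing may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined back into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Share of each year's n-grams that match the phrase (case-insensitive)."""
    n = len(phrase.split())
    result = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[phrase.lower()] / total if total else 0.0
    return result


# Hypothetical two-year corpus; real inputs would be millions of scanned books.
toy_corpus = {
    1990: ["the climate is changing", "global trade is growing"],
    2000: ["global warming is a concern", "talk of global warming grows"],
}

freq = yearly_frequency(toy_corpus, "global warming")
print(freq)
```

Dividing by each year's total matters because far more books are published in later years, so raw counts alone would make almost every term appear to rise.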

  1. Borges is truly entertaining and deserves to be better known. If you don’t know it, check out the short story The Cult of the Phoenix for some masterful “crude humor”.

    Of course, Google can be entertaining, too …

    1. Rich: I quite agree. I had the pleasure of hearing him lecture in New York back in the 1980s.

  2. Damn, it looks like we’ve already peaked.

  3. mr simple: Actually a lot of terms I’ve tried, e.g. sex, also appear to “peak” at the end of the series. I wonder if this has something to do with migration to new media?

  4. That global warming graph looks suspiciously like a hockey stick…

  5. “No homo” is quite widely distributed. Who would have thought?

  6. What about “oven dodgers”? How does that rate?

    1. Get your hand out of Jodi Foster’s Beaver Mel.

  7. but how will this hell help confound Leviathan? the Last Best Hopes sugartits are in the wringer & all the best cults is culled off & graphs-GRapHs!!! & the Bellamy salute is fighting w/ the phrase “under God” in my pissy assbrain & how come the preacha man has ta have hisself a nice house? thanks to whosoever turned me on to the order of the stick t’other week! it’s all in the 14th Depository…

  8. “As was natural, this inordinate hope was followed by an excessive depression.”

    Hope and Change!


    74
    If you realize that all things change,
    there is nothing you will try to hold on to.
    If you aren’t afraid of dying,
    there is nothing you can’t achieve.

    Trying to control the future
    is like trying to take the master carpenter’s place.
    When you handle the master carpenter’s tools,
    chances are that you’ll cut your hand.

    1. Trying is the first step toward failure.

  9. Interesting, but I’m baffled by that excerpt from Jorge Luis Borges. What’s the connection between libraries and hexagons?

    “The universe was justified,…”

    Brain fart.

    1. See Borges’ short story “The Library of Babel”. There’s a (possibly-illegal) copy up at http://jubal.westnet.com/hyper…..abel.html. The Library contains every possible book, where a book is a sequence of 1,312,000 characters drawn from a set of 25 characters.

      What this means is that, if the Google statistics were applied to the actual Library of Babel, all the graphs would be straight lines. Every word, phrase, or letter sequence occurs exactly the same number of times.

  10. I graphed liberty from 1500 to 2008. It’s pretty depressing.

    1. I graphed Jesus, God, Joseph of Nazareth, and Mary mother of Jesus. Jesus is on the rise, but his dad is on the decline. His other dad, Joseph, seems to be undefined. Same with Mary.

      I don’t think what Google is doing here is technologically innovative. With a relational database, Wavemaker, and an Internet connection you can do the same thing in your mom’s basement.

      What would be culturally innovative is for Google to give you access to the raw data (readonly is OK) and let you see and adjust the SQL query that invokes the dataset that is being plotted.

  11. Those of us who have been interested in digital humanities for a while — before it was even realistically doable — are very excited about this. I would be even more excited if I could get patterns of word distributions within texts. Could that be next, please, Google?

  12. Borges is truly entertaining and deserves to be better known.

    He couldn’t possibly be any better known than he is. From his era, how many writers are more famous? A handful? Not even. He’s probably tied for second place.

    A few thousand people still read his books. That’s the serious-literature-guy equivalent of being, like, Metallica. He’s huge.

  13. haha, your mom.

    http://ngrams.googlelabs.com/g…..moothing=3

  14. This immediately brought to mind the librarian from All Our Yesterdays (at about 45 secs in).
