Books Are A Load of Crap

Has Google Books killed us with faulty metadata?

This week's view from inside the gift horse's mouth comes from linguist Geoff Nunberg, who assesses the mixed results of Google Books. His focus is on metadata giving publication dates for particular editions. These dates are crucial for doing some pretty interesting research:

Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri.

Unfortunately, says Nunberg, Google Books' metadata are "a mish-mash wrapped in a muddle wrapped in a mess." Some pretty amusing examples:  

Start with dates. To take GB's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux' La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams' Culture and Society, Robert Shelton's biography of Bob Dylan, Fodor's Guide to Nova Scotia, and the Portuguese edition of the book version of Yellow Submarine, to name just a few.

It's never enough just to point out the problems, though. Nunberg must envision a futuristic nightmare in which misdating of Stephen King novels leads to anarchy:

This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now - Google? UNESCO? Wal-Mart? - these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.

I'm a historical-error buff, fully confident my descendants will believe that Hillary Clinton was prime minister of America during World War II and that the punk rock episode of Quincy M.E. is an unimpeachable historical record. But even I have a hard time believing that a hundred years from now, in a culture of teleportation and Smellivision and neurally implanted broadband connections, scholars will still be at the mercy of uncorrected metadata Google has put together over the past few years.

It's true there's no Moore's Law for capture. But there's also a finite body of print and ink to be captured. Has Google been studiously destroying all the originals after scanning them? Who's to say libraries and book collectors all over the world won't get into the act, now that it's clear that a project like this can be done? And even if (despite its statements to the contrary) Google proves unwilling to make corrections when Nunberg and other good citizens point out errors, isn't it probable, or inevitable, that somebody has already mirrored the whole Google Books project and will be able to create a more accurate data set?

Far be it from me to say no to a little Google Hate. But my initial experiences with Google Books have led me to say nothing but "Thanks for this good if imperfect thing that never existed before in human history." I mean, really, of all the things to worry about going into commie Labor Day Weekend?

And yet, I just know I will be worrying about bad metadata all weekend…