Books Are A Load of Crap
This week's view from inside the gift horse's mouth comes from linguist Geoff Nunberg, who assesses the mixed results of Google Books. His focus is on metadata giving publication dates for particular editions. These dates are crucial for doing some pretty interesting research:
Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri.
Unfortunately, says Nunberg, Google Books' metadata are "a mish-mash wrapped in a muddle wrapped in a mess." Some pretty amusing examples:
Start with dates. To take GB's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux' La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams' Culture and Society, Robert Shelton's biography of Bob Dylan, Fodor's Guide to Nova Scotia, and the Portuguese edition of the book version of Yellow Submarine, to name just a few.
It's never enough just to point out the problems, though. Nunberg must envision a futuristic nightmare in which misdating of Stephen King novels leads to anarchy:
This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now - Google? UNESCO? Wal-Mart? - these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.
I'm a historical-error buff, fully confident my descendants will believe that Hillary Clinton was prime minister of America during World War II and that the punk rock episode of Quincy M.E. is an unimpeachable historical record. But even I have a hard time thinking that a hundred years from now, in a culture of teleportation and Smellivision and neurally implanted broadband connections, scholars will still be at the mercy of uncorrected metadata Google has put together over the past few years.
It's true there's no Moore's Law for capture. But there's also a finite body of print and ink to be captured. Has Google been studiously destroying all the originals after scanning them? Who's to say libraries and book collectors all over the world won't get into the act, now that it's clear that a project like this can be done? And even if (despite its statements to the contrary) Google proves unwilling to make corrections when Nunberg and other good citizens point out errors, isn't it probable, or inevitable, that somebody has already mirrored the whole Google Books project and will be able to create a more accurate data set?
Far be it from me to say no to a little Google Hate. But my initial experiences with Google Books have led me to say nothing but "Thanks for this good if imperfect thing that never existed before in human history." I mean, really, of all the things to worry about going into commie Labor Day Weekend.
And yet, I just know I will be worrying about bad metadata all weekend…
Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century?
Who cares? What are you, a fag?
Google obviously needs to hire a cataloger. Anybody got Sergei's email address?
It's a good thing encyclopedia writers 100 years ago didn't get anything wrong-we'd never know what's what!
My plan is not to be president; it is to be erroneously believed by history to have been the president.
Frito +1
Also, who the hell cares about meta-data? The EU has poured millions into NEPOMUK semantic social desktop crap, but it turns out that the word 'meaning' is not an adjective.
ProL - 48th president or 1st metapresident?
My writer buddy, JT, began uploading his books as they were published, beginning in May 2009, with Google Books' caution that it can take two months for them to become browseable. A few days ago all three were up and readable, including the most recent from last month.
Looks like Google has more issues going on than crappy metadata. Need to have him check that too.
"Less bad is not good. Less bad is not good. Less bad is better than more bad, but it is not good."
- Vice President Joe Biden
Was he speaking of Google or Yahoo there?
Dogs and cats living together, mass hysteria!
No good deed goes unpunished. Google creates a huge library where none existed before, and all most people can do is whine.
Just to be clear here, I really doubt that Google is somehow putting the copyright pages of these scanned books into Photoshop and changing the dates - so the correct data is there if you aren't too lazy to read the damn pages.
It sounds like this guy is complaining that if he searches using the Google tool, he gets some unusual results. And that's a lame complaint. If your library goes from having 100,000 books to having every book ever printed, it doesn't really matter if a few of the index cards in the card catalogue are out of order - you still have much more useful information than you had before.
But maybe I'm biased because I search by subject and not by date, and simply having access to a shitload of works from British university presses that would never have been available to me before is so awesome to me that it completely blows out of the water any question of whether or not the search tool is perfect.
You mean the punk rock episode of Quincy M.E. is not an unimpeachable historical record???
Wow - talk about life imitating art. This whole Google books scenario sounds like it was lifted right from the pages of Vernor Vinge's novel Rainbows End. Read it for a dystopian view of how this might end.
Tim already refuted this, but I can imagine a sufficiently advanced nanotechnology producing a host of invisibly tiny bookworms to quietly crawl over the planet and copy, in molecular detail, all permanent records of human knowledge, whether it be stored optically, magnetically, on paper, in cuneiform, woven into knotted strings, or scratched into the stall divider of a bus station men's room.
I always wondered why I gravitated toward a linguistics degree.
Fluffy - you're right, of course.
I still wish Google would pay me to catalog the collection properly.
I only use the 1950 Encyclopedia Britannica. Also the subsequent ones are just a bunch of anti-Stalinist crap.
"but I can imagine a sufficiently advanced nanotechnology producing a host of invisibly tiny bookworms to quietly crawl"
Yeah, I saw those when I stopped drinking. Not invisibly tiny enough, believe me.
Was he speaking of Google or Yahoo there?
He was speaking of Bush. He's still wrong.
BB,
Tim already refuted this, but I can imagine a sufficiently advanced nanotechnology producing a host of invisibly tiny bookworms to quietly crawl over the planet and copy, in molecular detail, all permanent records of human knowledge, whether it be stored optically, magnetically, on paper, in cuneiform, woven into knotted strings, or scratched into the stall divider of a bus station men's room.
Was that from a Neal Stephenson book?
I'm surprised they haven't subscribed to OCLC, and gotten copy catalog records.
Oh, and I don't know why Nunberg assumes no one will ever re-scan books. Books that get a lot of use are almost certain to be re-scanned somewhere else - after all, gmail was down for 1 1/2 hours. Heavily used books will have duplicates.
Also, a lot of books have already been "scanned" into microfilm or fiche, but haven't been converted to a digital format, because of copyright issues.
Oh, and I don't know why Nunberg assumes no one will ever re-scan books. Books that get a lot of use are almost certain to be re-scanned somewhere else - after all, gmail was down for 1 1/2 hours. Heavily used books will have duplicates.
In defense of my friend Geoff's article let me point out that he's saying that Google's book scanning has deficiencies when being used for specific scientific purposes. The fact that the dating is screwed up makes the tool suspect for doing historical research. If you can't trust the date that Google returns for a manuscript you can't use that date as evidence for something (such as the first use in print of a particular word). Geoff's complaint is actually old news among linguists who study the history of the English language. It looked like it would be an incredible tool, but is turning out to be very dicey. And precisely because the manuscripts being searched are not the ones that get a lot of use, but rather the ones that nobody has looked at in a hundred years, they are very unlikely ever to be scanned again (who would want to once it's done?). That's his point. Once the actual physical scanning has been done, it's the scans that will be reprocessed, since it's unlikely libraries will be happy with the books (especially the old ones) being manhandled a second time. And for simply 'looking up stuff' that's OK. But for using the manuscripts as historical artifacts, it's not. It's like getting the carbon dating or stratigraphy wrong in an archeological site before burying the objects again.
Is this competition for The Google?
http://www.haaretz.com/hasen/spages/1112386.html
Same page has a blurb about a Hezbollah chemical weapons stockpile blowing up in south Lebanon last month.
It's true there's no Moore's Law for capture. But there's also a finite body of print and ink to be captured.
But it's really not true that there's no Moore's Law for capture. It's very likely that improved software will be re-run on the scanned images to produce much better electronic texts than the first time around.
How did the use of "propaganda" rise and fall by decade
It fell but never really disappeared. It became cable news. And blogs.
Suki -
I'm not familiar enough with Stephenson's works to say, but it certainly sounds like something he would dream up. I do know of a similar, more modest proposal from Ray Kurzweil, the Document and Image Storage Invention, or "DAISI". It would have been capable of reading many obsolete data formats and translating them to a durable storage medium. Supposedly he tinkered with the idea for some years but never really perfected it. There's a short podcast about it here.
Nipplemancer,
It doesn't really matter. I'd like to confuse history by having three terms, though.
"Less bad is not good. Less bad is not good. Less bad is better than more bad, but it is not good."
- Vice President Joe Biden
Is this an actual quote? A Google search (ironic, I know) turns up all of one hit: this very thread.
This is very similar to the Google - United Airlines - "most viewed" - bankruptcy thing that sent a different set of dorks into orbit last year. When you are handling mountainous fuckloads of data, if the dates are missing or are unconventionally placed or formatted, you are going to miss some of them until you account for all of the exceptions (until new exceptions are created).
The only stupid thing that Google did, and it was quite stupid, was to default to 1899 when extraction failed.
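For anyone who wants the failure mode spelled out, here is a minimal sketch of what the comment above describes. It is not Google's code: the regex, the function names, and the sample strings are all hypothetical, and the only detail taken from the thread is the claim that a failed extraction ends up dated 1899. The first function hides failures behind a real-looking year; the second returns None so a failed extraction cannot be mistaken for a publication date.

```python
import re
from typing import Optional

# Hypothetical pattern: a four-digit year from 1400 through 2099.
YEAR_PATTERN = re.compile(r"\b(1[4-9]\d{2}|20\d{2})\b")

# Placeholder year, per the claim in the comment above (an assumption here).
SENTINEL_YEAR = 1899


def extract_year_with_sentinel(copyright_text: str) -> int:
    """Return the first plausible year, or the sentinel when none is found."""
    match = YEAR_PATTERN.search(copyright_text)
    return int(match.group(1)) if match else SENTINEL_YEAR


def extract_year(copyright_text: str) -> Optional[int]:
    """Return the first plausible year, or None when none is found."""
    match = YEAR_PATTERN.search(copyright_text)
    return int(match.group(1)) if match else None


# A date printed in Roman numerals defeats the digit-based pattern.
unparseable = "Copyright MCMLXXXIII by the author"
bad = extract_year_with_sentinel(unparseable)   # looks like a real date
good = extract_year(unparseable)                # visibly missing
```

With the second version, a search restricted by date can skip or flag the undated records instead of filing them all under one year.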
When did "the United States are" start to lose ground to "the United States is"?
About the time the hostile takeover of the Constitution was completed, leaving us with a coercive powerful federal government instead of a loose amalgamation of 50 states.
I'm gonna take a wild guess and say it picked up steam right around the Civil War, when that bastard Lincoln took away the right to secede if a given state took issue with how the feds were acting. You know, the war that changed "THESE united states" into "THE United States" (also note the change in capitals.)
/rant
prolefeed, both the U and the S in United States are capitalized in the Constitution itself. The S actually doesn't prove anything, since nouns were always capitalized according to the conventions of the time, but the U does signify that the entire name was considered a proper noun.
quoted section from WSJ. It's a retarded statement. The sad thing is people, by their very nature, get what he means and not how stupid such a statement is.
Aren't those bookworms a Mycroft invention? Instead of nanotechnology, I think it was bioengineering.
prolefeed,
We didn't have 50 States until you and SugarFree were in high school.
😉
Tom,
I transcribed it as I heard it on the radio. They replayed it several times so I know I have it correct as he spoke it.
Might try YouTube instead of text.
BlueBook,
My writer buddy, JT, comes up with stuff like that. He isn't into much of the nano stuff. I think he is going to be using carbon nanotube bundles in the arms of Suki and John's competition recurve bows. Mentions some nano-sensor stuff in the cars, but mostly it is that the brains and storage of computing devices have gotten predictably more powerful and smaller.
How cute it is watching you guys worship the Federalist Constitution as some bastion of liberty. When your descendants cling to the words of the PATRIOT Act as if it was some guarantee of freedom, you'll know how I feel.
"Antifederalist" has traditionally been the catch-all term to include people who prefer States to be dissolved and States to become/remain completely sovereign.
I would assume you subscribe to the latter, but that's not how it's usually used.
(FWIW, personally I prefer the original AoC.)
>>>When did "the United States are" start to lose ground to "the United States is"?
About the time the hostile takeover of the Constitution was completed, leaving us with a coercive powerful federal government instead of a loose amalgamation of 50 states.
Americans have always had a penchant for singularizing collective nouns -- unlike the British, who are content to say Oasis have broken up and Tottenham are having a great Premier League start. (Unless the speaker is an Oasis fan and a Tottenham foe, of course.)
I could be completely talking out my ass, but the transformation to "United States is" might have been political only in the sense that it was part of the broader (and deliberate) evolution away from British conventions.
Thorny devils are obligate ant specialists, eating virtually nothing else. They will consume several species of ants, but are especially partial to very small Iridomyrmex ants, especially Iridomyrmex flavipes. Feeding rates have been estimated at from 24 to 45 ants per minute. Occasional objects such as small stones, sticks, tiny flowers and small insect eggs are also ingested -- these are probably objects being carried by ants and are eaten only accidentally. Large numbers of ants are eaten per meal by an individual thorny devil (estimates range from 675 to 1000-1500 to 2500)
Fecal pellets of thorny devils are very distinctive: black, glossy, perfect prolate spheroids. These are often found in neat piles either in the open or amongst sparse vegetation. Individuals have specific defecation sites, separate from their basking and feeding sites. Tracks and accumulations of fecal matter indicate that thorny devils often return to such spots several days in succession.
Water Uptake
Thorny devils have a hygroscopic system of grooves in their skin that lead to the corners of their mouth. Bentley and Blumer (1962) showed that thorny devils take up water by means of capillary action via these grooves. Thorny devils use a gulping oral mechanism to move water along the grooves and into their mouths. Thorny devils can actually drink water from dew that falls on their backs and they can gain as much as a gram of water in a rainstorm.
Oasis are really British sounding folk, like British^n
These long weekends suck here. How do we get a new topic?
Holy crap. I've referred to the United States as a singular entity without even thinking about it until now.
Ok, from now on, plural it is. The United States are saddled with a bloated federal government.
-jcr
Hezbullah had a big old chemical weapons cache blow up in Lebanon last month, if anybody cares about that sort of thing.
Maybe they were trying to get down with the Chemical Weapons Convention. "Hey, that's not how you destroy them!"
Eric,
I had the same thought about Rainbows End. That was really the most awesome part of the book, with giant blowers and shredders dismantling the library while scanners and supercomputers reconstructed the scraps electronically. Given how the Human Genome Project ended up operating, it could almost work...
Hey, nice Larkin reference in the title.
Suki | September 5, 2009, 10:22pm | #
These long weekends suck here. How do we get a new topic?
Well, we can discuss thorny devils now that Zoology Saturday's brought them up. If they gather together, who knows what kind of thorny dark magic they can wreak? Does a permanent gathering of thorny devils constitute Pandemonium? If devils shit black fecal pellets, what do angels excrete? And how does this compare with the excrement of Kim Jong Il?
Your mission: collect a stool sample from Kim Jong Il.
Dear Leader does not excrete stool; Dear Leader gives the gift of fertilizer to the People.
That should be easy enough. Just cut off a piece from anywhere.
"Google Books" is the name of a singular application. The headline ought to be "Books IS a load of crap."
That one almost got by me. Restayas, too.
There's actually been a lot of good discussion on these problems from the google people. See the following post from the Language Log blog and the comments from the Google Catalog team leader below: http://languagelog.ldc.upenn.edu/nll/?p=1701