Tim Cavanaugh | September 4, 2009
This week's view from
inside the gift horse's mouth comes from linguist Geoff Nunberg, who assesses the mixed results
of Google Books. His focus is on metadata giving publication dates
for particular editions. These dates are crucial for doing some
pretty interesting research:
Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri.
Unfortunately, says Nunberg, Google Books' metadata are "a mish-mash wrapped in a muddle wrapped in a mess." Some pretty amusing examples:
Start with dates. To take GB's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux' La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams' Culture and Society, Robert Shelton's biography of Bob Dylan, Fodor's Guide to Nova Scotia, and the Portuguese edition of the book version of Yellow Submarine, to name just a few.
It's never enough just to point out the problems, though. Nunberg must envision a futuristic nightmare in which misdating of Stephen King novels leads to anarchy:
This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now - Google? UNESCO? Wal-Mart? - these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.
I'm a historical-error buff, fully confident my descendants will believe that Hillary Clinton was prime minister of America during World War II and that the punk rock episode of Quincy M.E. is an unimpeachable historical record. But even I have a hard time thinking that a hundred years from now, in a culture of teleportation and Smellivision and neurally implanted broadband connections, scholars will still be at the mercy of uncorrected metadata Google has put together over the past few years.
It's true there's no Moore's Law for capture. But there's also a finite body of print and ink to be captured. Has Google been studiously destroying all the originals after scanning them? Who's to say libraries and book collectors all over the world won't get into the act, now that it's clear that a project like this can be done? And even if (despite its statements to the contrary) Google proves unwilling to make corrections when Nunberg and other good citizens point out errors, isn't it probable, or inevitable, that somebody has already mirrored the whole Google Books project and will be able to create a more accurate data set?
Far be it from me to say no to a little Google Hate. But my initial experiences with Google Books have led me to say nothing but "Thanks for this good if imperfect thing that never existed before in human history." I mean, really, of all the things to worry about going into commie Labor Day Weekend.
And yet, I just know I will be worrying about bad metadata all weekend...
Can we observe the way happiness replaced felicity in the
seventeenth century, as Keith Thomas suggests? When did "the United
States are" start to lose ground to "the United States is"? How did
the use of propaganda rise and fall by decade over the course of
the twentieth century?
Who cares? What are you, a fag?
It's a good thing encyclopedia writers 100 years ago didn't get anything wrong - we'd never know what's what!
My plan is not to be president; it is to be erroneously believed by history to have been the president.
Frito +1
Also, who the hell cares about meta-data? The EU has poured
millions into NEPOMUK semantic social desktop crap, but it turns
out that the word 'meaning' is not an adjective.
My writer buddy, JT, began uploading his books as they were
published, beginning in May 2009, with Google Books caution that it
can take two months to become browseable. A few days ago all three
were up and readable, including the most recent from last
month.
Looks like Google has more issues going on than crappy metadata.
Need to have him check that too.
"Less bad is not good. Less bad is not good. Less bad is better
than more bad, but it is not good."
- Vice President Joe Biden
Was he speaking of Google or Yahoo there?
No good deed goes unpunished. Google creates a huge library where none existed before, and all most people can do is whine.
Just to be clear here, I really doubt that Google is somehow
putting the copyright pages of these scanned books into Photoshop
and changing the dates - so the correct data is there if you aren't
too lazy to read the damn pages.
It sounds like this guy is complaining that if he searches
using the Google tool, he gets some unusual results. And
that's a lame complaint. If your library goes from having 100,000
books to having every book ever printed, it doesn't really matter
if a few of the index cards in the card catalogue are out of order
- you still have much more useful information than you had
before.
But maybe I'm biased because I search by subject and not by date,
and simply having access to a shitload of works from British
university presses that would never have been available to me
before is so awesome to me that it completely blows out of the
water any question of whether or not the search tool is
perfect.
You mean the punk rock episode of Quincy M.E. is not an unimpeachable historical record???
Wow - talk about life imitating art. This whole Google books scenario sounds like it was lifted right from the pages of Vernor Vinge's novel Rainbows End. Read it for a dystopian view of how this might end.
There's no Moore's Law for capture, and nobody is ever going to scan most of these books again.
Tim already refuted this, but I can imagine a sufficiently advanced
nanotechnology producing a host of invisibly tiny bookworms to
quietly crawl over the planet and copy, in molecular detail, all
permanent records of human knowledge, whether it be stored
optically, magnetically, on paper, in cuneiform, woven into knotted
strings, or scratched into the stall divider of a bus station men's
room.
What are you, a fag?
I always wondered why I gravitated toward a linguistics degree.
Fluffy - you're right, of course.
I still wish Google would pay me to catalog the collection
properly.
I only use the 1950 Encyclopedia Britannica. Also the subsequent ones are just a bunch of anti-Stalinist crap.
"but I can imagine a sufficiently advanced nanotechnology
producing a host of invisibly tiny bookworms to quietly
crawl"
Yeah, I saw those when I stopped drinking. Not invisibly tiny
enough, believe me.
Was he speaking of Google or Yahoo there?
He was speaking of Bush. He's still wrong.
BB,
Tim already refuted this, but I can imagine a sufficiently
advanced nanotechnology producing a host of invisibly tiny
bookworms to quietly crawl over the planet and copy, in molecular
detail, all permanent records of human knowledge, whether it be
stored optically, magnetically, on paper, in cuneiform, woven into
knotted strings, or scratched into the stall divider of a bus
station men's room.
Was that from a Neal Stephenson book?
I still wish Google would pay me to catalog the collection properly.
I'm surprised they haven't subscribed to OCLC, and gotten copy
catalog records.
Oh, and I don't know why Nunberg assumes no one will ever
re-scan books. Books that get a lot of use are almost certain to be
either re-scanned somewhere else - after all, gmail was down for 1
1/2 hours. Heavily used books will have duplicates.
Also, a lot of books have already been "scanned" into microfilm or
fiche, but haven't been converted to a digital format, because of
copyright issues.
Oh, and I don't know why Nunberg assumes no one will ever
re-scan books. Books that get a lot of use are almost certain to be
either re-scanned somewhere else - after all, gmail was down for 1
1/2 hours. Heavily used books will have duplicates.
In defense of my friend Geoff's article let me point out that he's
saying that Google's book scanning has deficiencies when being used
for specific scientific purposes. The fact that the dating is
screwed up makes the tool suspect for doing
historical research. If you can't trust the date
that Google returns for a manuscript you can't use that date as
evidence for something (such as the first use in print of a
particular word). Geoff's complaint is actually old news among
linguists who study the history of the English language. It looked
like it would be an incredible tool, but is turning out to be very
dicey. And precisely because the manuscripts being searched are
not the ones that get a lot of use, but rather the ones
that nobody has looked at in a hundred years, they are very
unlikely ever to be scanned again (who would want to once it's
done?). That's his point. Once the actual physical scanning has
been done, it's the scans that will be reprocessed, since it's
unlikely libraries will be happy with the books (especially the old
ones) being manhandled a second time. And for simply 'looking up
stuff' that's OK. But for using the manuscripts as historical
artifacts, it's not. It's like getting the carbon dating or
stratigraphy wrong in an archeological site before burying the
objects again.
Is this competition for The Google?
Israeli researchers said yesterday they are developing a computer program that will make ancient documents more legible and easily indexed. The program, which is being developed by a team of computer scientists and historians at Ben-Gurion University of the Negev, would make ancient texts that have faded, smudged or been written over easier to read. Jihad El-Sana, a researcher on the project, explained that the program will be able to determine which documents are original through a process called writer identification. "We are developing a kind of technology to enhance documents' visual properties for two reasons: to make them easier to read and because we want to archive and index them," El-Sana said. (AP)
http://www.haaretz.com/hasen/spages/1112386.html
Same page has a blurb about a Hezbollah chemical weapons stockpile
blowing up in south Lebanon last month.
It's true there's no Moore's Law for capture. But there's
also a finite body of print and ink to be captured.
But it's really not true that there's no Moore's Law for capture.
It's very likely that improved software will be re-run on the
scanned images to produce much better electronic texts than the
first time around.
How did the use of "propaganda" rise and fall by
decade
It fell but never really disappeared. It became cable news. And
blogs.
Suki -
I'm not familiar enough with Stephenson's works to say, but it
certainly sounds like something he would dream up. I do know of a
similar, more modest proposal from Ray Kurzweil, the Document and
Image Storage Invention, or "DAISI". It would have been capable of
reading many obsolete data formats and translating them to a
durable storage medium. Supposedly he tinkered with the idea for
some years but never really perfected it. There's a short podcast
about it here.
Nipplemancer,
It doesn't really matter. I'd like to confuse history by having
three terms, though.
"Less bad is not good. Less bad is not good. Less bad is
better than more bad, but it is not good."
- Vice President Joe Biden
Is this an actual quote? A Google search (ironic, I know) turns up
all of one hit: this very thread.
This is very similar to the Google - United Airlines - "most
viewed" - bankruptcy thing that sent a different set of dorks into
orbit last year. When you are handling mountainous fuckloads of
data, if the dates are missing or are unconventionally placed or
formatted, you are going to miss some of them until you account for
all of the exceptions (until new exceptions are created).
The only stupid thing that Google did, and it was quite stupid, was
to default to 1899 when extraction failed.
When did "the United States are" start to lose ground to
"the United States is"?
About the time the hostile takeover of the Constitution was
completed, leaving us with a coercive powerful federal government
instead of a loose amalgamation of 50 states.
I'm gonna take a wild guess and say it picked up steam right around
the Civil War, when that bastard Lincoln took away the right to
secede if a given state took issue with how the feds were acting.
You know, the war that changed "THESE united states" into "THE
United States" (also note the change in capitals.)
/rant
prolefeed, both the U and the S in United States are capitalized in the Constitution itself. The S actually doesn't prove anything, since nouns were always capitalized according to the conventions of the time, but the U does signify that the entire name was considered a proper noun.
quoted section from WSJ. It's a retarded statement. The sad
thing is people, by their very nature, get what he means and not
how stupid such a statement is.
"I want to be clear about something: Less bad is not good," Vice President Joe Biden said. "That's not how President Obama and I measure success."
Aren't those bookworms a Mycroft invention? Instead of nanotechnology, I think it was bioengineering.
prolefeed,
We didn't have 50 States until you and SugarFree were in high
school.
;)
Tom,
I transcribed it as I heard it on the radio. They replayed it
several times so I know I have it correct as he spoke it.
Might try YouTube instead of text.
BlueBook,
My writer buddy, JT, comes up with stuff like that. He isn't into
much of the nano stuff. I think he is going to be using carbon
nanotube bundles in the arms of Suki and John's competition recurve
bows. Mentions some nano-sensor stuff in the cars, but mostly it is
that the brains and storage of computing devices have gotten predictably
more powerful and smaller.
When did "the United States are" start to lose ground to "the United States is"?
About the time the hostile takeover of the Constitution was completed, leaving us with a coercive powerful federal government instead of a loose amalgamation of 50 states.
How cute it is watching you guys worship the Federalist
Constitution as some bastion of liberty. When your descendants
cling to the words of the PATRIOT Act as if it was some guarantee
of freedom, you'll know how I feel.
"Antifederalist" has traditionally been the catch-all term to
include people who prefer States to be dissolved and States to
become/remain completely sovereign.
I would assume you subscribe to the latter, but that's not how it's
usually used.
(FWIW, personally I prefer the original AoC.)
>>>When did "the United States are" start to lose
ground to "the United States is"?
About the time the hostile takeover of the Constitution was
completed, leaving us with a coercive powerful federal government
instead of a loose amalgamation of 50 states.
Americans have always had a penchant for singularizing collective
nouns -- unlike the British, who are content to say Oasis have
broken up and Tottenham are having a great Premier League start.
(Unless the speaker is an Oasis fan and a Tottenham foe, of
course.)
I could be completely talking out my ass, but the transformation to
"United States is" might have been political only in the sense that
it was part of the broader (and deliberate) evolution away from
British conventions.
Thorny devils are obligate ant specialists, eating virtually
nothing else. They will consume several species of ants, but are
especially partial to very small Iridomyrmex ants, especially
Iridomyrmex flavipes. Feeding rates have been estimated at from 24
to 45 ants per minute. Occasional objects such as small stones,
sticks, tiny flowers and small insect eggs are also ingested --
these are probably objects being carried by ants and are eaten only
accidentally. Large numbers of ants are eaten per meal by an
individual thorny devil (estimates range from 675 to 1000-1500 to
2500)
Fecal pellets of thorny devils are very distinctive: black, glossy,
perfect prolate spheroids. These are often found in neat piles
either in the open or amongst sparse vegetation. Individuals have
specific defecation sites, separate from their basking and feeding
sites. Tracks and accumulations of fecal matter indicate that
thorny devils often return to such spots several days in
succession.
Water Uptake
Thorny devils have a hygroscopic system of grooves in their skin
that lead to the corners of their mouth. Bentley and Blumer (1962)
showed that thorny devils take up water by means of capillary
action via these grooves. Thorny devils use a gulping oral
mechanism to move water along the grooves and into their mouths.
Thorny devils can actually drink water from dew that falls on their
backs and they can gain as much as a gram of water in a
rainstorm.
Holy crap. I've referred to the United States as a singular
entity without even thinking about it until now.
Ok, from now on, plural it is. The United States are
saddled with a bloated federal government.
-jcr
Hezbullah had a big old chemical weapons cache blow up in Lebanon last month, if anybody cares about that sort of thing.
Hezbullah had a big old chemical weapons cache blow up in Lebanon last month
Maybe they were trying to get down with the Chemical Weapons Convention. "Hey, that's not how you destroy them!"
Eric,
I had the same thought about Rainbows End. That was
really the most awesome part of the book, with giant blowers and
shredders dismantling the library while scanners and supercomputers
reconstructed the scraps electronically. Given how the Human Genome
Project ended up operating, it could almost work...
Suki | September 5, 2009, 10:22pm | #
These long weekends suck here. How do we get a new
topic?
Well, we can discuss thorny devils now that Zoology Saturday's
brought them up. If they gather together, who knows what kind of
thorny dark magic they can wreak? Does a permanent gathering of
thorny devils constitute Pandemonium? If devils shit black fecal
pellets, what do angels excrete? And how does this compare with the
excrement of Kim Jong Il?
And how does this compare with the excrement of Kim Jong Il?
Your mission: collect a stool sample from Kim Jong Il.
Dear Leader does not excrete stool; Dear Leader gives the gift of fertilizer to the People.
...collect a stool sample from Kim Jong Il.
That should be easy enough. Just cut off a piece from anywhere.
"Google Books" is the name of a singular application. The
headline ought to be "Books IS a load of crap."
That one almost got by me. Restayas, too.
There's actually been a lot of good discussion on these problems from the Google people. See the following post from the Language Log blog and the comments from the Google Catalog team leader below: http://languagelog.ldc.upenn.edu/nll/?p=1701