Dry As Dust

A Fortean in the Archives


For Good or Evil? Google Books

It's hard to think of a major web-based archive initiative that's been quite so controversial as Google Books. This project - which Google itself announces as an effort to scan and index every book, in every language, ever published, and to make the results freely available to readers over the internet - has enthralled researchers, to whom it promises untold intellectual riches, while sending publishers racing to their lawyers screaming copyright infringement.

It's easy - and I speak here as an author - to see why the industry is worried. Aggressively or badly managed, Google Books plainly does have the potential to impact on publishers' revenues and writers' copyright. If it became possible, for instance, to download or print, free of charge, sections of books, or books in whole, Google would at the very least have succeeded in revolutionising the industry as fundamentally as Gutenberg. Scraping a living from one's writing is an uncertain enough business at the best of times, without worrying about the collapse of the financial underpinnings of a business that's seldom been a beacon of either profit or efficiency.

That said, it has to be admitted that Google has been scrupulous, so far, in protecting and maintaining authors' rights. Out-of-copyright works are available in their entirety (and to be fair there's obviously potential for some mayhem here, given the varying periods for which copyright applies in different territories), but even these can only be read, somewhat laboriously, on screen - not downloaded or sent to print. In-copyright works are presented either with a handful of sample pages (in cases in which the author or publisher has opted in to the scheme) or in the form of what is, essentially, an index - 'Snippet View', they call it, by which is meant a tiny extract, three lines long, centred around the word or phrase one's searching for. Hopefully that's enough for readers to tell whether or not the reference they've found is of value, and to go off and buy the book concerned, or find it in a library - though as things stand Google's automated technology hasn't exactly mastered the art of presenting every one of its results effectively. Snippet View often turns out to be more of a tease than a useful resource.

Google has lined up an impressive list of partners to assist with the mass of work involved in scanning an estimated 32 million books in less than a decade: among them are the Bodleian Library in Oxford, the New York Public Library, and the Universities of Michigan, Princeton and Mysore. And its database is certainly expanding rapidly, at the rate of several thousand books a month, at least. The majority (it's my impression, anyway) are nineteenth-century works of the sort that are often hard to hunt through manually, not least because few books of that period were indexed. That's a bonus for the likes of me.

Whether one thinks that the service is a good or a bad idea, though, really depends on two things: how much time one devotes to research, and how seriously one takes Google's protestations that it will always respect authors' and publishers' rights. As an inveterate researcher, I find few online tools so valuable as Google Books. As a writer with copyrights to protect, I will admit to feeling more than a little impotent as corporations so vast I would have no prospect of challenging their decisions press relentlessly ahead with projects that have the potential to deliver world domination on a scale scarcely dreamed of by Hank Scorpio. Like many other writers, I'm far from convinced by Google's protestations that they're happy to invest an estimated $100 million in a project of this sort without having, somewhere, a plan to make a profit from it all. Google-owned YouTube certainly takes a very cavalier approach to rights, and I can only hope that someone, somewhere, keeps the company honest - at least for the next 95 years or so, until my copyrights expire.

Of course, not every web user is bothered by this side of the argument; many, indeed, aren't even aware that the project exists. Google Book Search results do pop up as part of a regular search with the main Google engine - sort of; that is, the company presently limits the total of book results reported to a maximum of three, and these can easily get lost amid the vast mass of returns generated by an ordinary Google search. To get full value from, and a real feel for, the Google Books project itself, one needs to go to its home page, accessible by clicking on the "more" tab to the top right of the search entry box on the main Google page, or by clicking "Advanced search" and scrolling down to a list of options on the bottom left of the page. Entering the same search terms there, in exactly the same way one would normally do with Google proper, brings up a whole new world of results.

Take the results that come up in a search for my own perennial obsession, Spring-heeled Jack. The main Google search engine produces 89,300 hits, some pointing to the various musical groups that have adopted the name, and the rest to a welter of mostly-dodgy websites that do little but regurgitate the usual half-truths and myths that permeate the subject. In fact, although I years ago instituted a regime of scrolling through the entirety of Google hits every 6 months or so for the exact search "Spring-heeled Jack" (which, if one uses a couple of identifiers such as "-lyrics" and "-tabs" to filter out the music sites, nowadays runs to around 80 pages of results), I feel lucky if I turn up one or two really useful new pieces of information in a year - that from a survey of at least 800 sites. Google Books produces more targeted results with much less hassle. A search on the same exact phrase currently yields only 366 hits, but almost half are of real interest to the Fortean researcher.
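For anyone who would rather assemble such a search in a script than type it out each time, the exact-phrase-plus-exclusions query described above can be sketched in a few lines of Python. This is only an illustration: the function name build_query is hypothetical, and the code does nothing more than put together the text one would type into the search box. It does not contact Google or any other service.

```python
def build_query(phrase, exclude=()):
    """Return an exact-phrase search string with excluded terms.

    The phrase is wrapped in double quotes so it is matched exactly,
    and each excluded term is prefixed with a minus sign.
    """
    parts = ['"{}"'.format(phrase)]
    parts.extend("-" + term for term in exclude)
    return " ".join(parts)


# Hypothetical usage: the search described in the text above.
query = build_query("Spring-heeled Jack", exclude=["lyrics", "tabs"])
print(query)
```

The same string can be pasted into either the main search engine or Google Books, which makes it easy to repeat an identical search every few months and compare what turns up.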

Spring-heeled Jack is, in fact, a pretty good test of the value of the Books project as a whole. The phrase became a generic one, used to describe all manner of ghosts and bogey-figures, in the course of the Victorian era, which means that it makes numerous appearances in books of the period. No researcher, in the normal course of things, could hope to trace such fleeting and elusive references; they crop up unpredictably and are generally un-indexed in the original works themselves. Google Books, though, corrals them in an instant, allowing someone like me to trace the spread and usage of a phrase in a manner that would simply not have been possible a few years ago. And it turns up plenty of new reports that would otherwise have remained unknown as well; among the current hits found by the site are an important new set of letters to Notes and Queries, dating to 1893, that I failed to find despite hand-searching, years earlier, all the printed indexes to that journal; a mid-century legend from Croydon, reported in Harper's Monthly; and a cite from Mayhew's London Labour and the London Poor, volume 3 page 52, that would have been time-consuming, to say the least, to track down by hand even had one known of its existence.

The inevitable downsides do exist, of course. Some concern has been expressed over the reliability of Google's scanning technology. The company has not disclosed precisely how it is digitising books, but it is widely assumed that the process has been somehow automated, using robotic page-turners and digital photography. There have been reports of missing pages and of segments of books so poorly scanned that they are rendered unreadable. I have never encountered examples of either, and I hope the problem is a minor one. But more serious - and far more common in my personal experience - is the difficulty Google Books apparently experiences in correctly identifying text secured from one volume of a long run of journals. For some reason, the system all too often seems to default to either the first year or the last year of a run when reporting periodical results, so that one obtains a cite that gives the correct page number but identifies the wrong volume of what might well be a run of 50 or 100 stout, bound books. When this happens with a volume protected by Snippet View, it's all too often impossible to identify the actual year of a report from the scant clues available, leaving one with no alternative but to check, say, page 353 of every volume of a journal for a 20-, 30- or even 50-year period - a procedure scarcely less tedious than research was in pre-digital days and that can be, in the case of journals not available on open shelves, effectively impossible.

I will, I'm sure, have more to say on Google Books from time to time. For now, it's enough to observe that almost any Fortean researcher will obtain valuable results by searching it, and that the hits obtained on Google Books are of a far higher calibre than those returned by ordinary search engines. My vote is cautiously in favour.
