Dry As Dust

A Fortean in the Archives


Chronicling America

Newspaper digitisation is going to be so important to Fortean researchers that I can see myself touching on the topic frequently for the foreseeable future. Right now, though, it’s enough to note that a project with the potential to become the most important and most accessed of all online newspaper repositories was launched a few days ago by the Library of Congress.

    It's called Chronicling America and it’s an offshoot of the old United States Newspaper Program, a cataloguing project that’s been running since 1980. The long-term aim is to digitise and make freely available complete runs of representative newspapers from all parts of the United States. For now, the project consists of a prototype site containing runs of only two dozen titles, and it covers only the years 1900-1910.

    The entire archive is fully keyword-searchable, though, and even in its prototype state it’s clear that it has enormous potential to allow more detailed, more comprehensive, and much faster newspaper research than poor old Fort, ruining his eyes scanning endless runs of unindexed paper originals, can possibly have dreamed of. The LoC has arranged for grants to be made to libraries taking part in the project, and those libraries, which have chosen the titles, have made a commendable effort to pick a wide and representative selection of papers for the prototype site – the early ethnic weekly The Colored American is among the papers chosen for preservation in this way. So far, titles from California, the District of Columbia, Florida, Kentucky, New York, Utah, and Virginia are represented; a heavy preponderance come from Washington, and only one daily, the New York Sun, represents the Empire State. The choice of the long-defunct Sun, incidentally, points up a significant limitation of the project as a whole; now that the value of digitisation is becoming widely recognised, major papers such as The Times, the Scotsman and the New York Times are increasingly putting their own archives online through paid-for sites. Free, public sites such as Chronicling America are only going to be able to digitise out-of-copyright papers from publishers that are no longer around.

    What, then, of the site itself? The first thing to make clear is that, even in its prototype state, Chronicling America is truly impressive in its scale and ambition. From a purely technical point of view, however, there are certainly drawbacks to the way the LoC and its partners have produced the site. To begin with, they’ve chosen to develop a new user interface rather than licensing a tried-and-tested alternative such as the Olive Software package used by several highly successful existing sites – the superb Brooklyn Daily Eagle online resource is one example of a digitisation programme run via Olive. This means, unfortunately, that while the Eagle site responds to searches by displaying clips of headlines, and gives a word count for each article as well as the date of publication, the LoC’s results are limited to just newspaper title and date. This makes it impossible to gauge the usefulness of each article without accessing the page concerned – a process that means sitting through a 30- or 40-second load per page. It sounds a small gripe, and it is – no one who’s ever searched a run of newspapers on microfilm is going to begrudge the relatively short time it takes to access text via the LoC site – but those seconds do add up when scrolling through results that can run into the hundreds.

    A far more serious problem came to light when I did some tests to evaluate how accurately Chronicling America indexes its material. Choosing one sample article from the New York Sun (8 October 1910 p.2) – an account of a shooting in New York’s Café Maryland involving a femme fatale by the name of Ida the Goose – I found that the system obstinately refused to identify the article when responding to ‘exact’ searches for “Ida the Goose” and “Café Maryland” (the latter both with and without its accent) or even when entering the same search phrases into the “with ALL of the words...” box. Case sensitivity doesn’t seem to be the issue, and I still don’t know why the only search phrase that successfully brought up the article was the name of the Maryland’s former owner, a renowned fixer for New York’s confidence men who went by the memorable moniker of Dan the Dude. I’ve accessed other articles since then that contain words and phrases I’ve searched for previously without turning up the newspaper page in question, which implies the indexing problem could be a significant one. It’s understandable, of course – minuscule text, scanned in at least some cases from poor quality microfilm rather than original copies, is a nightmare for any OCR. But digitisation projects live and die by the quality of their searches; to get such discouraging results from a site with the long-term importance of Chronicling America is frankly worrying.
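    To illustrate why an ‘exact’ search can miss a page whose OCR text is slightly garbled, here is a minimal Python sketch. It is not how Chronicling America’s index works; the garbled sample text, the function names and the 0.8 similarity threshold are all invented for illustration, and the fuzzy comparison simply uses the standard library’s difflib.

```python
from difflib import SequenceMatcher


def exact_match(phrase: str, text: str) -> bool:
    """Case-insensitive exact-phrase search, as an 'exact phrase' box might do."""
    return phrase.lower() in text.lower()


def fuzzy_match(phrase: str, text: str, threshold: float = 0.8) -> bool:
    """Slide a window of the same word count across the text and accept
    any window whose similarity to the phrase reaches the threshold."""
    target = phrase.lower()
    words = text.lower().split()
    n = len(target.split())
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            return True
    return False


# Hypothetical OCR output: two characters of the name have been misread.
ocr_text = "shots were fired at lda the Gooso in the Cafe Maryland"

print(exact_match("Ida the Goose", ocr_text))   # the exact search misses the page
print(fuzzy_match("Ida the Goose", ocr_text))   # a tolerant comparison finds it
```

    Two misread characters are enough to defeat the exact search, while a comparison that tolerates small differences still finds the name. The trade-off is that a lower threshold would start to admit false hits.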

    My other real gripe with the new site concerns the way in which it displays texts. Results are presented as full page TIFF files via Flash, at a relatively decent resolution – 400dpi – rather than as single stories; it’s then up to the reader to identify the area of each page required. This is an interesting choice – Olive Software presents single articles rather than whole pages – and it’s arguable that there are advantages to seeing stories in the context of the page. But identifying and zooming in on the stories you want involves further waits for loading and, in the case of longer articles, a certain amount of fiddling around with the magnification settings. Moving around the page at higher magnifications, meanwhile, is an art that took me half a day to master; keywords are highlighted in red, which makes spotting the stories you are after easy enough, even on a densely-filled, barely headlined early twentieth century broadsheet page. But it took me some time to realise that the best way to move was to click and drag each page around within the Flash window, and even that’s a relatively time-consuming business.

    Two other problems are worth mentioning. First, and bizarrely, the newspapers – scanned as they are from black and white film – are presented, in effect, in colour; that is, they print with a curious dull yellowish background colour behind the text. Not only does this make them harder to read; it’s also a real problem for anyone making copies of material via a home colour printer, because printing even a few dozen pages of text will make unexpected demands on users’ stocks of cyan and yellow ink; researchers should certainly consider switching to black and white printing before using this site. Secondly, while it is possible to zoom in on particular articles of interest and print just the text that interests you at high magnification, actually saving these blown-up stories rather than the whole page isn’t as straightforward as it might be. Clicking on the brown ‘More options’ bar that sits just over the viewing window does bring up an option to ‘Save as pdf’, but this option saves entire pages rather than single stories.

    The site's Help section does imply it’s possible to save smaller sections this way using a scaling tool, but my own experience is that the easiest way of pdfing single stories is to zoom in on the text that interests you, order the system to print (which brings up your selection in a single window with a handy caption identifying source, date and page number) and then hit the ‘Save as pdf’ button on the print menu. Oh, and just to complicate matters further (and for reasons that are beyond me but are presumably a function of the need to save space), all the site’s pdfs are presented at a mere 150dpi, which makes it harder than it needs to be to save good quality copies of stories from the site.

    There are plenty of good things about Chronicling America as well, of course. The reporting of initial results is fast, given the size of the archive being searched; negative results usually take no more than a couple of seconds to return. And there are more search options than are usually found on sites of this sort – it’s possible to hunt for ‘any of the words...’, ‘all of the words...’, ‘the exact phrase...’ and ‘with the words ... within 5 words of each other’. This last option is a valuable innovation that I hope will be widely copied – it’s especially useful for locating articles concerning two people who once worked together, which is something I’ve been busy trying to do this month. In time, Chronicling America will grow, and it’s hard to overstate how important the site will become once it offers material across the whole span of US history. Even as it stands, though, I’d encourage all Forteans and historians with the least interest in the years 1900-1910 to give it a go. The sheer quantity of raw data already available is so vast that few will be disappointed. A search for Charles Fort’s name (entered as an ‘exact phrase’), for instance, turns up several hitherto unknown mentions of the man and his work, all dating from well over a decade before he first attained any sort of prominence. Among them are the Salt Lake Herald’s review of the Fort short story ‘Christmas waifs’, published in Smith’s Magazine (‘Whimsical, humorous and sympathetic’), and an advert for Fort’s one full-length work of fiction, The Outcast Manufacturers, placed by the publisher, BW Dodge, in the New York Sun of 1 April 1909 (‘A novel of slum life that is going to be talked about, unique in plot and method. “A series of vivid fragments of description, as incisive and forceful as a physical blow.”’)

    Hopefully the National Digital Newspaper Program will make rapid strides from now on. There’s certainly a lot more work in progress, and some projects that have already been completed are yet to appear online – a look at the proposal prepared by the University of Florida as its pitch for a grant from the NDNP shows that several other titles from that state – the Tampa Tribune, for example, and the Florida Dispatch for 1869-1888 – have already been digitised; no doubt similar, or greater, progress is being made in other states. New digitisation projects are being announced and coming online all the time – it’s already getting hard to keep track of them all – and it seems clear that, within a decade, several thousand titles and several tens of millions of pages will be available over the net, many of them – thanks be to the liberal governments and libraries of the world – absolutely free.

    Newspaper digitisation has already transformed the way I work and (not least for Forteans, for whom newspapers are life’s blood) the new technology is revolutionising entire fields of study. It’s a sobering thought that, by the time the process is completed, it will be possible for even a novice researcher, equipped with a list of keywords, to duplicate in a matter of hours work it took Fort months and years to complete – and to do so cheaply, easily, and without leaving the comfort of his or her own home.
