Friday, October 19, 2012

I scanned a book ...

... all of it, for the first time yesterday. It's a book called "Short Papers on Church History, Vol. 1", by Andrew Miller. The preface is dated 1873, and it runs to about 600 pages with no illustrations.

What's the point? For some years, I've been bugged by the fact that there is a vast treasure-trove of literature locked up in books that are no longer available. Other people are addressing this in a more organised manner - Project Gutenberg prepares and makes available electronic versions of out-of-copyright books. I've done some "volunteering" for them through the PGDP (Distributed Proofreaders) website - I did the post-processing of "A Scout of To-day" and "The Captive in Patagonia", and I'm hoping I've done the post-processing of a 1920s Pharmacopeia well enough that it will be published soon.

The Internet Archive is also endeavouring to scan old books from libraries and wherever else they can find them, and to make the scans available - these scans are in turn used as source material to be OCRed for Project Gutenberg. Google are doing similar things - however, as a corporation rather than a public concern, they are less interested in making the out-of-copyright material they capture publicly available.

Having volunteered for PG, I have a lot of sympathy with their aims and what they are trying to do. However, I also disagree with the approach they have taken in some areas. For example, the downside of leaving their output "open" is that there's technically little to stop people grabbing the text, chopping off the PG bits, and then "publishing" it as "their" e-book. There's not a lot of money to be made from this, but neither is there any effort or real risk involved. PG could pre-empt this by publishing their own copies directly on Amazon (for example), charging a token fee (which could be ploughed back into the foundation) and using their space on Amazon to highlight the care taken over their transcriptions. But since the project was founded on the principle of being "freely available" - a laudable one - this is a direction they are reluctant to move in. Unfortunately, taking themselves out of the marketplace in this way does them few favours.

I also think they are excessively careful in their process of text production. The ideal is for a text to go through three rounds of proofreading - done VERY thoroughly - followed by two rounds in which formatting is put back into the text. Finally, there is post-processing, where the HTML and TXT versions are generated. All well and good - but the fact of the matter is that this is being done far more carefully than the proofreading by which the books were originally prepared. A significant amount of time is therefore spent wondering whether or not to preserve errors and inconsistencies in the original. That is an important aspect of palaeography, as a friend pointed out yesterday, but it is less relevant in the era of large-volume printing. Timewise, the process is dominated by scanning for minor errors - commas transcribed as full stops, digit 1s transcribed as letter ls, and so on. This is where new PGDP volunteers start, and unfortunately, large numbers never get beyond this stage. Whilst it is an important part of the process, the fact that so many of the people interested enough to volunteer don't last very long is a problem.

The final issue is "voluntary sector defensiveness" - in conversations on their forums, I found that people immersed in the system rapidly got prickly when I commented on the issues I saw. I understand this - it's as irritating as anything to have newbies telling you that you're doing things wrong - but at the same time, the failure of the website to convert large numbers of enthusiastic people into long-term volunteers is a big issue.
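To give a flavour of the sort of checking this involves, here is a rough sketch in Python of how a few common "scanno" patterns might be flagged automatically. It is purely illustrative - the patterns and the checker are my own, not part of the DP toolchain - and it would only produce candidates for a human to review, false positives and all.

import re

# A few patterns for common "scannos" - OCR misreadings that proofreaders
# look for. These are my own illustrative guesses, not the checks DP uses,
# and they will throw up false positives; the aim is only to build a
# worklist for a human to go through.
SCANNO_PATTERNS = [
    (re.compile(r"\b[0-9]*l[0-9]+\b"), "letter 'l' inside a number (probably digit 1)"),
    (re.compile(r"\b1(?=[a-z]{2,})"), "digit 1 starting a word (probably letter l)"),
    (re.compile(r",\s+[A-Z][a-z]"), "comma before a capitalised word (possibly a misread full stop)"),
    (re.compile(r"\btbe\b"), "'tbe' (often a misread 'the')"),
]

def flag_scannos(text):
    """Yield (line_number, description, line) for each suspicious line."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, description in SCANNO_PATTERNS:
            if pattern.search(line):
                yield lineno, description, line.strip()

if __name__ == "__main__":
    sample = "In l873 the church had grown, Its members met weekly."
    for lineno, description, line in flag_scannos(sample):
        print("line %d: %s: %s" % (lineno, description, line))

A real check would also have to cope with legitimate cases (abbreviations, proper nouns after commas, and so on), which is exactly why a human stays in the loop.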

New OCR software is much more adept at recognising text than it used to be. My scanning of the book yesterday was partly to see how ABBYY FineReader 10 coped. The answer is: pretty well. From a couple of hours of scanning, I ended up with a pre-proofread text of the whole book. Typically it generated about 5-10 queries per page - a substantial checking requirement across 600 pages - and there are, obviously, formatting issues to sort out as well. However, as a proportion of the text it's low. It leads me to think that it would be possible to produce an electronic text from a scanned book in a manageable time-frame, without having to do it the PG way. Perhaps it won't come close to the perfection of PG books - see the challenge laid down by Michael Hart regarding Carroll's "Alice" books. But it would at least help to get the texts of old, obscure books into the public arena.
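For a rough sense of the scale of checking mentioned above, the figures translate into something like the following (the 30 seconds per query is purely my own guess at an average resolution time, not a measured figure):

pages = 600
for queries_per_page in (5, 10):
    total_queries = pages * queries_per_page
    hours = total_queries * 30 / 3600.0  # assume ~30 seconds to resolve each query
    print("%d queries/page: %d queries, roughly %d hours of checking"
          % (queries_per_page, total_queries, hours))

That comes out at roughly 3,000-6,000 queries, or somewhere between 25 and 50 hours of checking - a lot, but the kind of spare-time project one person could realistically finish.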
