Friday, July 18, 2014

Scan/OCR/Proofread

It's quite feasible to convert a physical book into an electronic book. However, it's a multi-stage process.

The first stage is scanning the physical text. Here's a scanned image from a book called "A Memoir of Adolph Saphir D.D.".


Next, Optical Character Recognition (OCR) software has to be used to convert this image into text. This is pretty intensive in terms of computer power. I use ABBYY FineReader 10 Professional Edition. Here's a sample of its output (though much of the formatting has been lost in copying from a Word document into Blogger):
PREFACE.
TT has been impossible to publish sooner the Memoir of the lamented Dr. Adolph Saphir. On account of his sudden death, which followed so closely that of his wife, there was a delay in the settlement of his affairs; and, consequently, no access could be had to documents of any kind till about the middle of last year—a year after his death. When I was then asked to write the Memoir, much time and labour were required to collect letters and documents from friends and correspondents of Dr. Saphir. But though there has consequently been delay, the Memoir will, I believe and hope, be not less valued by devoted friends, of whom he had very many, nor less interesting to the general public.
A good-quality scan makes a difference: by comparing the image and the text, you can see how good a job the software has done in "reading" the image.
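
Once a proofread version of a passage exists, the same comparison can be made between the raw OCR text and the corrected text. Here's a minimal Python sketch using the standard difflib module. The function name ocr_differences and the corrected reading "IT" are my own assumptions for illustration; none of this is part of FineReader.

```python
import difflib

def ocr_differences(ocr_text, proofread_text):
    """Return (ocr_words, proofread_words) pairs wherever the two texts differ."""
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    # Compare word by word rather than character by character.
    matcher = difflib.SequenceMatcher(None, ocr_words, good_words)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2]))
            )
    return differences

ocr_sample = "TT has been impossible to publish sooner"
# Assumed correction, for illustration only.
proofread_sample = "IT has been impossible to publish sooner"
print(ocr_differences(ocr_sample, proofread_sample))
```

The fewer pairs that come back, the better a job the software did on that passage.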

However, the most intensive stage is still to come: proofreading the text that has been produced. FineReader will highlight places where it was unsure about the conversion from image to text, which means that the file can be edited directly in the software. Alternatively, a rough word-processor file can be used as a starting point and checked against the original document. In either case, the Scan/OCR stages are pretty much just a question of getting round to them and then letting the computer run. The proofreading stage is a project in its own right.
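
If you go the word-processor route, you lose FineReader's highlighting, but a rough substitute is possible. Here's a small Python sketch that flags words missing from a word list so they can be checked against the original page. It is not how FineReader decides what to highlight; the function name flag_suspect_words and the tiny word list are assumptions for illustration, and a real run would need a full dictionary plus the names that appear in the book.

```python
import re

def flag_suspect_words(text, known_words):
    """Return (position, word) pairs for words that are not in known_words."""
    suspects = []
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group()
        if word.lower() not in known_words:
            suspects.append((match.start(), word))
    return suspects

# Hypothetical word list, far too small for real use.
known_words = {"it", "has", "been", "impossible", "to", "publish", "sooner"}
sample = "TT has been impossible to publish sooner"
for position, word in flag_suspect_words(sample, known_words):
    print(f"check '{word}' at character {position}")
```

This only narrows down where to look. Misreadings that happen to be real words will slip through, so the text still has to be read against the scan.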
