Huddersfield Exposed @HuddersExposed
All 604 pages of the 1900 Huddersfield & District Postal Directory scanned and deskewed :-) I'm guesstimating a further 50 hours' work (~5 mins per page) to get to the final PDF version. Time to stick the kettle on...
Just in case anyone's interested in the workflow...

Step 1) Scan the pages! I use a specialist book scanner (Plustek OpticBook 4600) which lets me lay the page completely flat and scans right up to the edge. Scans are done at 600 DPI. It took about 10 hours to scan the book.
Step 2) Crop the pages. Rather than do this manually, I use the batch processing tool in Paint Shop Pro. It takes about 15 minutes to process the 600 pages.
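For anyone without Paint Shop Pro, a rough sketch of what an automatic crop can look like: find the bounding box of the dark pixels and trim to it. This is not the batch tool's method (the thread doesn't say how that works); the function names, the `dark` cut-off and the `margin` are all assumptions, and the page is a toy grid of greyscale values rather than a real scan.

```python
def content_box(page, dark=128, margin=1):
    """Return (top, left, bottom, right) around the dark pixels, padded by margin.

    `page` is a list of rows of 0-255 greyscale values (0 = black).
    `dark` and `margin` are illustrative defaults, not settings from the thread.
    """
    rows = [y for y, row in enumerate(page) if any(p < dark for p in row)]
    cols = [x for x in range(len(page[0])) if any(row[x] < dark for row in page)]
    if not rows or not cols:
        return 0, 0, len(page), len(page[0])  # blank page: keep everything
    top = max(rows[0] - margin, 0)
    left = max(cols[0] - margin, 0)
    bottom = min(rows[-1] + 1 + margin, len(page))
    right = min(cols[-1] + 1 + margin, len(page[0]))
    return top, left, bottom, right


def crop(page, box):
    """Cut the page down to the (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in page[top:bottom]]


# Toy 6x8 white page with two black pixels standing in for printed text.
page = [[255] * 8 for _ in range(6)]
page[2][3] = 0
page[3][4] = 0

box = content_box(page)
cropped = crop(page, box)
print(box, len(cropped), "rows x", len(cropped[0]), "cols")
```

On real scans a fixed crop rectangle applied to every page is often simpler, since stray marks near the edge would stretch a content-based box.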
Step 3) Deskew the pages. Try as you might, it's difficult to always get a perfectly aligned page when you scan. Deskewing (which took about an hour for 600 pages) tries to ensure the text is perfectly horizontal.
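The thread doesn't name the deskewing tool or how it works. One common way to estimate the skew angle is a projection profile: try a range of candidate angles, and pick the one where the ink piles up into the fewest rows. Below is a minimal sketch of that idea on a synthetic page; every name in it is made up for illustration, and it only estimates the angle (rotating the image is left to whatever imaging tool is in use).

```python
import math


def make_skewed_page(width=200, height=80, baselines=(20, 40, 60), skew_deg=2.0):
    """Toy page (1 = ink): each 'text line' rises to the right by skew_deg."""
    t = math.tan(math.radians(skew_deg))
    page = [[0] * width for _ in range(height)]
    for y0 in baselines:
        for x in range(width):
            page[int(round(y0 - x * t))][x] = 1
    return page


def profile_score(points, angle_deg, height):
    """Sum of squared row counts after shifting each ink pixel by x * tan(angle).

    The shift is a shear, which approximates a rotation for small angles.
    The score is highest when the ink is concentrated in few rows.
    """
    t = math.tan(math.radians(angle_deg))
    counts = [0] * height
    for x, y in points:
        row = int(round(y + x * t))
        if 0 <= row < height:
            counts[row] += 1
    return sum(c * c for c in counts)


def estimate_skew(page, candidates):
    """Return the candidate angle (degrees) that best flattens the text lines."""
    points = [(x, y) for y, row in enumerate(page) for x, p in enumerate(row) if p]
    return max(candidates, key=lambda a: profile_score(points, a, len(page)))


page = make_skewed_page()
candidates = [a / 2 for a in range(-6, 7)]  # -3.0 to 3.0 degrees in 0.5 steps
angle = estimate_skew(page, candidates)
print("estimated skew:", angle, "degrees")
```

The brute-force search is slow on a full 600 DPI page, so real tools usually work on a downsampled copy or use a different estimator altogether.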
At this point, I've got 604 cropped & deskewed page images saved in lossless PNG format, each about 15MB in size, so about 8.2GB in total (or enough to fill 2 DVD-ROMs!). No-one wants to download an 8GB PDF, so we need to considerably decrease the file sizes.
One option is to resize the images (to decrease the resolution) and to save them as lossy JPEGs, but you end up with page scans in the final PDF which lose clarity when you zoom in.
My preference is to try and retain as much of the quality of the original scan as possible, so...

Step 4) Automatically process the image to reduce it to 1 bit of colour per pixel (i.e. B&W). This takes some trial and error to find the optimal settings to leave clear, readable text.
Here's what the text looks like now. At this stage, I'll check how well certain letters look and I might re-run step 4 several times with different settings to see if I can improve the clarity...
One of the issues to tackle with this particular book is the vertical printing lines between some letters -- e.g. between the E and S of James. If I'm not careful, step 4 will make them stand out even more and that'll cause problems with the OCR stage :-S
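The thread doesn't say which tool or settings do the 1-bit reduction in step 4. As a stand-in, here is a minimal sketch of Otsu's method, a standard way to pick a global black/white threshold from the greyscale histogram. The function names and the toy pixel values are assumptions for illustration only.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximises between-class variance.

    Pixels at or below the threshold are treated as ink, the rest as paper.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below t
    sum_dark = 0  # sum of their grey values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(pixels, threshold):
    """Map greyscale values to 1 (ink) or 0 (paper)."""
    return [1 if p <= threshold else 0 for p in pixels]


# Toy data: 20 dark 'ink' pixels and 80 light 'paper' pixels.
pixels = [30] * 10 + [40] * 10 + [200] * 40 + [220] * 40
t = otsu_threshold(pixels)
bits = binarize(pixels, t)
print("threshold:", t, "ink pixels:", sum(bits))
```

A single global threshold is exactly the kind of setting that can make faint printing lines between letters come out solid black, so nudging the threshold up or down and re-checking problem letters is the manual counterpart of the trial and error described above.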
Step 5) Digital restoration. This is the time-consuming stage, but well worth doing. Old books are full of age spots and printing artefacts. If I've done step 4 well, it should minimise some of those. A keen eye and a steady hand are needed! Here's a before and after...
I reckon I'll average about 5 minutes per page doing the restoration. On problematic pages, I might end up having to recomposite individual letters and words to improve clarity (see the E of Joe in those previous images).
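The restoration described here is done by hand. Purely as an illustration of how some of the smallest spots could be cleared automatically beforehand, here is a sketch that removes ink blobs below a size limit from a 1-bit page. It is an assumption, not part of the workflow in the thread; `remove_specks` and `min_size` are made-up names.

```python
from collections import deque


def remove_specks(bitmap, min_size=3):
    """Return a copy of a 0/1 bitmap with ink blobs smaller than min_size cleared.

    Blobs are 8-connected groups of 1s (diagonal neighbours count as touching).
    """
    h, w = len(bitmap), len(bitmap[0])
    out = [row[:] for row in bitmap]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not bitmap[y][x] or seen[y][x]:
                continue
            # Flood-fill one blob, collecting its pixel coordinates.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if bitmap[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(blob) < min_size:
                for by, bx in blob:
                    out[by][bx] = 0
    return out


# Toy page: a 3-pixel vertical stroke plus one isolated speck.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_specks(page)
print(sum(map(sum, page)), "ink pixels before,", sum(map(sum, cleaned)), "after")
```

A size filter like this can't tell an age spot from a full stop or the dot of an i, which is one reason the keen eye and steady hand are still needed.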
By now, the page images are still at their original DPI scan resolution but the file size has dropped from around 15,360KB (i.e. 15MB) per page to around 400KB, so suitable for using in a PDF file.
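A quick back-of-envelope check of what those per-page figures mean for the whole book, using only the numbers quoted in the thread (604 pages, roughly 15,360KB before and 400KB after):

```python
PAGES = 604
RAW_KB = 15_360      # approximate per-page PNG size before step 4
BILEVEL_KB = 400     # approximate per-page size after steps 4 and 5

ratio = RAW_KB / BILEVEL_KB           # how many times smaller each page is
total_mb = PAGES * BILEVEL_KB / 1024  # rough size of all page images together

print(f"each page is about {ratio:.0f}x smaller")
print(f"all {PAGES} pages come to about {total_mb:.0f} MB")
```

So the page images shrink by a factor of roughly 38, and the whole book comes to roughly 236MB of images before any PDF overhead.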
The last stage is to create the final PDF. I'll do some testing to see which of the software packages I've got does the best job of converting the images back into OCR'd text (so that the PDF is searchable). Tesseract is often the winner en.wikipedia.org/wiki/Tesseract…
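If Tesseract does win out, one way to get a searchable PDF for a single page is its command-line tool with the `pdf` output config. The sketch below only builds and prints the command; it assumes Tesseract 4 or later is installed and on the PATH, and the file names are hypothetical placeholders.

```python
import shlex


def tesseract_command(image_path, output_base, lang="eng", dpi=600):
    """Build the argument list for one Tesseract run that writes output_base.pdf.

    `-l` selects the language data, `--dpi` tells Tesseract the scan resolution,
    and the trailing `pdf` config asks for a PDF with a hidden text layer.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "--dpi", str(dpi), "pdf"]


# Hypothetical file names, purely for illustration.
cmd = tesseract_command("page_0001.png", "page_0001")
print(shlex.join(cmd))

# To actually run it (requires Tesseract to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

This produces one PDF per page, so the per-page files would still need merging into the final document with a separate tool.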