Adding positionally aware ocr to a djvu scan

Moderator: peterZ

strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Adding positionally aware ocr to a djvu scan

Post by strider1551 »

This started innocently enough when I was looking for a way to add OCR text to a djvu file from the Linux command line. Thanks to a random blog page, I found working code that takes the boxing information from tesseract and puts it into a format that the djvu commands understand. I offer here a hastily put-together proof of concept that takes a djvu file and adds OCR information complete with word positions. This means that when you search for a word, you can be shown exactly where it is on the page:
example.jpg
Example of the final result in a document viewing program
(186.98 KiB) Downloaded 1917 times
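For readers who want to see the idea without opening the attachment, here is a rough Python sketch of the conversion described above: merging tesseract's per-character boxes into per-word boxes and writing them in the s-expression text format that djvused reads. It is not the attached format.pl. The function names are made up for illustration, and it assumes exactly one box line per non-space character of the text output, which real tesseract output does not always honor.

```python
def parse_box_line(line):
    """Parse one box line: 'glyph left bottom right top [page]'."""
    parts = line.split()
    return tuple(int(v) for v in parts[1:5])


def union(boxes):
    """Smallest box that covers every box in the list."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def escape(text):
    """Escape backslashes and double quotes for a quoted djvused string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def words_to_djvused(text, box_lines, page_w, page_h):
    """Hypothetical helper: build a djvused page/line/word listing.

    Assumes one box per non-space character, in reading order.
    """
    boxes = [parse_box_line(l) for l in box_lines if l.strip()]
    out = ["(page 0 0 %d %d" % (page_w, page_h)]
    i = 0
    for line in text.splitlines():
        entries = []
        for word in line.split():
            span = boxes[i:i + len(word)]
            i += len(word)
            if not span:  # ran out of boxes: text and boxes disagree
                break
            entries.append((union(span), word))
        if entries:
            out.append(" (line %d %d %d %d" % union([e[0] for e in entries]))
            for box, word in entries:
                out.append('  (word %d %d %d %d "%s")' % (box + (escape(word),)))
            out.append(" )")
    out.append(")")
    return "\n".join(out)
```

Both tesseract box files and djvu text zones measure from the bottom-left corner of the page, so no vertical flip is needed here. Non-ASCII text may need extra escaping before djvused accepts it; that is left out of this sketch.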
=====

Files
source.zip
(1.25 KiB) Downloaded 985 times
Dependencies
I threw this together in a Linux environment, and I doubt this proof-of-concept will work elsewhere, but who knows.
- djvulibre commands (ddjvu and djvused)
- tesseract
- perl and python

Howto
Place the two source files (format.pl and djvu-ocr.py) in a directory, along with the single djvu file you want to test this on. Then open a terminal in that directory and run "python ./djvu-ocr.py". Once it finishes, the original djvu file should be updated with the ocr information.

=====

I will definitely be taking the time to make a respectable command-line program out of this for my own use. If there's any interest, I will add it to this thread.
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Adding positionally aware ocr to a djvu scan

Post by daniel_reetz »

That's a really great feature. I know my friend Andy Filer once tried to do that with Holt Weekly News, where he scanned the entire run of a small town newspaper and tried to make it searchable online.

Try this link for an example:

http://holtweeklynews.com/images/show/3 ... text=books

(you can see that the text box is a little off).

Thanks for jumping right in with the cool software efforts.
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: Adding positionally aware ocr to a djvu scan

Post by Misty »

OCRopus's hOCR format is good for that, too. It stores positional information in an HTML-based file format, which can then be used with other software to convert into PDFs/other formats with embedded highlightable OCR. ABBYY, which is commercial software, also does that.
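As a rough illustration of what that positional information looks like, the sketch below pulls word boxes out of hOCR markup and flips them into the bottom-left-origin coordinates that djvu text zones use. It assumes each word is a span with class ocrx_word, followed later in the same tag by a title attribute that starts with "bbox x0 y0 x1 y1" measured from the top-left corner. The function name is made up, and a regular expression is only a stand-in for a real HTML parser.

```python
import re

# Assumes class comes before title inside the span tag.
WORD_RE = re.compile(
    r'''<span[^>]*class=['"]ocrx_word['"][^>]*'''
    r'''title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>(.*?)</span>''',
    re.S)


def hocr_words(hocr, page_height):
    """Hypothetical helper: yield (xmin, ymin, xmax, ymax, text) per word.

    hOCR boxes count y downward from the top of the page, so y is flipped
    to count upward from the bottom.
    """
    for x0, y0, x1, y1, inner in WORD_RE.findall(hocr):
        text = re.sub(r"<[^>]+>", "", inner).strip()  # drop nested tags
        yield (int(x0), page_height - int(y1),
               int(x1), page_height - int(y0), text)
```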

Art Rhyno, one of the owners of the Essex Free Press community newspaper, has been working on scanning his archives using ABBYY for positional info - you can see examples here. The results are quite accurate, more so than in Dan's examples. He showed an example of it with OCRopus at a session I went to back in June, so it looks like that could be a workable option too.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Adding positionally aware ocr to a djvu scan

Post by daniel_reetz »

Great stuff, Misty! You know, I swear that's what a couple other projects were using, too... maybe the BKRPR people? I saw a demo once of a Firefox browser that showed texts line-by-line with the scans. Also, I know Rob knows a bit about how Distributed Proofreading does things...
friends.jpg (8.34 KiB) Viewed 18110 times
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: Adding positionally aware ocr to a djvu scan

Post by Misty »

It might be out of date, but they say that they plan on implementing OCRopus. It looks like the OCR hasn't actually been integrated into their software yet.
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Adding positionally aware ocr to a djvu scan

Post by rob »

Well, DP doesn't use the page images in the final product, since they feed Project Gutenberg, which is txt, html, or epub -- and I think tex for math.

Google Books definitely does the same thing.

The only thing I hate about positional OCR is that if the OCR is off, you still can't search for your term. I would much rather get a clean OCR...

--Rob
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Adding positionally aware ocr to a djvu scan

Post by dtic »

Thanks strider1551, very useful! I've now made a Windows script that follows the same steps (using the AutoHotkey language). It controls djvulibre, ImageMagick, Strawberry Perl + the perl script above, and tesseract. It takes Scan Tailor tiff files as input and outputs an OCR'ed djvu file, all with only one manual step. I'll post it after some more test driving.
strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Adding positionally aware ocr to a djvu scan

Post by strider1551 »

dtic wrote: Takes scan tailor tiff files as input and outputs an OCR'ed djvu file. All with only one manual step.
Looks like you're one step ahead of me. I reworked everything into a python script (attached for what it's worth), but it still relies on something else creating the djvu file - it just adds ocr from either the images in the djvu file or from another directory.

One important thing worth changing is the tesseract options that the blog used. Replacing "batch.nochop" with "batch" has resulted in far more accurate ocr text. I only wish tesseract had some documentation that could explain what "nochop" tries to do.
djvuocr.zip
(2.68 KiB) Downloaded 864 times
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Adding positionally aware ocr to a djvu scan

Post by dtic »

Ok, I'll try switching to "batch" for some testruns and compare it. I'm completely new to tesseract (and to OCR in general) so I don't know what the commands are supposed to do differently.

Here are the steps I work through:

1. loop for each tiff in folder:
- djvulibre: cjb2.exe make djvu
2. djvulibre (old): djvm.exe merge djvu
3. djvulibre: djvused.exe count pages
4. loop for each page:
- 4a. djvulibre: ddjvu.exe extract compressed tiff page
- 4b. imagemagick: convert.exe uncompress tiff
- 4c. tesseract.exe OCR to page_box
- 4d. tesseract.exe OCR to page_txt
- 4e. strawberry perl: format.pl make "word position list"
- 4f. djvulibre: djvused.exe import to djvu page
5. cleanup temp files
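
For anyone not on Windows, the per-page part of the list (steps 3 and 4a to 4f) could be sketched in Python roughly as follows. This is not the AutoHotkey script. The temp-file names and function names are invented, the flags are the ones I believe these tools accept (check your installed versions), and step 4e is left to a caller-supplied function because the arguments format.pl takes are not given in this thread.

```python
import subprocess


def page_count(djvu_path):
    """Step 3: ask djvused how many pages the document has."""
    out = subprocess.check_output(["djvused", "-e", "n", djvu_path])
    return int(out.decode().strip())


def page_commands(djvu_path, page, config="batch"):
    """Build the external commands for one page (steps 4a-4d and 4f).

    'config' is the tesseract config name discussed above ("batch" or
    "batch.nochop"). File names are hypothetical.
    """
    base = "page_%04d" % page
    return [
        # 4a: extract the page as a (compressed) tiff
        ["ddjvu", "-format=tiff", "-page=%d" % page, djvu_path, base + "_packed.tif"],
        # 4b: rewrite it uncompressed with ImageMagick
        ["convert", base + "_packed.tif", "-compress", "None", base + ".tif"],
        # 4c: character boxes
        ["tesseract", base + ".tif", base + "_box", config, "makebox"],
        # 4d: plain text
        ["tesseract", base + ".tif", base + "_txt", config],
        # 4f: import the word position list (written by step 4e) and save
        ["djvused", djvu_path, "-e",
         "select %d; set-txt %s.dsed" % (page, base), "-s"],
    ]


def run_page(djvu_path, page, make_word_list):
    """Run one page; make_word_list(page) stands in for step 4e."""
    cmds = page_commands(djvu_path, page)
    for cmd in cmds[:-1]:
        subprocess.check_call(cmd)
    make_word_list(page)
    subprocess.check_call(cmds[-1])
```

Building the commands as lists and running them separately keeps the sketch easy to inspect: print the result of page_commands() first to see exactly what would be executed.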

If possible, I'd like to move 4c & 4d before step 1 and then skip 4a & 4b completely. The problem is that the djvu creation step modifies the image. Since the words keep the same relative position on the modified image, it seems possible to somehow "recalibrate" the word positions, but I'm not sure how.
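
One possible way to "recalibrate", sketched in Python: if the only change the djvu step makes is a plain resize (no cropping, no shifted margins), each box can be scaled by the ratio of the new page size to the original one. Whether that assumption holds for a given encoder is not established in this thread, and the function name is made up.

```python
def rescale_box(box, src_size, dst_size):
    """Scale (left, bottom, right, top) from one image size to another.

    Assumes a pure resize: same content, same origin, no cropping.
    Sizes are (width, height) in pixels.
    """
    sx = dst_size[0] / float(src_size[0])
    sy = dst_size[1] / float(src_size[1])
    left, bottom, right, top = box
    return (int(round(left * sx)), int(round(bottom * sy)),
            int(round(right * sx)), int(round(top * sy)))
```

For example, a box found on a 1000x2000 tiff would be halved in both directions for a 500x1000 page. If both images have the same pixel dimensions, the boxes can be used unchanged.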
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Adding positionally aware ocr to a djvu scan

Post by dtic »

strider1551 wrote: One important thing worth changing is the tesseract options that the blog used. Replacing "batch.nochop" with "batch" has resulted in far more accurate ocr text. I only wish tesseract had some documentation that could explain what "nochop" tries to do.
Tested and agreed! There's a big difference.