Adding positionally aware ocr to a djvu scan

strider1551 · Post by **strider1551** » 05 Mar 2010, 15:25

This started innocently enough when I was looking for a way to add OCR text to a djvu file from a Linux command line. Thanks to a random blog page, I found working code that takes the box (character position) output from tesseract and converts it into a format that the djvu commands understand. I offer here a hastily put together proof-of-concept that takes a djvu file and adds OCR information complete with word positions. This means that when you search for a word, you can be shown exactly where it is on the page:

example.jpg: Example of the final result in a document viewing program; (186.98 KiB) Downloaded 1917 times

=====

Files

source.zip: (1.25 KiB) Downloaded 985 times

Dependencies
I threw this together in a Linux environment, and I doubt this proof-of-concept will work elsewhere, but who knows.
- djvulibre commands (ddjvu and djvused)
- tesseract
- perl and python

Howto
Place the two source files (format.pl and djvu-ocr.py) in a directory, along with the single djvu file you want to test this out on. Once that is in place, simply open a terminal to that directory and run "python ./djvu-ocr.py". Once it finishes the original djvu file should be updated with the ocr information.
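
For anyone curious what the conversion step amounts to, here is a minimal sketch in Python. It is not the attached format.pl, just an illustration under some assumptions: that the box file has one line per character in the form "char left bottom right top", that the boxes come in the same order as the non-space characters of the plain text output, and that a flat page/word hierarchy is enough for the hidden text layer. The function names are made up for this example.

```python
def parse_box_line(line):
    """Parse one box-file line: 'char left bottom right top [page]'."""
    parts = line.split()
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return parts[0], (left, bottom, right, top)


def merge_boxes(boxes):
    """Smallest rectangle that covers every character box of a word."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def group_chars_into_words(char_boxes, text):
    """Assumption: one box per non-space character, in reading order."""
    boxes = iter(char_boxes)
    words = []
    for word in text.split():
        word_boxes = []
        for _ in word:
            item = next(boxes, None)
            if item is None:  # ran out of boxes: stop early
                return words
            word_boxes.append(item[1])
        words.append((word, merge_boxes(word_boxes)))
    return words


def escape(text):
    """Escape backslashes and double quotes for a quoted string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def words_to_sexpr(words, page_width, page_height):
    """Build the text that djvused's set-txt command reads."""
    lines = ["(page 0 0 %d %d" % (page_width, page_height)]
    for text, (x0, y0, x1, y1) in words:
        lines.append(' (word %d %d %d %d "%s")' % (x0, y0, x1, y1, escape(text)))
    lines.append(")")
    return "\n".join(lines)
```

Both the tesseract box file and the djvu text layer measure coordinates from the bottom-left corner of the page, so in this sketch no flipping is needed. Real box files can be messier than assumed here (merged or split characters), which this sketch does not handle.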

=====

I will definitely be taking the time to make a respectable command-line program out of this for my own use. If there's any interest, I will add it to this thread.

Post by **daniel_reetz** » 05 Mar 2010, 16:07

That's a really great feature. I know my friend Andy Filer once tried to do that with Holt Weekly News, where he scanned the entire run of a small town newspaper and tried to make it searchable online.

Try this link for an example:

http://holtweeklynews.com/images/show/3 ... text=books

(you can see that the text box is a little off).

Thanks for jumping right in with the cool software efforts.

Misty · Post by **Misty** » 05 Mar 2010, 16:21

OCRopus's hOCR format is good for that, too. It stores positional information in an HTML-based file format, which can then be used with other software to convert into PDFs/other formats with embedded highlightable OCR. ABBYY, which is commercial software, also does that.
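
As a rough illustration of what "positional information" means in hOCR: each element carries a title attribute containing "bbox x0 y0 x1 y1", measured from the top-left corner of the page image. The sketch below pulls that box out and flips it to a bottom-left origin, which is what djvu's text layer uses. The function name is hypothetical and the page height has to be supplied by the caller.

```python
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_bbox_to_djvu(title, page_height):
    """Convert an hOCR 'bbox x0 y0 x1 y1' (top-left origin) to
    (xmin, ymin, xmax, ymax) with a bottom-left origin."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    x0, y0, x1, y1 = (int(g) for g in match.groups())
    # y grows downward in hOCR, upward in djvu, so swap and flip
    return (x0, page_height - y1, x1, page_height - y0)
```

For example, hocr_bbox_to_djvu("bbox 10 20 110 50", 1000) gives (10, 950, 110, 980).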

Art Rhyno, one of the owners of the Essex Free Press community newspaper, has been working on scanning his archives using ABBYY for positional info - you can see examples here. The results are quite accurate, more so than in Dan's examples. He showed an example of it with OCRopus at a session I went to back in June, so it looks like that could be a workable option too.

Post by **daniel_reetz** » 05 Mar 2010, 16:24

Great stuff, Misty! You know, I swear that's what a couple other projects were using, too... maybe the BKRPR people? I saw a demo once of a firefox browser that showed texts line-by-line with the scans. Also, I know Rob knows a bit about how Distributed Proofreading does things...


Misty · Post by **Misty** » 05 Mar 2010, 17:05

It might be out of date, but they say that they plan on implementing OCRopus. It looks like the OCR hasn't actually been integrated into their software yet.

rob · Post by **rob** » 05 Mar 2010, 23:53

Well, DP doesn't use the page images in the final product, since they feed Project Gutenberg, which is txt, html, or epub -- and I think tex for math.

Google books definitely does the same thing.

The only thing I hate about positional OCR is that if the OCR is off, you still can't search for your term. I would much rather get a clean OCR...

--Rob

dtic · Post by **dtic** » 27 Mar 2010, 07:36

Thanks Strider1551, very useful! I've now made a Windows script (written in AutoHotkey) that follows the same steps. It drives djvulibre, ImageMagick, tesseract, and Strawberry Perl running the perl script above. Takes scan tailor tiff files as input and outputs an OCR'ed djvu file. All with only one manual step. I'll post it after some more test driving.

strider1551 · Post by **strider1551** » 27 Mar 2010, 08:20

dtic wrote:Takes scan tailor tiff files as input and outputs an OCR'ed djvu file. All with only one manual step.

Looks like you're one step ahead of me. I reworked everything into a python script (attached for what it's worth), but it still relies on something else creating the djvu file - it just adds ocr from either the images in the djvu file or from another directory.

One important thing worth changing is the tesseract options that the blog used. Replacing "batch.nochop" with "batch" has resulted in far more accurate ocr text. I only wish tesseract had some documentation that could explain what "nochop" tries to do.

djvuocr.zip: (2.68 KiB) Downloaded 864 times

dtic · Post by **dtic** » 27 Mar 2010, 09:42

Ok, I'll try switching to "batch" for some test runs and compare the results. I'm completely new to tesseract (and to OCR in general) so I don't know what the two options are supposed to do differently.

Here are the steps I work through:

1. loop for each tiff in folder:
- djvulibre: cjb2.exe make djvu
2. djvulibre (old): djvm.exe merge djvu
3. djvulibre: djvused.exe count pages
4. loop for each page:
- 4a. djvulibre: ddjvu.exe extract compressed tiff page
- 4b. imagemagick: convert.exe uncompress tiff
- 4c. tesseract.exe OCR to page_box
- 4d. tesseract.exe OCR to page_txt
- 4e. strawberry perl: format.pl make "word position list"
- 4f. djvulibre: djvused.exe import to djvu page
5. cleanup temp files
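
The per-page part of the steps above (4a to 4f) could be sketched in Python roughly like this. It only builds the command lines and does not run them. The file names are hypothetical, the executables are assumed to be on the PATH, and the tesseract config names ("batch", "makebox") are simply the ones mentioned in this thread, so check them against your tesseract version. Step 4e is left out because the arguments that format.pl expects are not spelled out here.

```python
def extract_page_cmd(djvu, page, tiff):
    # 4a: render one page of the djvu file to a tiff
    return ["ddjvu", "-format=tiff", "-page=%d" % page, djvu, tiff]


def uncompress_cmd(src, dst):
    # 4b: rewrite the tiff without compression (ImageMagick)
    return ["convert", src, "-compress", "none", dst]


def ocr_cmds(tiff, base, config="batch"):
    # 4c and 4d: one run for character boxes, one for plain text
    return [
        ["tesseract", tiff, base + "_box", config, "makebox"],
        ["tesseract", tiff, base + "_txt", config],
    ]


def import_text_cmd(djvu, page, sexpr_file):
    # 4f: attach the word position list to the page and save
    script = "select %d; set-txt %s" % (page, sexpr_file)
    return ["djvused", djvu, "-e", script, "-s"]


def commands_for_page(djvu, page):
    base = "page%04d" % page  # hypothetical temp file naming
    cmds = [
        extract_page_cmd(djvu, page, base + "_c.tif"),
        uncompress_cmd(base + "_c.tif", base + ".tif"),
    ]
    cmds += ocr_cmds(base + ".tif", base)
    # 4e (format.pl) would go here and is assumed to write base + ".sexp"
    cmds.append(import_text_cmd(djvu, page, base + ".sexp"))
    return cmds
```

Each list can be handed to subprocess.check_call once the missing step 4e is filled in. Keeping the commands as lists rather than one shell string avoids quoting problems with odd file names, although the file name inside the djvused script would still need care if it contains spaces.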

If possible, I'd like to move 4c and 4d before step 1 and then skip 4a and 4b completely. The catch is that the djvu creation step modifies the image. The words keep the same relative positions on the modified image, though, so it should be possible to "recalibrate" the word positions somehow. I just don't know how yet.
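
One possible way to "recalibrate", sketched in Python: if the djvu step only changes the pixel dimensions of the page (no cropping, rotation or added margins, which is an assumption I have not verified), each word box can be scaled by the ratio of the new size to the old size. The function name is made up for this example.

```python
def rescale_box(box, src_size, dst_size):
    """Scale (xmin, ymin, xmax, ymax) from an image of src_size
    (width, height) to an image of dst_size (width, height)."""
    x0, y0, x1, y1 = box
    sx = dst_size[0] / float(src_size[0])
    sy = dst_size[1] / float(src_size[1])
    return (int(round(x0 * sx)), int(round(y0 * sy)),
            int(round(x1 * sx)), int(round(y1 * sy)))
```

For example, a box of (100, 200, 300, 400) on a 1000x2000 scan becomes (50, 100, 150, 200) on a 500x1000 page. The two sizes would have to be read from the original tiff and from the finished djvu page.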

dtic · Post by **dtic** » 29 Mar 2010, 12:56

strider1551 wrote:One important thing worth changing is the tesseract options that the blog used. Replacing "batch.nochop" with "batch" has resulted in far more accurate ocr text. I only wish tesseract had some documentation that could explain what "nochop" tries to do.

Tested and agreed! There's a big difference.
