Adding positionally aware ocr to a djvu scan
Moderator: peterZ
- strider1551
- Posts: 126
- Joined: 01 Mar 2010, 11:39
- Number of books owned: 0
- Location: Ohio, USA
Adding positionally aware ocr to a djvu scan
This started innocently enough when I was looking for a way to add OCR text to a djvu file from a Linux command line. Thanks to a random blog page, I found working code that could take the boxing information from tesseract and put it into a format that the djvu commands would understand. I offer here a hastily put together proof-of-concept that can take a djvu file and add OCR information complete with word positions. This means that when you search for a word, you can be shown exactly where it is on the page.
=====
Dependencies
I threw this together in a Linux environment, and I doubt this proof-of-concept will work elsewhere, but who knows.
- djvulibre commands (ddjvu and djvused)
- tesseract
- perl and python
Howto
Place the two source files (format.pl and djvu-ocr.py) in a directory, along with the single djvu file you want to test this out on. Then open a terminal in that directory and run "python ./djvu-ocr.py". Once it finishes, the original djvu file should be updated with the OCR information.
=====
I will definitely be taking the time to make a respectable command-line program out of this for my own use. If there's any interest, I will add it to this thread.
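For anyone curious what the conversion step roughly involves, here is a minimal sketch, not the attached format.pl. It assumes tesseract's box file has one "char left bottom right top page" line per character, in the same order as the characters in the plain text output, and that both tesseract and djvu measure from the bottom-left corner. The function names are made up for illustration. The output is the kind of text that djvused's set-txt command reads.

```python
def parse_box_lines(box_text):
    """Parse tesseract box-file lines of the form 'char left bottom right top [page]'."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append((parts[0], left, bottom, right, top))
    return boxes


def group_into_words(boxes, text):
    """Merge per-character boxes into per-word boxes.

    Assumption: every non-space character in `text` has exactly one box,
    in the same order. Real OCR output may break this.
    """
    words = []
    pos = 0
    for word in text.split():
        chunk = boxes[pos:pos + len(word)]
        pos += len(word)
        if not chunk:
            break  # ran out of boxes
        words.append((word,
                      min(b[1] for b in chunk),   # leftmost edge
                      min(b[2] for b in chunk),   # lowest edge
                      max(b[3] for b in chunk),   # rightmost edge
                      max(b[4] for b in chunk)))  # highest edge
    return words


def to_djvused(words, page_width, page_height):
    """Format word boxes as a hidden-text expression for djvused set-txt."""
    def escape(s):
        # Simplification: only backslashes and double quotes are escaped.
        return s.replace("\\", "\\\\").replace('"', '\\"')

    lines = ["(page 0 0 %d %d" % (page_width, page_height)]
    for word, left, bottom, right, top in words:
        lines.append('  (word %d %d %d %d "%s")' % (left, bottom, right, top, escape(word)))
    return "\n".join(lines) + ")"
```

The result would be written to a file and applied to a page with something like djvused's "select N; set-txt file" commands, followed by saving the document.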
- daniel_reetz
- Posts: 2812
- Joined: 03 Jun 2009, 13:56
- E-book readers owned: Used to have a PRS-500
- Number of books owned: 600
- Country: United States
- Contact:
Re: Adding positionally aware ocr to a djvu scan
That's a really great feature. I know my friend Andy Filer once tried to do that with Holt Weekly News, where he scanned the entire run of a small town newspaper and tried to make it searchable online.
Try this link for an example:
http://holtweeklynews.com/images/show/3 ... text=books
(you can see that the text box is a little off).
Thanks for jumping right in with the cool software efforts.
Re: Adding positionally aware ocr to a djvu scan
OCRopus's hOCR format is good for that, too. It stores positional information in an HTML-based file format, which can then be used with other software to convert into PDFs/other formats with embedded highlightable OCR. ABBYY, which is commercial software, also does that.
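As a rough illustration of what that positional information looks like, here is a small sketch that pulls word boxes out of hOCR. It assumes word-level spans with class "ocrx_word" and a "bbox x0 y0 x1 y1" entry in the title attribute, as described in the hOCR spec; whether a given OCRopus version emits word-level spans is something to check. The class name HocrWords is made up for this example.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) tuples from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())

    def handle_data(self, data):
        # The first non-blank text after a word span is taken as the word.
        if self._bbox and data.strip():
            self.words.append((data.strip(),) + self._bbox)
            self._bbox = None


parser = HocrWords()
parser.feed("<span class='ocrx_word' title='bbox 10 20 50 40'>books</span>")
parser.close()
```

Note that hOCR boxes are measured from the top-left corner of the image, so the y values would need flipping before use in a format that measures from the bottom.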
Art Rhyno, one of the owners of the Essex Free Press community newspaper, has been working on scanning his archives using ABBYY for positional info - you can see examples here. The results are quite accurate, more so than in Dan's examples. He showed an example of it with OCRopus at a session I went to back in June, so it looks like that could be a workable option too.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
- daniel_reetz
- Posts: 2812
- Joined: 03 Jun 2009, 13:56
- E-book readers owned: Used to have a PRS-500
- Number of books owned: 600
- Country: United States
- Contact:
Re: Adding positionally aware ocr to a djvu scan
Great stuff, Misty! You know, I swear that's what a couple other projects were using, too... maybe the BKRPR people? I saw a demo once of a firefox browser that showed texts line-by-line with the scans. Also, I know Rob knows a bit about how Distributed Proofreading does things...
Re: Adding positionally aware ocr to a djvu scan
It might be out of date, but they say that they plan on implementing OCRopus. It looks like the OCR hasn't actually been integrated into their software yet.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
- rob
- Posts: 773
- Joined: 03 Jun 2009, 13:50
- E-book readers owned: iRex iLiad, Kindle 2
- Number of books owned: 4000
- Country: United States
- Location: Maryland, United States
- Contact:
Re: Adding positionally aware ocr to a djvu scan
Well, DP doesn't use the page images in the final product, since they feed Project Gutenberg, which is txt, html, or epub -- and I think tex for math.
Google books definitely does the same thing.
The only thing I hate about positional OCR is that if the OCR is off, you still can't search for your term. I would much rather get a clean OCR...
--Rob
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
Re: Adding positionally aware ocr to a djvu scan
Thanks Strider1551, very useful! I've now made a Windows script that follows the same steps (using the AutoHotkey language). It controls djvulibre, imagemagick, strawberry perl + the perl script above and tesseract. It takes scan tailor tiff files as input and outputs an OCR'ed djvu file, all with only one manual step. I'll post it after some more test driving.
- strider1551
- Posts: 126
- Joined: 01 Mar 2010, 11:39
- Number of books owned: 0
- Location: Ohio, USA
Re: Adding positionally aware ocr to a djvu scan
Looks like you're one step ahead of me. I reworked everything into a python script (attached for what it's worth), but it still relies on something else creating the djvu file - it just adds ocr from either the images in the djvu file or from another directory.
One important thing worth changing is the tesseract options that the blog used. Replacing "batch.nochop" with "batch" has resulted in far more accurate ocr text. I only wish tesseract had some documentation that could explain what "nochop" tries to do.
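To make the switch easy to test, the config name can be kept as a parameter. This is only a sketch of how the command line might be assembled: the argument order (image, output base, then config names) and the "makebox" config are my assumptions about the tesseract version in use, and tesseract_command is a made-up helper name.

```python
def tesseract_command(image, out_base, config="batch", make_box=False):
    """Build a tesseract argument list.

    `config` is the config name being compared ("batch" or "batch.nochop").
    `make_box` adds the "makebox" config, assumed here to request a box file.
    """
    cmd = ["tesseract", image, out_base, config]
    if make_box:
        cmd.append("makebox")
    return cmd


# The two runs per page: one for character boxes, one for plain text.
box_cmd = tesseract_command("page.tif", "page_box", make_box=True)
txt_cmd = tesseract_command("page.tif", "page_txt")
```

Each list can be handed to subprocess.call() as is, and passing config="batch.nochop" reproduces the old behaviour for a side-by-side comparison.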
Re: Adding positionally aware ocr to a djvu scan
Ok, I'll try switching to "batch" for some test runs and compare it. I'm completely new to tesseract (and to OCR in general) so I don't know what the commands are supposed to do differently.
Here's the steps I work through:
1. loop for each tiff in folder:
- djvulibre: cjb2.exe make djvu
2. djvulibre (old): djvm.exe merge djvu
3. djvulibre: djvused.exe count pages
4. loop for each page:
- 4a. djvulibre: ddjvu.exe extract compressed tiff page
- 4b. imagemagick: convert.exe uncompress tiff
- 4c. tesseract.exe OCR to page_box
- 4d. tesseract.exe OCR to page_txt
- 4e. strawberry perl: format.pl make "word position list"
- 4f. djvulibre: djvused.exe import to djvu page
5. cleanup temp files
If possible I'd like to put 4c & 4d before 1 and then skip 4a & 4b completely. The problem is that the djvu creation step modifies the image. The words should keep the same relative position on the modified image, though, so it seems possible to somehow "recalibrate" the word positions. I'm just not sure how.
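One possible way to do that "recalibration", as a sketch only: if the djvu step does nothing more than resize the page (no cropping, no shifting, which is an assumption I have not verified), then each word box can be scaled by the ratio of the two page sizes. The name rescale_boxes and the (text, left, bottom, right, top) tuple layout are made up for illustration.

```python
def rescale_boxes(words, src_size, dst_size):
    """Scale word boxes from the original tiff size to the djvu page size.

    words:    list of (text, left, bottom, right, top) in source pixels
    src_size: (width, height) of the image that was OCR'ed
    dst_size: (width, height) of the page inside the djvu file
    Assumes a pure resize with no cropping or offset.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx = dst_w / src_w  # horizontal scale factor
    sy = dst_h / src_h  # vertical scale factor
    return [(text,
             round(left * sx), round(bottom * sy),
             round(right * sx), round(top * sy))
            for text, left, bottom, right, top in words]


scaled = rescale_boxes([("books", 100, 200, 300, 400)], (2000, 3000), (1000, 1500))
```

If the two page sizes turn out to be identical, the boxes can be used unchanged and steps 4a and 4b could indeed be skipped.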
Re: Adding positionally aware ocr to a djvu scan
strider1551 wrote: One important thing worth changing is the tesseract options that the blog used. Replacing "batch.nochop" with "batch" has resulted in far more accurate ocr text. I only wish tesseract had some documentation that could explain what "nochop" tries to do.
Tested and agreed! There's a big difference.