My workflow for almost djvubind-equivalent PDFs...

Share your software workflow. Write up your tips and tricks on how to scan, digitize, OCR, and bind ebooks.

Moderator: peterZ

ahmad
Posts: 24
Joined: 28 Dec 2010, 11:26

My workflow for almost djvubind-equivalent PDFs...

Post by ahmad »

djvubind is a brilliant application. However, the djvu format is not as widespread as one would like, and when distributing, it's nice to have a PDF backup plan. So I thought I'd share my technique for getting almost djvubind-equivalent results with PDF. I say 'almost' equivalent because the file size is still about twice that of an equivalent djvu. And, of course, it's a much more involved process...

This guide assumes you are running a bash shell on a UNIX-ish OS. So, without further ado, here are the steps:
  • OCR Scan Tailor output TIFFs using Cuneiform 0.8*. You can use Tesseract for this, but I find Cuneiform to be slightly more accurate with 'bad' input.
    *It has to be version 0.8, as this is the last version without the new hOCR formatting that breaks hocr2pdf compatibility - https://bugs.launchpad.net/cuneiform-linux/+bug/623438

    Code: Select all

    for img in *.tif; do cuneiform -f hocr -o "$img.hocr" "$img"; done
  • Create hidden-text-layer PDFs for each page using hocr2pdf

    Code: Select all

    for img in *.tif; do hocr2pdf -i "$img" -o "$img.pdf" < "$img.hocr"; done
  • Convert our cover page JPG into a PDF using ImageMagick. Replace <>s with your desired DPI values.

    Code: Select all

    convert cover_front.jpg -density <input jpeg DPI> -filter mitchell -resample <output pdf DPI> cover_front.pdf
  • Merge together our resulting PDFs

    Code: Select all

    pdfconcat -o merged.pdf ./*.pdf
  • Re-compress our PDF's images using pdfsizeopt (found at http://code.google.com/p/pdfsizeopt/)

    Code: Select all

    python pdfsizeopt.py --use-pngout=false --use-jbig2=true --use-multivalent=false merged.pdf
  • Extract metadata using pdftk

    Code: Select all

    pdftk merged.pso.pdf dump_data output pdf_meta.txt
  • Edit pdf_meta.txt in your editor of choice; guide to allowed parameters here (under 'A couple of infrequently used options'): http://www.linux.com/archive/feed/53701
  • Merge-in edited metadata

    Code: Select all

    pdftk merged.pso.pdf update_info pdf_meta.txt output book.pdf
And we're done!
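
If it helps, here is the whole sequence rolled into a single script. It's just a minimal sketch of the steps above: it assumes cuneiform, hocr2pdf, pdfconcat and pdftk are on your PATH, that pdfsizeopt.py sits in the working directory alongside your Scan Tailor TIFFs and cover_front.jpg, and the DPI values (300 here) are placeholders you'll want to change.

Code: Select all

#!/bin/bash
# Sketch of the workflow above -- adjust the DPI values and the path to
# pdfsizeopt.py for your own setup.
set -e

for img in *.tif; do
    cuneiform -f hocr -o "$img.hocr" "$img"            # OCR each page to hOCR
    hocr2pdf -i "$img" -o "$img.pdf" < "$img.hocr"     # single-page PDF with hidden text layer
done

# Cover page; replace 300/300 with your input/output DPI
convert cover_front.jpg -density 300 -filter mitchell -resample 300 cover_front.pdf

pdfconcat -o merged.pdf ./*.pdf                        # merge all page PDFs
python pdfsizeopt.py --use-pngout=false --use-jbig2=true --use-multivalent=false merged.pdf

pdftk merged.pso.pdf dump_data output pdf_meta.txt     # dump metadata...
"${EDITOR:-nano}" pdf_meta.txt                         # ...edit it by hand...
pdftk merged.pso.pdf update_info pdf_meta.txt output book.pdf   # ...and write the final PDF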

It took me a long time to get to a method that worked satisfactorily. I was trying to bypass the individual PDF creation, merging and compression steps using pdfbeads, but I couldn't get rmagick to compile on either of my systems. Out of interest, I did manage to install it on a Windows box; in terms of file size, this method seems just as good :)

And, in my mind, my method wins on philosophical grounds - it's much more UNIX-like :D
User avatar
reggilbert
Posts: 49
Joined: 28 Sep 2010, 19:57
Number of books owned: 3000
Location: Buffalo, New York

Re: My workflow for almost djvubind-equivalent PDFs...

Post by reggilbert »

Can someone explain to me the advantage, aside from file size, of complicated procedures like the above over simply importing page images into Acrobat? Importing (combining in Acrobat lingo) is just a few clicks and perhaps three minutes' wait for a 300-page book. Acrobat OCR is about ten minutes' wait for such a book. Surely every day that goes by makes the difference between a 20 MB output and an 8 MB output for a complete book less significant.

And while I have anyone's attention, what is the value-added of ScanTailor for DIY scanner jobs that do not need Tulon's new dewarping function? I have used the program to beautifully split two-page flatbed scans, but what's the value for single-page DIY scanner scans? Acrobat can do bulk cropping -- maybe four batch moves, two minutes' total time, are required for your standard books -- and Acrobat can perform excellent OCR on greyscale images (superior to its OCR on bitonal images). So can't an Acrobat-based workflow skip the ScanTailor step too? That also avoids the sometimes problematic conversion to black and white that I gather is necessary for ScanTailor's attached OCR function.

Thanks for any enlightenment.
User avatar
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States
Contact:

Re: My workflow for almost djvubind-equivalent PDFs...

Post by daniel_reetz »

reggilbert wrote:Can someone explain to me the advantage, aside from file size, of complicated procedures like the above over simply importing page images into Acrobat?
Not everyone has Acrobat, can afford Acrobat, or likes to use proprietary software.

No Acrobat on Linux, period.

Acrobat is not scriptable. These things may look complicated -- and they are -- but on platforms like Linux, procedures like these automate the process completely in the form of scripts.

reggilbert wrote:And while I have anyone's attention, what is the value-added of ScanTailor for DIY scanner jobs that do not need Tulon's new dewarping function? I have used the program to beautifully split two-page flatbed scans, but what's the value for single-page DIY scanner scans? Acrobat can do bulk cropping -- maybe four batch moves, two minutes' total time, are required for your standard books -- and Acrobat can perform excellent OCR on greyscale images (superior to its OCR on bitonal images). So can't an Acrobat-based workflow skip the ScanTailor step too? That also avoids the sometimes problematic conversion to black and white that I gather is necessary for ScanTailor's attached OCR function.

Thanks for any enlightenment.
Same answers as above, except that Scan Tailor can also remove lighting artifacts, automatically add margins, and deals very neatly with two-camera scanner output.

Second answer: Different strokes for different folks. Some people here do no post-processing. Some people do a lot. Sure, you can batch crop, but that's not everything. Scan Tailor measurably improves book scanning across the board. If you have an Acrobat-based workflow, that's great, you should post a separate thread describing your procedure for people who have Acrobat, can afford it, or want to use it.
User avatar
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States
Contact:

Re: My workflow for almost djvubind-equivalent PDFs...

Post by daniel_reetz »

Ahmad, this is really cool work, I hope to re-use your commands on Ubuntu... have you seen Misty's program? http://diybookscanner.org/forum/viewtop ... 7300#p7138
User avatar
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States
Contact:

Re: My workflow for almost djvubind-equivalent PDFs...

Post by rob »

That's extremely awesome, ahmad! I'll definitely plan to use this, but since I have Acrobat, I'll probably end up replacing the recompression part with Adobe's ClearScan. It does basically what djvu with a book-wide dictionary does (replaces near-similar letters with an index into a table). I've been able to get PDFs down to the same size as a djvu image that way.

Anyway, I've stickied this post because it's so useful.
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
User avatar
strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: My workflow for almost djvubind-equivalent PDFs...

Post by strider1551 »

Great work, ahmad. It looks very elegant.
rob wrote:I'll probably end up replacing the recompression part with Adobe's ClearScan. It does basically what djvu with a book-wide dictionary does (replaces near-similar letters with an index into a table). I've been able to get PDFs down to the same size as a djvu image that way.
I thought that ClearScan created a custom vectored font? I suppose in an abstract way that's similar to a shared dictionary, but going from raster to vectored is a huge difference... kinda like how apples and oranges are both good snacks.
User avatar
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States
Contact:

Re: My workflow for almost djvubind-equivalent PDFs...

Post by rob »

Oh, you are right about that -- I forgot. That means the document is scalable with little loss in legibility. As ceeann said in another thread, useful for those of us who are going to get old before age-related macular degeneration can be fixed.
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
E^3
Posts: 41
Joined: 12 Jul 2010, 21:06

Re: My workflow for almost djvubind-equivalent PDFs...

Post by E^3 »

Hello folks...

Nice work, Ahmad! I am now starting to embed it in my Gambas script.


Thanks

E^3
Anonymous1

Re: My workflow for almost djvubind-equivalent PDFs...

Post by Anonymous1 »

strider1551 wrote:Great work, ahmad. It looks very elegant.
rob wrote:I'll probably end up replacing the recompression part with Adobe's ClearScan. It does basically what djvu with a book-wide dictionary does (replaces near-similar letters with an index into a table). I've been able to get PDFs down to the same size as a djvu image that way.
I thought that ClearScan created a custom vectored font? I suppose in an abstract way that's similar to a shared dictionary, but going from raster to vectored is a huge difference... kinda like how apples and oranges are both good snacks.
Whoa. Is there any other software to do that? I was thinking about scripting this, but I had no idea that it already existed. Wow. So is each letter a copy of a master letter, or is it not that complicated? I didn't know that ClearScan did this...
User avatar
JonEP
Posts: 81
Joined: 19 Apr 2010, 15:09

Re: My workflow for almost djvubind-equivalent PDFs...

Post by JonEP »

Dear Ahmad, and all of you out there who are capable scripters,

Thanks for working on this. Is it possible, I wonder, that this work might eventually lead to some sort of GUI-based program that could be run on a Windows machine? Or are scripts like this destined to remain in the Linux world? I wish I were able to make sense of and use these tools, install Linux as a dual boot on my computer, and so on, but alas, I'll never be able to justify the time it would take to get to that level.

RE: reggilbert's question:
Can someone explain to me the advantage, aside from file size, of complicated procedures like the above over simply importing page images into Acrobat?
The file size issue is a really big one if you go down the Acrobat road, especially if you are outputting "mixed" image-and-text pages from Scan Tailor at 600 dpi. In that case, you are talking about a 250-page book taking up hundreds of MB. It becomes not just a hard-disk storage problem but also a transportability problem (want to upload one of those big files to a tablet over a 3G connection?)
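
To put a very rough number on that (back-of-the-envelope only; the 6x9 inch page size and ~10:1 compression ratio are just illustrative assumptions, not measurements):

Code: Select all

# 6x9" page at 600 dpi, 8-bit greyscale, as a rough illustration
echo $(( 6*600 * 9*600 ))              # ~19,440,000 pixels, i.e. ~19 MB raw per page
echo $(( 6*600 * 9*600 / 10 * 250 ))   # at ~10:1 compression, 250 pages is ~486 MB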