My workflow for almost djvubind-equivalent PDFs...

Post by **ahmad** » 04 Jan 2011, 18:33

djvubind is a brilliant application. However, the djvu format is not as widespread as one would like, and when distributing, it's nice to have a PDF backup plan. So I thought I'd share with you my technique for getting almost djvubind-equivalent results with PDF. I say 'almost' equivalent because the filesize is still about 200% of an equivalent djvu. And, of course, it's a much more involved process...

This guide assumes you are running a bash shell on a UNIX-ish OS. So, without further ado, here are the steps:

OCR the Scan Tailor output TIFFs using Cuneiform 0.8*. You can use Tesseract for this, but I find Cuneiform to be slightly more accurate with 'bad' input.
*It has to be version 0.8, as this is the last version before the new hOCR formatting, which breaks hocr2pdf compatibility: https://bugs.launchpad.net/cuneiform-linux/+bug/623438
```
# Quote "$img" so file names containing spaces survive
for img in *.tif; do cuneiform -f hocr -o "$img.hocr" "$img"; done
```
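If the OCR loop is interrupted partway through a long book, rerunning it redoes every page. Here is a minimal sketch of one way to resume, assuming the `<page>.tif.hocr` naming used above. The helper name `pages_missing_hocr` is made up for this example and is not part of Cuneiform.

```shell
# pages_missing_hocr [DIR]: list TIFFs that have no non-empty .hocr file yet.
# Hypothetical helper; it only inspects file names and sizes.
pages_missing_hocr() {
    local dir="${1:-.}" img
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue                    # glob matched nothing
        [ -s "$img.hocr" ] || printf '%s\n' "$img"   # missing or empty hOCR
    done
}

# Usage (uncomment to run; needs cuneiform installed):
# pages_missing_hocr . | while IFS= read -r img; do
#     cuneiform -f hocr -o "$img.hocr" "$img"
# done
```

An empty `.hocr` file counts as missing here, on the assumption that it is a leftover from a failed run.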
Create hidden-text-layer PDFs for each page using hocr2pdf
```
# Quote "$img" so file names containing spaces survive
for img in *.tif; do hocr2pdf -i "$img" -o "$img.pdf" < "$img.hocr"; done
```
Convert our cover page JPG into a PDF using ImageMagick. Replace the <...> placeholders (angle brackets included) with your DPI values.
```
# ImageMagick expects the output file as the last argument
convert cover_front.jpg -density <input jpeg DPI> -filter mitchell -resample <output pdf DPI> cover_front.pdf
```
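To help pick the output DPI: `-resample` keeps the rendered page size and changes the pixel count, so the pixel dimensions scale by roughly output DPI over input DPI. A quick check with shell arithmetic follows. The helper name and the sample numbers are only for illustration, and ImageMagick's own rounding may differ by a pixel.

```shell
# resampled_px PIXELS IN_DPI OUT_DPI
# Approximate pixel count after resampling, rounded to the nearest integer.
resampled_px() {
    echo $(( ($1 * $3 + $2 / 2) / $2 ))
}

# A 2550-pixel-wide cover scanned at 300 DPI, resampled to 150 DPI:
resampled_px 2550 300 150    # prints 1275
```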
Merge together our resulting PDFs
```
pdfconcat -o merged.pdf ./*.pdf
```
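One thing to watch with the merge: the `./*.pdf` glob expands in sorted name order and matches every PDF in the directory, including a merged.pdf left over from an earlier run, so where cover_front.pdf lands depends on how the pages are named. Below is a sketch of building the list explicitly, assuming pages are named `<page>.tif.pdf` as produced above. `merge_inputs` is a made-up helper name.

```shell
# merge_inputs [DIR]: print the PDFs to merge, one per line, cover first.
# Assumes the cover is cover_front.pdf and pages end in .tif.pdf.
merge_inputs() {
    dir="${1:-.}"
    if [ -e "$dir/cover_front.pdf" ]; then
        printf '%s\n' "$dir/cover_front.pdf"
    fi
    for page in "$dir"/*.tif.pdf; do
        if [ -e "$page" ]; then      # skip the unexpanded pattern when nothing matches
            printf '%s\n' "$page"
        fi
    done
}

# Usage in bash 4+ (uncomment; needs pdfconcat installed):
# mapfile -t inputs < <(merge_inputs .)
# pdfconcat -o merged.pdf "${inputs[@]}"
```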
Re-compress our PDF's images using pdfsizeopt (found at http://code.google.com/p/pdfsizeopt/). The optimized file is written to merged.pso.pdf, which the remaining steps use.
```
python pdfsizeopt.py --use-pngout=false --use-jbig2=true --use-multivalent=false merged.pdf
```
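To see how much the recompression saved, or to check the "about 200% of a djvu" figure against your own book, the two file sizes can be compared directly. This is a small sketch using only `wc`. `size_percent` is a hypothetical helper name.

```shell
# size_percent NEW OLD: size of NEW as a whole-number percentage of OLD.
size_percent() {
    new=$(( $(wc -c < "$1") ))     # $(( )) also strips any padding wc adds
    old=$(( $(wc -c < "$2") ))
    if [ "$old" -le 0 ]; then
        echo "size_percent: $2 is empty or unreadable" >&2
        return 1
    fi
    echo $(( new * 100 / old ))
}

# Usage (uncomment once the files exist):
# size_percent merged.pso.pdf merged.pdf
```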

Extract metadata using pdftk

```
pdftk merged.pso.pdf dump_data output pdf_meta.txt
```

Edit pdf_meta.txt in your editor of choice. A guide to the allowed parameters is here (under 'A couple of infrequently used options'): http://www.linux.com/archive/feed/53701
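For scripted runs, the edit can also be done without opening an editor. The dump contains `InfoKey:` lines, each followed by an `InfoValue:` line, and the sketch below rewrites the value that follows a given key. `set_info_value` is a made-up helper. It only changes keys that already exist in the dump, and because the value is passed through `awk -v`, backslashes in it would be interpreted as escapes.

```shell
# set_info_value KEY VALUE FILE
# Print FILE with the InfoValue line after "InfoKey: KEY" replaced.
set_info_value() {
    awk -v key="$1" -v val="$2" '
        hit && /^InfoValue: / { print "InfoValue: " val; hit = 0; next }
        { hit = ($0 == "InfoKey: " key) }   # remember whether this line named our key
        { print }
    ' "$3"
}

# Usage (uncomment once pdf_meta.txt exists):
# set_info_value Title "My Book" pdf_meta.txt > pdf_meta.new && mv pdf_meta.new pdf_meta.txt
```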

Merge-in edited metadata

```
pdftk merged.pso.pdf update_info pdf_meta.txt output book.pdf
```

And we're done!
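The steps above can be chained into one script. What follows is a rough sketch, not a tested tool. `build_book`, `run` and `DRY_RUN` are names invented here, and pdfsizeopt.py is assumed to sit in the current directory as in the step above. The cover conversion and the metadata edit are left out because they need manual input. The merge uses `./*.tif.pdf` rather than `./*.pdf` so that a merged.pdf from an earlier run is not picked up.

```shell
# run CMD...: execute CMD, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# build_book: OCR, per-page PDFs, merge, recompress, dump metadata.
# Stops at the first failing step.
build_book() {
    for img in *.tif; do
        [ -e "$img" ] || continue
        run cuneiform -f hocr -o "$img.hocr" "$img" || return 1
        # hocr2pdf reads the hOCR on stdin, so it needs a real redirect
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf '%s\n' "hocr2pdf -i $img -o $img.pdf < $img.hocr"
        else
            hocr2pdf -i "$img" -o "$img.pdf" < "$img.hocr" || return 1
        fi
    done
    run pdfconcat -o merged.pdf ./*.tif.pdf || return 1
    run python pdfsizeopt.py --use-pngout=false --use-jbig2=true \
        --use-multivalent=false merged.pdf || return 1
    run pdftk merged.pso.pdf dump_data output pdf_meta.txt
}

# Usage: preview the commands first, then run for real.
# DRY_RUN=1 build_book
# build_book
```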

It took me a long time to arrive at a method that worked satisfactorily. I was trying to bypass the individual PDF creation, merging and compression using pdfbeads, but I couldn't get rmagick to compile on either of my systems. Out of interest, I did manage to install it on a Windows box; in terms of file size, this method seems just as good.

And, in my mind, my method wins on philosophical grounds: it's much more UNIX-like.

Post by **reggilbert** » 04 Jan 2011, 19:48

Can someone explain to me the advantage, aside from file size, of complicated procedures like the above over simply importing page images into Acrobat? Importing (combining in Acrobat lingo) is just a few clicks and perhaps three minutes' wait for a 300-page book. Acrobat OCR is about ten minutes' wait for such a book. Surely every day that goes by makes the difference between a 20 MB output and an 8 MB output for a complete book less significant.

And while I have anyone's attention, what is the value-added of ScanTailor for DIY scanner jobs that do not need Tulon's new dewarping function? I have used the program to beautifully split two-page flatbed scans, but what's the value for single-page DIY scanner scans? Acrobat can do bulk cropping -- maybe four batch moves, two minutes' total time, are required for your standard books -- and Acrobat can perform excellent OCR on greyscale images (superior to its OCR on bitonal images). So can't using Acrobat to make books skip the ScanTailor step too? That also avoids the sometimes problematic conversion to black and white that I gather is necessary for ScanTailor's attached OCR function.

Thanks for any enlightenment.

Post by **daniel_reetz** » 04 Jan 2011, 20:35

> reggilbert wrote: Can someone explain to me the advantage, aside from file size, of complicated procedures like the above over simply importing page images into Acrobat?

Not everyone has Acrobat, can afford Acrobat, or likes to use proprietary software.

No Acrobat on Linux, period.

Acrobat is not scriptable. These things may look complicated -- and they are -- but on platforms like Linux, procedures like these automate the process completely in the form of scripts.

> reggilbert wrote: And while I have anyone's attention, what is the value-added of ScanTailor for DIY scanner jobs that do not need Tulon's new dewarping function? I have used the program to beautifully split two-page flatbed scans, but what's the value for single-page DIY scanner scans? Acrobat can do bulk cropping -- maybe four batch moves, two minutes' total time, are required for your standard books -- and Acrobat can perform excellent OCR on greyscale images (superior to its OCR on bitonal images). So can't using Acrobat to make books skip the ScanTailor step too? That also avoids the sometimes problematic conversion to black and white that I gather is necessary for ScanTailor's attached OCR function.
>
> Thanks for any enlightenment.

Same answers as above, except that Scan Tailor can also remove lighting artifacts, automatically add margins, and deal very neatly with two-camera scanner output.

Second answer: Different strokes for different folks. Some people here do no post-processing. Some people do a lot. Sure, you can batch crop, but that's not everything. Scan Tailor measurably improves book scanning across the board. If you have an Acrobat-based workflow, that's great; you should post a separate thread describing your procedure for people who have Acrobat, can afford it, or want to use it.

Post by **daniel_reetz** » 04 Jan 2011, 20:41

Ahmad, this is really cool work, I hope to re-use your commands on Ubuntu... have you seen Misty's program? http://diybookscanner.org/forum/viewtop ... 7300#p7138

Post by **rob** » 04 Jan 2011, 21:06

That's extremely awesome, ahmad! I'll definitely plan to use this, but since I have Acrobat, I'll probably end up replacing the recompression part with Adobe's ClearScan. It does basically what djvu with a book-wide dictionary does (replaces near-similar letters with an index into a table). I've been able to get PDFs down to the same size as a djvu image that way.

Anyway, I've stickied this post because it's so useful.

Post by **strider1551** » 04 Jan 2011, 21:37

Great work, ahmad. It looks very elegant.

> rob wrote: I'll probably end up replacing the recompression part with Adobe's ClearScan. It does basically what djvu with a book-wide dictionary does (replaces near-similar letters with an index into a table). I've been able to get PDFs down to the same size as a djvu image that way.

I thought that ClearScan created a custom vectored font? I suppose in an abstract way that's similar to a shared dictionary, but going from raster to vectored is a huge difference... kinda like how apples and oranges are both good snacks.

Post by **rob** » 04 Jan 2011, 22:07

Oh, you are right about that -- I forgot. That means the document is scalable with little loss in legibility. As ceeann said in another thread, useful for those of us who are going to get old before age-related macular degeneration can be fixed.

Post by **E^3** » 05 Jan 2011, 01:29

Hello folks..

Nice, Ahmad, great work. I am now starting to embed it in my Gambas script.

Thanks

E^3

Post by **Anonymous1** » 12 Jan 2011, 01:23

> strider1551 wrote: Great work, ahmad. It looks very elegant.
>
> > rob wrote: I'll probably end up replacing the recompression part with Adobe's ClearScan. It does basically what djvu with a book-wide dictionary does (replaces near-similar letters with an index into a table). I've been able to get PDFs down to the same size as a djvu image that way.
>
> I thought that ClearScan created a custom vectored font? I suppose in an abstract way that's similar to a shared dictionary, but going from raster to vectored is a huge difference... kinda like how apples and oranges are both good snacks.

Whoa. Is there any other software to do that? I was thinking about scripting this, but I had no idea that it already existed. Wow. So is each letter a copy of a master letter, or is it not that complicated? I didn't know that ClearScan did this...

Post by **JonEP** » 12 Jan 2011, 10:43

Dear Ahmad, and all of you out there who are capable scripters,

Thanks for working on this. Is it possible, I wonder, that this work might eventually lead to some sort of GUI-based program that could be run on a Windows machine, somehow? Or are scripts like this destined to remain in the linux world? I wish I were able to make sense of, and use these tools, install linux as a dual boot on my computer, etc., but alas, I'll never be able to justify the time it would take to get to that level.

Re: reggilbert's question:

> Can someone explain to me the advantage, aside from file size, of complicated procedures like the above over simply importing page images into Acrobat?

The file size issue is a really big one if you go down the Acrobat road, especially if you are outputting "mixed" image and text from Scan Tailor at 600 dpi. In that case, you are talking about a 250-page book taking up hundreds of MB. It becomes not just a hard-disk storage problem but also a portability problem (want to upload one of those big files onto a tablet computer using a 3G connection?).

DIY Book Scanner
