My workflow for almost djvubind-equivalent PDFs...

Gerard
Posts: 154
Joined: 17 Oct 2010, 07:15
Number of books owned: 0
Location: Berlin (Germany)

Re: My workflow for almost djvubind-equivalent PDFs...

Post by Gerard »

You could try andLinux or coLinux, which provide a Linux environment inside Windows. Cygwin may also be an option.
Anonymous1

Re: My workflow for almost djvubind-equivalent PDFs...

Post by Anonymous1 »

JonEP wrote:Dear Ahmad, and all of you out there who are capable scripters,

Thanks for working on this. Is it possible, I wonder, that this work might eventually lead to some sort of GUI-based program that could be run on a Windows machine, somehow? Or are scripts like this destined to remain in the Linux world? I wish I were able to make sense of and use these tools, install Linux as a dual boot on my computer, etc., but alas, I'll never be able to justify the time it would take to get to that level.

RE: Regilbert's question:
Can someone explain to me the advantage, aside from file size, of complicated procedures like the above over simply importing page images into Acrobat?
The file size issue is a really big one if you go down the Acrobat road, especially if you are outputting "mixed" image and text from Scan Tailor at 600 dpi. In that case, a 250-page book takes up hundreds of MB. It becomes not just a hard-disk storage problem but a portability problem too (want to upload one of those big files onto a tablet computer over a 3G connection?).
Well, I've been hacking together a GUI for this sort of stuff for about a week now. You can check out the development thread here: http://www.diybookscanner.org/forum/vie ... ?f=3&t=794

I've got djvubind fully implemented, and I think ahmad's PDF approach will be what I'll use for PDF binding. Thanks ahmad ;)
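
For a rough sense of the sizes mentioned in the quote above, here is a small back-of-the-envelope sketch. It only computes raw, uncompressed pixel data; the 6 x 9 inch page size is an assumption made for illustration and is not from the thread, and real PDFs will be smaller by an amount that depends on the compression used.

```python
# Rough, uncompressed size estimate for scanned pages.
# ASSUMPTION: 6 x 9 inch pages (illustrative only, not from the thread).

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one page image in bytes."""
    pixels = int(width_in * dpi) * int(height_in * dpi)
    return pixels * bits_per_pixel // 8

def book_megabytes(pages, width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a whole book in MB (10**6 bytes)."""
    return pages * raw_page_bytes(width_in, height_in, dpi, bits_per_pixel) / 1e6

# 250 pages at 600 dpi, as in the quote above.
bitonal_mb = book_megabytes(250, 6, 9, 600, 1)  # 1 bit per pixel
gray_mb = book_megabytes(250, 6, 9, 600, 8)     # 8 bits per pixel
print(f"bitonal: {bitonal_mb:.1f} MB, 8-bit gray: {gray_mb:.1f} MB")
```

The gap between the two numbers is why "mixed" pages with grayscale or color regions grow so quickly, and why the choice of compressor matters so much at 600 dpi.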
Anonymous1

Re: My workflow for almost djvubind-equivalent PDFs...

Post by Anonymous1 »

Gerard wrote: You could try andLinux or coLinux, which provide a Linux environment inside Windows. Cygwin may also be an option.
I think all of this can run on Windows. I don't use Windows often, but all of the software has Windows binaries, so the only limitation is Command Prompt. Python can fix that, though.
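
As a minimal sketch of what "Python can fix that" might look like, the snippet below wraps the pdfsizeopt command that ahmad posted in this thread so it can be launched the same way on Windows or Linux. The function names are hypothetical, and it assumes that `python` on the PATH can run `pdfsizeopt.py` and that the script sits in the current directory.

```python
import subprocess

def pdfsizeopt_command(pdf_path, python="python", script="pdfsizeopt.py"):
    """Build ahmad's pdfsizeopt command line as an argument list.

    Passing a list (instead of one shell string) avoids quoting
    differences between Command Prompt and Unix shells.
    """
    return [
        python,
        script,
        "--use-pngout=false",
        "--use-jbig2=true",
        "--use-multivalent=false",
        pdf_path,
    ]

def recompress(pdf_path):
    """Run pdfsizeopt; raises CalledProcessError if it exits with an error."""
    subprocess.run(pdfsizeopt_command(pdf_path), check=True)
```

Calling `recompress("merged.pdf")` would then run the same step on either platform, provided the external tools pdfsizeopt relies on are installed.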
zamacam
Posts: 20
Joined: 04 Mar 2014, 00:53

Re: My workflow for almost djvubind-equivalent PDFs...

Post by zamacam »

rob wrote:That's extremely awesome, ahmad! I'll definitely plan to use this, but since I have Acrobat, I'll probably end up replacing the recompression part with Adobe's ClearScan.
Well, if you use ClearScan, I think you will lose your original text recognition. Acrobat will replace it with its own. This is a pity because Adobe's OCR is not as good as dedicated OCR software (at least in French, where Adobe's OCR does an awful job).

An "ideal" solution could be to merge ClearScan's output with the hidden text layer of a PDF generated by dedicated OCR software. Reading ahmad's message, I think it is possible, but I don't know how yet.
Last edited by Anonymous on 13 Jan 2011, 12:55, edited 1 time in total.
zamacam
Posts: 20
Joined: 04 Mar 2014, 00:53

Re: My workflow for almost djvubind-equivalent PDFs...

Post by zamacam »

ahmad wrote: Re-compress our PDF's images using pdfsizeopt (found at http://code.google.com/p/pdfsizeopt/)

python pdfsizeopt.py --use-pngout=false --use-jbig2=true --use-multivalent=false merged.pdf
ahmad, just curious: can you see a loss in image quality when you use JBIG2 compression with your program? I'm asking because I can see one when I use it in Acrobat X, even if I choose "lossless" compression.
ahmad
Posts: 24
Joined: 28 Dec 2010, 11:26

Re: My workflow for almost djvubind-equivalent PDFs...

Post by ahmad »

Wow Anonymous, that GUI is looking awesome!
zamacam wrote:
ahmad wrote: Re-compress our PDF's images using pdfsizeopt (found at http://code.google.com/p/pdfsizeopt/)

python pdfsizeopt.py --use-pngout=false --use-jbig2=true --use-multivalent=false merged.pdf
ahmad, just curious: can you see a loss in image quality when you use JBIG2 compression with your program? I'm asking because I can see one when I use it in Acrobat X, even if I choose "lossless" compression.
I'm assuming your input images are already bitonal?

The actual conversion seems lossless in my case, but some viewers (e.g. Evince) don't seem to resample the JBIG2 images very well. I have to use acroread to see a "readable" image.
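
One way to separate a lossy conversion from a viewer's resampling is to compare the decoded pixels before and after recompression instead of judging by eye. The sketch below assumes you have already decoded a page image from each PDF into a flat sequence of pixel values with a tool of your choice; the function names and sample data are hypothetical.

```python
def is_bitonal(pixels):
    """True if the image uses at most two distinct pixel values."""
    return len(set(pixels)) <= 2

def count_differences(before, after):
    """Number of pixel positions whose values differ between two images."""
    if len(before) != len(after):
        raise ValueError("images have different sizes")
    return sum(1 for a, b in zip(before, after) if a != b)

# Tiny made-up example: 0 = white, 1 = black.
original = [0, 0, 1, 1, 0, 1]
recompressed = [0, 0, 1, 1, 0, 1]   # identical pixels: lossless
altered = [0, 1, 1, 1, 0, 1]        # one pixel flipped: lossy

print(is_bitonal(original))
print(count_differences(original, recompressed))
print(count_differences(original, altered))
```

If the count is zero for every page, the compression itself kept all the pixels, and any visible difference comes from how the viewer scales the image.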
zamacam
Posts: 20
Joined: 04 Mar 2014, 00:53

Re: My workflow for almost djvubind-equivalent PDFs...

Post by zamacam »

ahmad wrote: I'm assuming your input images are already bitonal?

The actual conversion seems lossless in my case, but some viewers (e.g. Evince) don't seem to resample the JBIG2 images very well. I have to use acroread to see a "readable" image.
Yes, you are right. I tried to shrink PDF files generated by ABBYY FineReader for Mac. As input to FineReader, I'm using bitonal files generated by Scan Tailor. It gives me nice PDF files, with high OCR accuracy and a reasonable size, but I read in this forum that I can shrink these files a little more, since FineReader seems to use CCITT Group 4.

I'm not sure about the viewer problem, since I mostly use Preview on Mac OS X 10.6. I don't remember having a problem viewing JBIG2 files from other sources with it. But thanks for the hint. I will try other viewers to be sure.

I'm very interested in your workflow. It's nice and easy to set up. My only concern is OCR accuracy. Are you pleased with Cuneiform's accuracy? Did you need to fix many mistakes to obtain good results?
Shaknum
Posts: 91
Joined: 16 Aug 2010, 13:10

Re: My workflow for almost djvubind-equivalent PDFs...

Post by Shaknum »

zamacam wrote: An "ideal" solution could be to merge ClearScan's output with the hidden text layer of a PDF generated by dedicated OCR software. Reading ahmad's message, I think it is possible, but I don't know how yet.
I fooled around with this for a while a few months back, but got nowhere. It seems the ClearScan portions of the page are not an image, and I couldn't figure out how to separate the vectorized ClearScan "glyphs" from the OCRed text layer. I admit I don't know much about these things and would absolutely love to hear about any progress you make on that front. It really would be a benefit: the ClearScan page is tiny and mildly to far superior in image quality. It's just that darn untrainable OCR engine that stands in the way of a perfect solution.
La_Tristesse
Posts: 11
Joined: 18 Jun 2011, 21:47

Re: My workflow for almost djvubind-equivalent PDFs...

Post by La_Tristesse »

Does anyone use pdfsizeopt on Mac OS X? I ran into some trouble with the jbig2enc binary I compiled from Misty's Homebrew formula. See the issue for further information: http://code.google.com/p/pdfsizeopt/issues/detail?id=49
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: My workflow for almost djvubind-equivalent PDFs...

Post by Misty »

I'll take a look and see if I can figure anything out.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.