My workflow for almost djvubind-equivalent PDFs...

Gerard · Post by **Gerard** » 12 Jan 2011, 11:07

You could try andLinux or coLinux, which are Linux environments that run inside Windows. Cygwin may also be an option.

Anonymous1 · Post by **Anonymous1** » 12 Jan 2011, 11:16

JonEP wrote:Dear Ahmad, and all of you out there who are capable scripters,

Thanks for working on this. Is it possible, I wonder, that this work might eventually lead to some sort of GUI-based program that could be run on a Windows machine, somehow? Or are scripts like this destined to remain in the linux world? I wish I were able to make sense of, and use these tools, install linux as a dual boot on my computer, etc., but alas, I'll never be able to justify the time it would take to get to that level.

RE: Regilbert's question:
Can someone explain to me the advantage, aside from file size, of complicated procedures like the above over simply importing page images into Acrobat?
The file size issue is a really big one, were you to go down the Acrobat road. Especially if you are outputting "mixed" image and text from Scan Tailor, at 600 dpi. In that case, you are talking about a 250 page book taking up hundreds of MB. It becomes not just a hard-disk storage problem, but an entire issue for transportability (want to upload one of those big files onto a tablet computer using a 3G connection?)

Well, I've been hacking together a GUI for this sort of stuff for about a week now. You can check out the development thread here: http://www.diybookscanner.org/forum/vie ... ?f=3&t=794

I've got djvubind fully implemented, and I think ahmad's PDF approach will be what I'll use for PDF binding. Thanks ahmad

Anonymous1 · Post by **Anonymous1** » 12 Jan 2011, 11:17

Gerard wrote: You could try andLinux or coLinux, which are Linux environments that run inside Windows. Cygwin may also be an option.

I think all of this can run on Windows. I don't use Windows often, but all of the software has Windows binaries, so the only limitation is Command Prompt. Python can fix that, though.
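As a rough illustration of how Python could stand in for Command Prompt here, the sketch below wraps the pdfsizeopt call that ahmad posted in this thread. The function names and the default `script` path are hypothetical, and it assumes the running interpreter is one that pdfsizeopt supports; only the three flags and the `merged.pdf` argument come from the thread.

```python
import subprocess
import sys


def build_pdfsizeopt_command(pdf_path, script="pdfsizeopt.py"):
    """Return the argument list for the pdfsizeopt call posted in this thread.

    `script` is an assumed location for pdfsizeopt.py; adjust it to wherever
    the script actually lives on your machine.
    """
    return [
        sys.executable,             # the Python interpreter running this wrapper
        script,
        "--use-pngout=false",       # flags copied from ahmad's command
        "--use-jbig2=true",
        "--use-multivalent=false",
        pdf_path,
    ]


def recompress(pdf_path, script="pdfsizeopt.py"):
    """Run pdfsizeopt on pdf_path; raises CalledProcessError if it fails."""
    subprocess.check_call(build_pdfsizeopt_command(pdf_path, script))
```

Calling `recompress("merged.pdf")` would then behave the same on Windows and Linux, because the arguments are passed as a list rather than through a shell.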

zamacam · Post by **zamacam** » 13 Jan 2011, 12:48

rob wrote:That's extremely awesome, ahmad! I'll definitely plan to use this, but since I have Acrobat, I'll probably end up replacing the recompression part with Adobe's ClearScan.

Well, if you use ClearScan, I think you will lose your original text recognition. Acrobat will replace it with its own. This is a pity because Adobe's OCR is not as good as dedicated OCR software (at least in French, Adobe's OCR does an awful job).

An "ideal" solution could be to merge ClearScan's output with the hidden text layer of a PDF generated by dedicated OCR software. Reading ahmad's message, I think it is possible, but I don't know how yet.

zamacam · Post by **zamacam** » 13 Jan 2011, 12:54

ahmad wrote: Re-compress our PDF's images using pdfsizeopt (found at http://code.google.com/p/pdfsizeopt/)
python pdfsizeopt.py --use-pngout=false --use-jbig2=true --use-multivalent=false merged.pdf

ahmad, just curious: can you see a loss in image quality when you use JBIG2 compression with your program? I'm asking because I can see one when I use it in Acrobat X, even if I choose "lossless" compression.

ahmad · Post by **ahmad** » 13 Jan 2011, 16:52

Wow Anonymous, that GUI is looking awesome!

zamacam wrote:
ahmad wrote: Re-compress our PDF's images using pdfsizeopt (found at http://code.google.com/p/pdfsizeopt/)
python pdfsizeopt.py --use-pngout=false --use-jbig2=true --use-multivalent=false merged.pdf
ahmad, just curious: can you see a loss in image quality when you use JBIG2 compression with your program? I'm asking because I can see one when I use it in Acrobat X, even if I choose "lossless" compression.

I'm assuming your input images are already bitonal?

The actual conversion seems lossless in my case, but some viewers (e.g. Evince) don't seem to resample the JBIG2 images very well. I have to use acroread to see a 'readable' image.
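If you want to check both points rather than eyeball them, here is a minimal sketch. It assumes you can already get each page as a flat sequence of 8-bit grayscale values (for example via Pillow's `Image.open(path).convert("L").getdata()`, which is not part of this workflow and is only one possible way to do it). The function names are hypothetical.

```python
def is_bitonal(pixels):
    """True if every grayscale sample is pure black (0) or pure white (255)."""
    return set(pixels) <= {0, 255}


def is_lossless(before, after):
    """True if the decoded pixels are identical before and after recompression."""
    return list(before) == list(after)
```

If `is_bitonal` is false for the input, the image was presumably thresholded somewhere along the way, and any quality difference may come from that step rather than from JBIG2 itself.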

zamacam · Post by **zamacam** » 13 Jan 2011, 18:02

ahmad wrote: I'm assuming your input images are already bitonal?

The actual conversion seems lossless in my case, but some viewers (e.g. Evince) don't seem to resample the JBIG2 images very well. I have to use acroread to see a 'readable' image.

Yes, you are right. I tried to shrink PDF files generated by ABBYY FineReader for Mac. As input to FineReader, I'm using bitonal files generated by Scan Tailor. This gives me nice PDF files with high OCR accuracy and a reasonable size, but I read on this forum that I can shrink these files a little more, since FineReader seems to use CCITT Group 4 compression.

I'm not sure about the viewer problem, since I mostly use Preview on Mac OS X 10.6. I don't remember having a problem viewing JBIG2 files from other sources with it. But thanks for the hint. I will try other viewers to be sure.

I'm very interested in your workflow. It's nice and easy to set up. My only concern is OCR accuracy. Are you pleased with Cuneiform's accuracy? Did you need to fix many mistakes to get good results?

Shaknum · Post by **Shaknum** » 13 Jan 2011, 21:38

zamacam wrote: An "ideal" solution could be to merge ClearScan's output with the hidden text layer of a PDF generated by dedicated OCR software. Reading ahmad's message, I think it is possible, but I don't know how yet.

I fooled around with this for a while a few months back, but got nowhere. It seems the ClearScan portions of the page are not an image, and I couldn't figure out how to separate the vectorized ClearScan "glyphs" from the OCRed text layer. I admit I don't know much about these things and would absolutely love to hear about any progress you make on that front. It really would be a benefit: the ClearScan page is tiny and mildly to far superior in image quality. It's just that darn untrainable OCR engine that stands in the way of a perfect solution.

La_Tristesse · Post by **La_Tristesse** » 09 Aug 2011, 12:12

Does anyone use pdfsizeopt on Mac OS X? I ran into some trouble with the jbig2enc binary I compiled from Misty's Homebrew formula. See the issue for further information: http://code.google.com/p/pdfsizeopt/issues/detail?id=49

Misty · Post by **Misty** » 09 Aug 2011, 12:50

I'll take a look and see if I can figure anything out.
