PDFBeads - Convert Scanned Images to a Single PDF File

loyukfai · Post by **loyukfai** » 29 Jun 2011, 01:49

knappen wrote: Could someone please give an example of the command line I should write to simply convert a folder of Scan Tailor converted files with text&images into a compressed PDF file?

I just ran "pdfbeads * > abc.pdf", without the quotes, inside the directory with the TIF files, and that's it...
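For anyone who prefers to script this step, the one-liner above could be wrapped roughly as follows. This is only a sketch: it assumes `pdfbeads` is on the PATH and writes the PDF to standard output, as the redirect in the command implies. The helper names `select_tiffs` and `run_pdfbeads` are made up for illustration and are not part of PDFBeads. Unlike `*`, the sketch passes only the TIFF files, so a leftover `abc.pdf` in the folder is not fed back in.

```python
import subprocess
from pathlib import Path


def select_tiffs(names):
    """Keep only TIFF file names, sorted so pages stay in name order."""
    return sorted(n for n in names if n.lower().endswith((".tif", ".tiff")))


def run_pdfbeads(image_dir, output_pdf):
    """Rough equivalent of `pdfbeads * > abc.pdf`, run inside image_dir."""
    folder = Path(image_dir)
    tiffs = select_tiffs(p.name for p in folder.iterdir())
    # Assumption: pdfbeads prints the PDF to stdout, so capture it in a file.
    with open(output_pdf, "wb") as out:
        subprocess.run(["pdfbeads", *tiffs], cwd=folder, stdout=out, check=True)
```

Calling `run_pdfbeads("scans/out", "abc.pdf")` (the folder name is a placeholder) would then write `abc.pdf` to the current directory. On Windows the `pdfbeads` launcher may be a batch file, in which case the `subprocess.run` call might need adjusting.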

knappen · Post by **knappen** » 29 Jun 2011, 11:32

Thanks!

I ran the command loyukfai gave and ended up with a PDF file that was in fact a lot smaller than the result I got using Acrobat Pro on the same folder.

BUT: The images also look A LOT more compressed and blurred. Not sure if this is because the batch of TIFF images was a mix of greyscale/mixed/b&w encodings from Scan Tailor.

I still get the "Warning: the hpricot extension is not available. I'll not be able to create hidden text layer from hOCR files" message at the start, and during the encoding process these messages show up a lot:
"TIFFetchCirectory; TIFFstream: Can not read Tiff directory.
TIFFReadDirectory: TIFFStream: Failed to read directory at offset 0.
Error in findTiffCompression: tif not opened"

loyukfai · Post by **loyukfai** » 29 Jun 2011, 12:25

The hpricot thing is related to OCR, I think the OP has mentioned that.

In short, the scanned pages are just images, which cannot be searched. OCR recognizes the characters, and the recognized text is combined with the images to make the final PDF "searchable".

I'm primarily focused on making a portable version of PDFBeads right now and haven't yet looked into the OCR thing.

Cheers.

knappen · Post by **knappen** » 29 Jun 2011, 12:34

Thanks again.

I'm less concerned that I have to do the OCR scan with another program than with the fact that I get such low quality pictures in the PDF. Is there a way to manually choose the level of compression?

A portable version of PDFBeads sounds great!

Misty · Post by **Misty** » 29 Jun 2011, 12:48

There may be. One of the goals of PDFBeads is to produce smaller PDFs, so it might be defaulting to a lower image resolution.

You can see the options by typing pdfbeads --help

The -B option chooses the resolution for images. Try experimenting with various options to see what looks good to you. According to the documentation, the default value is 300 (is that right?), so try other values like -B 400 or -B 600.
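The experiment suggested here could be automated along these lines. It is a hedged sketch, not a tested recipe: `-B` is used only as described in this thread (check `pdfbeads --help` for the exact meaning), `pdfbeads` is assumed to be on the PATH and to write the PDF to standard output, and the function names and the `test_B<value>.pdf` naming scheme are invented for illustration.

```python
import subprocess
from pathlib import Path


def resolution_command(tiffs, resolution):
    """Build the argv for one trial; -B is the option described in the thread."""
    return ["pdfbeads", "-B", str(resolution), *tiffs]


def try_resolutions(image_dir, tiffs, resolutions=(300, 400, 600)):
    """Write one PDF per resolution and return {file name: size in bytes}."""
    folder = Path(image_dir)
    sizes = {}
    for res in resolutions:
        target = folder / f"test_B{res}.pdf"  # hypothetical naming scheme
        with open(target, "wb") as out:
            subprocess.run(resolution_command(tiffs, res),
                           cwd=folder, stdout=out, check=True)
        sizes[target.name] = target.stat().st_size
    return sizes
```

Comparing the returned sizes against how each PDF looks on screen gives a quick way to pick a value between the default and 600.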

knappen wrote: I still get the "Warning: the hpricot extension is not available. I'll not be able to create hidden text layer from hOCR files" message in the start and during the encoding process

This means you don't have the hpricot gem installed. You need it if you want a hidden OCR text layer; otherwise the warning just means PDFBeads will bug you about it.

"TIFFetchCirectory; TIFFstream: Can not read Tiff directory.
TIFFReadDirectory: TIFFStream: Failed to read directory at offset 0.
Error in findTiffCompression: tif not opened"
show up a lot.

Are you using the RubyPDF build of jbig2enc? That's a bug in it - it will work fine, it'll just give you error messages.

knappen · Post by **knappen** » 29 Jun 2011, 14:19

I'm using the exact same software as suggested in the YouTube instruction video. I can't recall the jbig2enc being a RubyPDF build.

I tried -B 600 and the result is indeed something completely different. The size was tripled compared to the default value, but still a tad smaller than with Acrobat Pro. Will try some lower values too and see what I prefer.

I'll try to be a little more independent from now on, but there is one thing that I would really like to get your opinion on: What is the best choice in Scan Tailor to get quality results in PDFBeads for pages with both text and images? With mixed mode you get crisp b/w text but an overexposed image; with greyscale you lose no image quality but get text that is impossible to convert to vector graphic OCR; the white margins + equalized illumination option is a compromise that is not 100% satisfactory...

Would PDFBeads be a program that offers a better solution to this?

Misty · Post by **Misty** » 29 Jun 2011, 14:49

I just checked, and it is RubyPDF's build that is recommended in that video. You can safely ignore the error messages; they might be a little irritating, but they're harmless.
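If the repeated messages make it hard to spot real problems in a scripted run, they could be filtered out of the captured error output. The sketch below is an assumption-laden illustration: the marker strings are copied from the messages quoted in this thread, and `drop_known_noise` is a hypothetical helper, not something PDFBeads or jbig2enc provides.

```python
# Fragments of the messages quoted earlier in this thread.
HARMLESS_MARKERS = (
    "Can not read Tiff directory",
    "Failed to read directory at offset 0",
    "Error in findTiffCompression",
)


def drop_known_noise(stderr_text):
    """Return stderr_text without the lines that match the quoted messages."""
    kept = [
        line for line in stderr_text.splitlines()
        # Case-insensitive match, since the quoted capitalisation may be inexact.
        if not any(marker.lower() in line.lower() for marker in HARMLESS_MARKERS)
    ]
    return "\n".join(kept)
```

Anything that survives the filter is worth reading, since it is not one of the messages described here as harmless.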

-B 600 is probably overkill unless you're viewing your PDFs at a pretty high res. Experiment, I'm sure you'll find a value that looks good and still gets you a decent filesize.

PDFBeads isn't intended as a binarizer - it's intended to work on images processed via Scan Tailor. Are you finding that Scan Tailor's lightening of your images is overzealous? Can you post an example, and what you'd like it to look like?

knappen · Post by **knappen** » 29 Jun 2011, 16:00

I'm talking about effects that you get in mixed mode like the one in the right bottom corner. I would obviously like it to be darker and not washed out.

I remember Tulon answering this in a Scan Tailor thread and saying that it would be difficult to solve the problem.

loyukfai · Post by **loyukfai** » 30 Jun 2011, 06:59

@knappen: Just FYI, your image is unviewable here.

@Misty: Off-topic... Are you not going to develop PDFMaker any further? Cause that's a native Windows app and could be much easier to make a portable version...

Cheers.

Misty · Post by **Misty** » 04 Jul 2011, 10:24

@loyukfai: Sorry, but I'm not planning to develop PDFMaker in its current incarnation. It was a pretty simple, slightly hacky script that I wrote before I knew any real programming. It was designed for Windows because I made it to scratch an itch at my Windows-centric employer; my current work environment is multi-platform, and I'm working in OS X. If I were to pick up PDFMaker again, it would be to rewrite it using a different programming language - and right now, that would probably be Ruby, which PDFBeads seems to have covered nicely.

One major area where I can see PDFBeads improving is installation on Windows. Coming from a Mac OS X environment, where Ruby is installed by default and dev tools are readily available, I wasn't aware how much harder it would be to get going on Windows. I'll see if I can help out getting a portable version of PDFBeads working.
