PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 14 Nov 2010, 14:23
by Lazy_Kent
Alexey Kryukov wrote a Ruby program called PDFBeads.
http://rubyforge.org/projects/pdfbeads/

It uses JBIG2 and JPEG2000 encoding. The output PDF file is very small.
Example (301 pages, 4.1 MB): http://narod.ru/disk/27466236000/network.pdf.html

Optionally, PDFBeads adds an hOCR text layer to the PDF.

You need Ruby with RubyGems, ImageMagick, and jbig2enc.

On Linux:

Code: Select all

gem install rmagick
gem install pdfbeads

On Windows, install the Windows versions of these programs.

For OCR, put an hOCR file (*.html or *.hocr) for every scan into the same directory as the scans. You also need to install hpricot.

Manual in Russian only:
http://rubyforge.org/docman/view.php/97 ... beads.html

Re: PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 29 May 2011, 00:05
by lupocos
hello!
I managed to install pdfbeads as well as cuneiform on Windows.
Both are working, but I cannot get pdfbeads to add the hOCR layer to the final PDF (i.e., to make it searchable).
I have also installed hpricot ruby gem correctly.
Did anybody have any success in creating searchable PDF via cuneiform + pdfbeads on Windows?
I can only create the hOCR files with Cuneiform on the one hand, and the PDF file (encoded in jbig2) with pdfbeads, but I cannot join them in a single searchable PDF...

PS: by the way, I'm using the binary of cuneiform 1.1.0 which can be found inside this Windows program: CuneiDjvu
http://www.djvu-soft.narod.ru/scan/cuneidjvu.htm (Russian, as usual... use Google Translate)

Thanks for all your help,
Cosimo


EDIT:
Finally I managed to join the hOCR text layer and the tiff images in a PDF using pdfbeads!
I simply had to ensure that the .html and .tif files had exactly the same base name (e.g., image_001.tif --> image_001.html) and were in the same directory.

pdfbeads rocks! :D
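
The naming rule from the edit above can be checked before running pdfbeads. What follows is only a minimal sketch, not part of PDFBeads: `missing_hocr` is a hypothetical helper name, and it assumes the scans use the `.tif` extension and that the hOCR files end in `.html` or `.hocr`, as mentioned earlier in the thread.

```shell
# missing_hocr DIR: print every .tif in DIR that has no same-named
# .html or .hocr file next to it. (Hypothetical helper, not part of PDFBeads.)
missing_hocr() {
  dir=${1:-.}
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue          # the glob matched nothing
    base=${tif%.tif}                   # strip the extension
    [ -e "$base.html" ] || [ -e "$base.hocr" ] || printf '%s\n' "$tif"
  done
}
```

Running `missing_hocr .` in the scan directory should print nothing when every image has its hOCR partner; any file it lists would end up without a text layer.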

Re: PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 31 May 2011, 18:40
by daniel_reetz
I'm glad the solution is so uncomplicated. Thanks for sharing it back with us, lupocos.

Re: PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 04 Jun 2011, 19:23
by seasalt
Has anyone got PDFBeads working in a Mac environment and can post the steps (for a non-technical person)?
Thank you in advance.

Re: PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 07 Jun 2011, 11:36
by Misty
1. Install Xcode from your OS X install DVD, if you don't already have Xcode.

2. Install Homebrew using the instructions from https://github.com/mxcl/homebrew/wiki/installation
The installer can be run by pasting the following into your terminal and hitting enter:

Code: Select all

/usr/bin/ruby -e "$(curl -fsSL https://raw.github.com/gist/323731)"

3. Use Homebrew to install the extra tools you need:

Code: Select all

brew install imagemagick tesseract jbig2enc

This will install PDFBeads' dependencies, as well as the Tesseract OCR engine, which can produce hOCR that is compatible with PDFBeads.

4. Run the following command to install PDFBeads and the gems it requires:

Code: Select all

gem install pdfbeads hpricot rmagick

If you're using the version of Ruby that comes with the OS, you will need to use 'sudo' to install. This will require a password. The command in that case should be:

Code: Select all

sudo gem install pdfbeads hpricot rmagick
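
After these steps it may help to confirm that the tools actually ended up on your PATH. This is a rough sketch only: `missing_tools` is a made-up helper name, and the executable names used in the usage note (`convert` for ImageMagick, `jbig2` for jbig2enc) are assumptions about how those packages name their binaries.

```shell
# missing_tools NAME...: print each NAME that cannot be found on PATH.
# (Hypothetical helper; it only looks the commands up, it does not run them.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, `missing_tools convert tesseract jbig2 pdfbeads` should print nothing if everything from steps 3 and 4 is installed and reachable.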

Re: PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 08 Jun 2011, 02:28
by seasalt
many thanks Misty - I'm further along but not all the way.

2. Install Homebrew
Per the linked install instructions, I used Terminal to make sure I was in /usr/local
message received: successfully installed

3. In terminal, typed
gem update

and received error messages after a bit of processing:
Updating installed gems
Updating acts_as_ferret
WARNING: Installing to ~/.gem since /Library/Ruby/Gems/1.8 and
/usr/bin aren't both writable.
WARNING: ..../.gem/ruby/1.8/bin in your PATH, gem executables will not run.
ERROR: Error installing acts_as_ferret: bundler requires RubyGems version >= 1.3.6
Updating arel
Successfully installed arel-2.1.1
Updating builder
Successfully installed builder-3.0.0
Updating erubis
Successfully installed erubis-2.7.0
Updating mail
Successfully installed mail-2.3.0
Updating rack
Successfully installed rack-1.3.0
Updating rack-mount
Successfully installed rack-mount-0.8.1
Updating rack-test
Successfully installed rack-test-0.6.0
Updating rails
ERROR: Error installing rails:
bundler requires RubyGems version >= 1.3.6
Gems updated: arel, builder, erubis, mail, rack, rack-mount, rack-test
Installing ri documentation for arel-2.1.1...
Installing ri documentation for builder-3.0.0...
ERROR: While generating documentation for builder-3.0.0
... MESSAGE: Unhandled special: Special: type=17, text="<!-- HI -->"
... RDOC args: --ri --op /Users/..../.gem/ruby/1.8/doc/builder-3.0.0/ri --title Builder -- Easy XML Building --main README.rdoc --line-numbers --quiet lib CHANGES Rakefile README README.rdoc TAGS doc/releases/builder-1.2.4.rdoc doc/releases/builder-2.0.0.rdoc doc/releases/builder-2.1.1.rdoc --title builder-3.0.0 Documentation
(continuing with the rest of the installation)
Installing ri documentation for erubis-2.7.0...

then continues until the end....

So I am not sure what the impact is...
I have not tried next step of "gem install pdfbeads hpricot rmagick" in terminal window

any ideas?
on 10.6x

(plus bandwidth usage was very high -- 1 GB used up -- is this the Homebrew install?)

Re: PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 08 Jun 2011, 10:17
by Misty
Sorry about that. I'd forgotten that the default gem folder in Mac OS X requires superuser (sudo) access to write to. The error messages you showed are actually not very serious, even though they might sound bad.

You can go ahead with the rest of step 3, but use the command:

Code: Select all

sudo gem install pdfbeads hpricot rmagick

You will have to give it your password. This will install the required gems in a location where you can run them. Follow the rest of the steps as written.

Edit: Just found a trick I was not familiar with. To install jbig2enc without waiting for Homebrew to officially add it, use the following command:

Code: Select all

brew install https://raw.github.com/mistydemeo/homebrew/0c3427ee1e9be6aaaed5a15f8d0d6e63d610d2f1/Library/Formula/jbig2enc.rb
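
On the "bundler requires RubyGems version >= 1.3.6" errors in the earlier log: they indicate that the installed RubyGems is older than 1.3.6. A small sketch for comparing dotted version numbers in the shell follows; `version_ge` is a hypothetical helper and it assumes purely numeric components (no suffixes such as "pre").

```shell
# version_ge A B: succeed (exit 0) when dotted version A is >= dotted version B.
# Missing components count as 0, so "1.3" is treated as "1.3.0".
version_ge() {
  a=$1
  b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}                         # leading component of A
    y=${b%%.*}                         # leading component of B
    [ -n "$x" ] || x=0
    [ -n "$y" ] || y=0
    if [ "$x" -gt "$y" ]; then return 0; fi
    if [ "$x" -lt "$y" ]; then return 1; fi
    # drop the component just compared
    case $a in *.*) a=${a#*.} ;; *) a= ;; esac
    case $b in *.*) b=${b#*.} ;; *) b= ;; esac
  done
  return 0
}
```

With that defined, something like `version_ge "$(gem --version)" 1.3.6 || echo "RubyGems is older than 1.3.6"` would report whether the installed RubyGems meets the requirement.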

Re: PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 08 Jun 2011, 21:51
by seasalt
thanks Misty

I entered:
sudo gem install pdfbeads hpricot rmagick

and then password
and ruby looks to be sorted now

then continued step 4:
entered:
brew update

result:
Please `brew install git' first.

... yesterday I got confirmation message installed successfully for brew.

ideas?

Re: PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 09 Jun 2011, 12:09
by Misty
Homebrew was installed, but it looks like it doesn't install git by default, which it needs in order to run "brew update". Just run:

Code: Select all

brew install git

And wait for that to finish. You'll then be able to brew update without a problem, and finish the steps.

Re: PDFBeads — Convert Scanned Images to a Single PDF File

Posted: 09 Jun 2011, 22:51
by seasalt
okee... we are a little further ... and 2 error messages

I typed in:
brew install git

got several messages that looked like a successful install,
e.g.
==> Downloading http://kernel.org/pub/software/scm/git/ ... .7.5.4.tar.
######################################################################## 100.0%
==> Caveats

then....
Bash completion has been installed to:
/usr/local/etc/bash_completion.d

Emacs support has been installed to:
/usr/local/Cellar/git/1.7.5.4/share/doc/git-core/contrib/emacs

The rest of the "contrib" has been installed to:
/usr/local/Cellar/git/1.7.5.4/share/contrib
Error: The linking step did not complete successfully
The formula built, but is not symlinked into /usr/local
You can try again using `brew link git'
Error: Permission denied - /usr/local/etc/bash_completion.d
==> Summary
/usr/local/Cellar/git/1.7.5.4: 1062 files, 19M, built in 80 seconds

ideas?
(thank you for all your help, Misty)