This guide assumes you are running a bash shell on a UNIX-ish OS. So, without further ado, here are the steps:
- Run OCR on the Scan Tailor output TIFFs using Cuneiform 0.8*. You can use Tesseract for this instead, but I find Cuneiform to be slightly more accurate with 'bad' input.
*It has to be version 0.8, as this is the last version without the new hOCR formatting, which breaks hocr2pdf compatibility: https://bugs.launchpad.net/cuneiform-linux/+bug/623438
for img in *.tif; do cuneiform -f hocr -o "$img.hocr" "$img"; done
- Create hidden-text-layer PDFs for each page using hocr2pdf
for img in *.tif; do hocr2pdf -i "$img" -o "$img.pdf" < "$img.hocr"; done
- Convert our cover page JPG into a PDF using ImageMagick. Replace <>s with your desired DPI values.
convert -density <input jpeg DPI> cover_front.jpg -filter mitchell -resample <output pdf DPI> cover_front.pdf
- Merge together our resulting PDFs
pdfconcat -o merged.pdf ./*.pdf
- Re-compress our PDF's images using pdfsizeopt (found at http://code.google.com/p/pdfsizeopt/)
python pdfsizeopt.py --use-pngout=false --use-jbig2=true --use-multivalent=false merged.pdf
- Extract metadata using pdftk
pdftk merged.pso.pdf dump_data output pdf_meta.txt
- Edit pdf_meta.txt in your editor of choice; guide to allowed parameters here (under 'A couple of infrequently used options'): http://www.linux.com/archive/feed/53701
- Merge the edited metadata back in
pdftk merged.pso.pdf update_info pdf_meta.txt output book.pdf
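Since the merge step globs every PDF in the directory, a page that silently failed OCR or PDF creation simply goes missing from the book. As a rough sanity check before merging, something like the following could be used. It is only a sketch: check_pages is a made-up helper name, and it assumes the file naming produced by the loops above (page.tif, page.tif.hocr, page.tif.pdf).

```shell
# Report every TIFF that lacks a non-empty hOCR file or page PDF.
# check_pages is a hypothetical helper, not part of any of the tools used.
# Assumes the naming from the loops above: X.tif -> X.tif.hocr -> X.tif.pdf
check_pages() {
    local img missing=0
    for img in *.tif; do
        [ -e "$img" ] || continue    # the glob matched nothing
        if [ ! -s "$img.hocr" ]; then
            echo "missing or empty: $img.hocr"
            missing=1
        fi
        if [ ! -s "$img.pdf" ]; then
            echo "missing or empty: $img.pdf"
            missing=1
        fi
    done
    # Return 0 only if every page has both outputs.
    return "$missing"
}
```

Run check_pages in the working directory after the hocr2pdf loop; it returns non-zero if anything is missing, so it can guard the merge, e.g. check_pages && pdfconcat -o merged.pdf ./*.pdf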
It took me a long time to arrive at a method that worked satisfactorily. I tried to bypass the individual PDF creation, merging and compression steps by using pdfbeads, but I couldn't get rmagick to compile on either of my systems. Out of interest, I did manage to install it on a Windows box; in terms of file size, this method seems just as good.
And, in my mind, my method wins on philosophical grounds - it's much more UNIX-like.