DIY Book Scanner

Posted: **25 Apr 2012, 11:17**

Hi all, I'm contemplating scanning all my physical books to djvu, and am practicing by converting some PDFs I have around (some of which were good scans, some not so good). I figure that would give me practice cleaning up images and creating the smallest high-quality representation possible. I'm only a few days in, but I've already made quite a few observations that I haven't seen spelled out anywhere.

I thought maybe other beginners might benefit, or maybe the advanced people here can tell me how wrong I am.

So, first topic.... I really like minidjvu over cjb2. I think the lossy output looks a little nicer, and I assume the shared dictionary across multiple pages is helping me size-wise. BUT, I had a hard time figuring out how to incorporate the mixed-mode (text-and-graphics) files into that mix. In other words, if one of the pages has a picture, I want to:

send the bitonal part through minidjvu along with all the other pages at 600dpi
send the colorful part through c44 at 120 or 300 dpi
splice in the colorful part into the correct page as the background

Well, the shared shape dictionary in the minidjvu output made this tricky (for me, anyway). I couldn't figure out how to make djvumake happy. It kept complaining that it was expecting the dictionary. So, I extracted it, too. No help. I extracted the INCL chunk from the text page and passed it in. It still complained. After a bit more trial and error, I found out it wanted the dictionary (iff) file as the INCL argument, instead of the INCL chunk from the page. Tricky...

I also later realized that just having minidjvu create an indirect djvu file saves me the step of extracting the pages and dictionaries. That was a nice discovery!

So... here's basically what I do:

Code: Select all

minidjvu -d 600 -i -r -l page*.tif indirect.djvu

# let's say page 3 has a diagram, which I've resampled to 120dpi and run through c44
# we need to get the BG44 chunk out of it
djvuextract page-003image.djvu BG44=page-003image.bg44

# now move the minidjvu output out of the way...
mv page-003.djvu page-003text.djvu

# ... and replace it with the combined text and graphics page:
djvumake page-003.djvu INFO=,,600 Sjbz=page-003text.djvu INCL=pages-001.iff BG44=page-003image.bg44

... and obviously when I'm done I convert it all to a bundled file with djvmcvt.

Does that seem like a sane way to proceed? (Don't worry, I have scripts that help out... I don't type all that stuff every time)
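A helper for the splice step might look roughly like the sketch below. The function name `splice_commands` is made up, the file names simply follow the example above, and the function only prints the commands so they can be checked before anything is run.

```bash
#!/bin/bash
# Sketch only: prints the commands needed to graft a c44 background onto one
# minidjvu page. splice_commands is a hypothetical name; the file names follow
# the example above (page-NNN.djvu, page-NNNimage.djvu, shared dictionary).

splice_commands() {
  local num=$1         # zero-padded page number, e.g. 003
  local dict=$2        # shared dictionary written by minidjvu, e.g. pages-001.iff
  local dpi=${3:-600}  # resolution of the text layer

  # pull the BG44 chunk out of the c44-encoded picture
  echo "djvuextract page-${num}image.djvu BG44=page-${num}image.bg44"
  # keep the text-only page under a new name
  echo "mv page-${num}.djvu page-${num}text.djvu"
  # rebuild the page from mask + dictionary + background
  echo "djvumake page-${num}.djvu INFO=,,${dpi} Sjbz=page-${num}text.djvu INCL=${dict} BG44=page-${num}image.bg44"
}

# Example: print the commands for page 3.
splice_commands 003 pages-001.iff
```

Once the printed commands look right, the output can be piped to `sh`.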

Some other things that confuse(d) me: For the Sjbz argument to djvumake, you can just give it a djvu file and it pulls out the Sjbz chunk automatically. (at least it seems to)! BUT you can't do the same for the BG44 argument. That's too bad. It seems logical that it should accept djvu files for any arguments like that, and just extract the relevant chunk on the fly.

Posted: **25 Apr 2012, 11:42**

A couple related observations regarding mixing text and images.

ONE: I don't see a use case for cpaldjvu. When I was first looking over the toolset I assumed I'd use it all the time. But... its output spans the background, foreground, and mask layers of the djvu output. Since I want to work with images at a lower DPI than text, I can't find a way to split-and-recombine the image and text layers if I use cpaldjvu on the images. If I just use cpaldjvu at 600 DPI, the output is huge compared to a high-quality c44/minidjvu run at 120/600 DPI. So... is cpaldjvu only useful if you do images and text at the same dpi?

TWO: When separating the mixed-mode TIFs, I'd cut down the dpi on the image portion from 600 to 120 like this:

Code: Select all

convert -opaque black input.tif -resize 20% output.jpg

... and the first few times this was fine. But eventually I got errors about the images not being an integer subsample of the text. I "fixed" this by making sure the file I started has a width and height that's a multiple of 5 (using convert -extent). Today I noticed that convert has a -resample option (yeah, I'm a total beginner!). If I used that instead of -resize, would the resolutions always match up, or would I still need to make sure I have integer-divisible input resolutions?
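The multiple-of-5 padding can be computed rather than worked out by hand. This is a minimal sketch: `round_up` is a hypothetical helper name, and the page size in the example is invented.

```bash
#!/bin/bash
# Round a pixel dimension up to the next multiple of the subsample factor.
# round_up is a hypothetical helper name.
round_up() {
  local size=$1 factor=$2
  echo $(( (size + factor - 1) / factor * factor ))
}

# Example: a 2481x3508 page with a factor of 5 (600 dpi text, 120 dpi picture)
w=$(round_up 2481 5)   # 2485
h=$(round_up 3508 5)   # 3510
echo "${w}x${h}"
# The padded size could then be passed to convert, for instance:
#   convert input.tif -background white -extent "${w}x${h}" padded.tif
```

After padding, a 20% resize lands on whole pixels (497x702 in this example), which is the property the integer-subsample check wants.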

THREE: It seems that csepdjvu has two deficiencies compared to what I do. First, it lets me adjust the c44 slices, but not the -decibel (at least, I assume it uses c44 since it has the slices parameter). I like to play with both to get the quality I want. Secondly, I didn't see a way to use the shared minidjvu dictionary with it. Is that right?

Posted: **26 Apr 2012, 15:00**

Today's topic is: callout boxes and shaded tables. I have plenty of books that have occasional call-out boxes where the text is still black, but the background is grey or some other light color. Similarly, many tables of information have blocks of rows shaded to aid the reader. How to deal with these?

The easiest thing to do is to treat those pages like a picture. I've seen plenty of PDFs where the text is nice and black, until you get to a page with a graphical element and now everything is grayscale. Ugly!

The next-easiest thing to do is to use ScanTailor's mixed-mode to block off the call-out boxes and tables as imagery. That's better, but you have to make sure the image portions are not very compressed to keep the text readable, and you have to use a high dpi on that portion for the same reason.

What I do, then, is add a post-processing step after outputting a ScanTailor mixed-mode image of the page. For a simple black-on-grey call-out box, a variant of this command does the job:

Code: Select all

convert page-007.tif -black-threshold 70% page-007x.tif

What this does is turn the text in the callout box back to pure black. This way when we separate the image into text and graphics, the text in the callout box goes with the text layer, and the image layer is just the grey box underneath. This way, you can compress the heck out of the image layer (and cut down the DPI as well), and still have beautiful 600-dpi text on top of it.
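To make the effect concrete, here is a toy per-pixel version of that thresholding for 8-bit grey values. It only illustrates the idea and is not how ImageMagick is implemented; the function name and the sample pixel values are made up.

```bash
#!/bin/bash
# Toy model of -black-threshold on one 8-bit grey value (0 = black, 255 = white).
# Values below the threshold become pure black; everything else is untouched.
black_threshold() {
  local value=$1 percent=$2
  if (( value * 100 < percent * 255 )); then
    echo 0
  else
    echo "$value"
  fi
}

black_threshold 90 70    # dark grey text pixel  -> prints 0
black_threshold 215 70   # light grey box pixel -> prints 215
```

The threshold has to sit between the darkest box shade and the lightest text pixel, which is why the 70% figure may need adjusting from book to book.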

Here's an example with the "blackened" mixed-mode page on the left, and how it splits out. If I hadn't done the blackening step, the whole table would have been on the image side of the house. This is much better. You can see some artifacts left over in the image, but if you use the text part as the -mask argument to c44, it does a good job of ignoring those parts. Plus, since we're compressing the image so aggressively, the little text artifacts tend to get smudged over all by themselves.

[Attachment: ex_txt.png]

Here's a shot of the djvu result once the two components are encoded and recombined:

[Attachment: ex_out.PNG]

... the BG44 section of which was about 1k. I could never reproduce the table as an image that cheaply, but the gray boxes are no problem whatsoever!

Posted: **27 Apr 2012, 00:42**

Ok, so today I tried some pages with both pictures AND colored text. And, as usual, I want all the text to run through minidjvu for best quality and size. For the colored text, I tried the code given in this post, but it didn't quite work for me. I did find a similar method that seems to be working on my files, so I thought I'd share it here. It really is the same method as in the link above, and I'm grateful to have found it or this would have taken me much longer to figure out... I just needed slightly altered processing to make it work on my files.

For simplicity, let's just focus on colored text and not the image, since it's really two issues (the image goes in the BG44 chunk of the final file just like the case when there's no colored text anyway).

So, let's look at this page. Not the greatest-looking scan, to be sure. Hopefully we'll clean it up a bit in the process. All these pictures are of course 10%-size versions so I'm not putting huge files on the forum. You should still be able to get the idea:

[Attachment: expage.png]

... First, I use scantailor to produce a nice bitonal version, which I run through minidjvu...

[Attachment: expagebw.png]

... So that's the mask layer (Sjbz) of the final product. Now we need the FG44 chunk to tell us which portion of the text should be red. I use this command:

Code: Select all

convert ${1}.tif -black-threshold 70% -white-threshold 80% -fuzz 20% -fill crimson -opaque "#F2628E" -fill black +opaque crimson -morphology DilateI:3 Octagon:3 -morphology DilateI:3 Rectangle:3x1+1+0 -morphology CloseI:3 Disk -resize 20% pt1.ppm

Let's break this command down:

-black-threshold 70% -white-threshold 80%: This just makes nearly-black pixels black and nearly-white pixels white. It helps ensure we don't accidentally think the normal text is colored text in the next step.
-fuzz 20% -fill crimson -opaque "#F2628E": This grabs pixels that look like the input red pixels, and turns them crimson. I used a paint program to identify the input color as near #F2628E.
-fill black +opaque crimson: This isolates the crimson pixels I just created and makes everything else black.
-morphology DilateI:3 Octagon:3 -morphology DilateI:3 Rectangle:3x1+1+0 -morphology CloseI:3 Disk: This fattens up the crimson areas via dilation and "closing". I read about it here: http://www.imagemagick.org/Usage/morphology/ In a perfect world this shouldn't be necessary, but if the first step didn't quite grab all the right pixels, then this acts as insurance. We'll likely have covered all the text area in crimson after this step.
-resize 20%: We don't need to do this color masking at the full DPI of the image. This saves on file size.

... maybe a '-despeckle' should be in there prior to the morphology stuff in case I still manage to grab a stray pixel or two... hmm... anyway...

The output of the above command looks like this:

[Attachment: pt1.png]

See how the letters are all fattened up? Perfect!

Now, we simply encode that with c44, and extract the BG44 chunk, which will be the FG44 chunk of our final product:

Code: Select all

c44 -dpi 120 -decibel 50 pt1.ppm _foreground.djvu
djvuextract.exe _foreground.djvu BG44=_foreground.iw4

NOTE Common sense says it would be good to use the inverse of the text as a mask for this step, to tell c44 that we don't need much fidelity where there isn't any text to color. However, in my (admittedly limited) tests, the mask caused the output to be all wonky and the crimson color bled into the surrounding text. I'm not sure why this happened, but at least in my case the mask did NOT help me. I may run further tests later to see if I can get that to work.

So, now we have a FG44 chunk, and minidjvu will give us the Sjbz chunk for the text itself. If there were an image on the page we'd use c44 to get a BG44 chunk... and djvumake combines it all for us:

Code: Select all

djvumake out${1}.djvu INFO=,,600 Sjbz=sepvu_txt.djvu FG44=_foreground.iw4

And it looks like this:

[Attachment: output.png]

...which in my opinion is better than the original since the colors are all very clean and deep (unlike the input I was working with).

Posted: **30 Apr 2012, 00:02**

Today's topic: encoding efficiently. I arrange my workflow so that all text goes through minidjvu, and then any pictures are grafted onto the text pages afterwards via djvumake. But, when encoding a 300-page book, you don't want a single minidjvu process... you want at least 1 per core going. So, for my quad-core laptop, I use code like this in the prelude of my scripts to figure out how many files to group together (bash shell):

Code: Select all

# get the total number of files...
file_count=$(ls -1 page-???.tif | wc -l)

echo "There are $file_count files to process."

# figure out how many files per run of minidjvu to use
# in order to run 4 to 5 instances of it concurrently
numpermini=$((file_count / 4))
if [ $numpermini -lt 10 ]
then
  numpermini=10
fi

while [ $((numpermini % 10)) -ne 0 ]
do
  ((numpermini=numpermini-1))
done
echo " ... so going to run $numpermini files in each instance of minidjvu."

... To maximize efficiency, I make sure I run a multiple of 10 files at a time through minidjvu. Why? Because by default it creates its shared shape tables across 10 files at a time, and I run at the default setting. This way, I always get the most use of the shared tables, and guarantee that I'm only going to run 4 or 5 concurrent processes. That's pretty much ideal.
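For illustration, splitting a file list into groups of that size could be sketched as follows. The function name `group_files` is hypothetical, the array name `process_list` is borrowed from the script in this post, and the sketch assumes file names without spaces.

```bash
#!/bin/bash
# Sketch: split a list of file names into space-separated groups of a fixed size.
# group_files is a hypothetical helper; it fills the global array process_list.
# File names must not contain spaces for this to work.
group_files() {
  local size=$1; shift
  local group="" count=0 f
  process_list=()
  for f in "$@"; do
    group+="${group:+ }$f"
    count=$((count + 1))
    if (( count == size )); then
      process_list+=("$group")
      group=""
      count=0
    fi
  done
  # leftover files form a final, shorter group
  if [ -n "$group" ]; then process_list+=("$group"); fi
}

# Example with a group size of 2 (in practice the size would be $numpermini):
group_files 2 page-001.tif page-002.tif page-003.tif page-004.tif page-005.tif
printf '%s\n' "${process_list[@]}"
```

Each array element then holds the file arguments for one minidjvu run.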

Then as the files are pre-processed, they are collected into an array in groups of the right length. Once everything is collected, I just run them like this:

Code: Select all

  # get the last index...
  ind=${#process_list[@]}
  ind=$((ind-1))

  # loop through the files array, spawning minidjvu if needed.
  for x in $(seq  0 $ind)
  do
    if [ ! -e xindex_${x}.djvu ]
    then
      echo "EXECUTING: minidjvu -dpi 600 -r -l -i ${process_list[x]} xindex_${x}.djvu &"
      minidjvu -dpi 600 -r -l -i ${process_list[x]} xindex_${x}.djvu &
    else
      echo "xindex_${x}.djvu already exists, so skipping."
    fi

  done

  wait

The "wait" at the end is very important, since we need to pause at this point in the script until all of the minidjvu's finish running. Technically I should be collecting the error codes and making sure all of them were successful, but since it's just a personal script and nothing too terrible happens if one of them fails, I took the lazy way out.
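Collecting the exit codes does not take much. A minimal sketch, assuming the PID of each background job is saved from `$!` when it is started; `wait_all` is a hypothetical helper name, and `true`/`false` stand in for the real minidjvu runs.

```bash
#!/bin/bash
# Sketch: wait for each background job individually and count the failures.
# wait_all is a hypothetical helper; pass it the PIDs saved from "$!".
wait_all() {
  local failures=0 pid
  for pid in "$@"; do
    # "wait PID" returns that job's exit status
    if ! wait "$pid"; then
      echo "job $pid failed" >&2
      failures=$((failures + 1))
    fi
  done
  return "$failures"
}

# Usage pattern (stand-in commands instead of minidjvu):
pids=()
true &  pids+=("$!")
false & pids+=("$!")
wait_all "${pids[@]}" || echo "$? job(s) failed"
```

The function has to be called from the same shell that started the jobs, since `wait` only knows about that shell's own children.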

As a side note, you can see above that I skip the run if the djvu file it would create is already there. I do something like this at every stage of processing, so that I can correct things piecemeal when I'm proofreading the output. Say one of the pictures comes out fuzzy. I can delete the .iw4 file and re-run with different parameters. The script will skip every step except to make the missing iw4 file and then add it to the correct .djvu page. Knowing how useful that has turned out to be, I almost wish my script generated an appropriate makefile, since all that tracking would be done for me instead of me having to code it explicitly. Oh well... maybe if I do a re-write.
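Short of generating a makefile, a make-style freshness test can be written in a few lines of bash. This sketch goes one step further than the existence check described above by also comparing timestamps; `needs_rebuild` and the file names in the usage comment are hypothetical.

```bash
#!/bin/bash
# Sketch of a make-style freshness check. needs_rebuild is a hypothetical name.
# Succeeds (returns 0) when the target is missing or older than any source.
needs_rebuild() {
  local target=$1; shift
  local src
  [ -e "$target" ] || return 0
  for src in "$@"; do
    if [ "$src" -nt "$target" ]; then return 0; fi
  done
  return 1
}

# Usage pattern:
#   if needs_rebuild page-003image.iw4 page-003image.ppm; then
#     ...re-run the encoder for that page...
#   fi
```

With this, editing an input file is enough to trigger the step again; deleting the output still works too.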

Posted: **30 Apr 2012, 10:54**

Today's topic: working with PDF files as input. As I mentioned in my first post, I am practicing on PDF files to make sure I am competent before I (destructively) scan all my books. In the process, I have learned a few things.

Tip #1 Don't use Imagemagick (convert) to split a PDF file into individual images. It is terribly slow for some reason. Even though it uses ghostscript under the covers, you will find using ghostscript directly is orders of magnitude faster.

Tip #2 Split the PDF into PNG files, rather than TIF. Scantailor can read the PNGs and output TIFs later, so the end result is the same... so why do I say this? Well, on one PDF file I found the TIF output was fine-looking, but Scantailor's output was 80% black on some pages for reasons I'll never understand. However, splitting into PNG and re-doing Scantailor avoided that issue, so I've done it ever since, just in case.

Tip #3 I see examples on the net using gs to split a PDF with the pngalpha driver. This is wasteful for our purposes. If the pages are black-and-white, use the pngmono driver. If there are grayscale diagrams use the pnggray driver. Any color pages should use the png16m driver. Don't be tempted by the png256 driver if you know that they didn't use many colors in the original printing... you'll get a messy dithered result.

So for example, this is typically how I run a book with diagrams and grey pictures:

Code: Select all

 gswin64c -sDEVICE=pnggray -dFirstPage=2 -dDOINTERPOLATE -sOutputFile=page-%03d.png -dSAFER -dBATCH -dNOPAUSE -r600x600 Myfile.pdf

FirstPage=2 is there to skip the cover, which I'll extract later with the png16m driver to get the color. (Note since I'm on a windows box I use gswin64c, while I believe on linux it's just 'gs')
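The driver choice from Tip #3 can be wrapped in a small lookup. This is only a sketch: `gs_device_for` and the type names mono/gray/color are invented for the example, and the command is printed rather than run.

```bash
#!/bin/bash
# Sketch: map a page type to the Ghostscript device suggested in Tip #3.
# gs_device_for and the type names (mono/gray/color) are made up for this example.
gs_device_for() {
  case $1 in
    mono)  echo pngmono ;;
    gray)  echo pnggray ;;
    color) echo png16m ;;
    *)     echo "unknown page type: $1" >&2; return 1 ;;
  esac
}

# Print (not run) an extraction command for a grayscale book:
dev=$(gs_device_for gray)
echo "gs -sDEVICE=$dev -dFirstPage=2 -sOutputFile=page-%03d.png -dSAFER -dBATCH -dNOPAUSE -r600x600 Myfile.pdf"
```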

Tip #4 Don't trust a PDF's reported DPI. You can see in the command line above I've requested the pdf be rasterized at 600x600DPI. I'm not sure exactly how gs and the pdf file interact to work that out, but clearly sometimes it's off, and you don't want to propagate that error. The rule of thumb mentioned on a Scantailor tutorial I saw was to measure how many pixels high 6 lines of text are. That's approximately the DPI in many cases.

So, after extracting at a supposed 600DPI, I check. If it's off and the files are too big, I resize them to 600DPI-size with 'convert' making sure to set the -density to 600 since I know that's what it really is. If it's off and the files are too small, I delete everything and re-extract asking for an appropriately higher DPI. So, let's say I ask for 600 and the output is more like 300... I extract again at 1200DPI, and make sure I tell ScanTailor that it's really 600 when I import them.
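The re-extraction arithmetic is simple proportion. A minimal sketch, assuming the output size scales linearly with the resolution passed to gs; `corrected_request` is a hypothetical name.

```bash
#!/bin/bash
# Sketch: work out what resolution to request on the second pass.
# corrected_request is a hypothetical name; all values are in DPI.
#   requested = what was passed to -r on the first pass
#   measured  = what the output really turned out to be
#   target    = the resolution actually wanted
corrected_request() {
  local requested=$1 measured=$2 target=$3
  echo $(( requested * target / measured ))
}

corrected_request 600 300 600   # asked for 600, got about 300 -> prints 1200
```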

Some really bad PDFs will have a mix of DPIs in them, and it takes more work but you can extract and re-size until they are all the same. It's not strictly necessary since you can tell scantailor about the mix that you have, but something about my personality makes me want them to be the same in the first place--like they would be on a fresh new scan.

Posted: **03 May 2012, 14:34**

hmmm.... when I selected DJVU as my format of choice, it was based on the appealing technical merits of the format. However, I've been looking around and DJVU is definitely giving me the impression that it's dying out. That could be incorrect, but that's the impression it's giving me.

Witness:

DjvuLibre's front page http://djvu.sourceforge.net/ points me to LizardTech for commercial offerings, when the company has had nothing to do with DJVU for years now. That was surprising, and a big red flag to me. When the community can't keep their front page up to date, it feels like a ghost town.
I eventually found caminova, but their software was last updated in January of 2010, and when I search for reviews I mainly find threads about how the quality has gone downhill. Really not a good sign.
When I go to the Safer-Create website, I don't even see the word DJVU on the page, though djvu sites say they make djvu files.
On the djvu.org forums, many questions have zero responses.
The number and quality of DJVU readers for iphone/ipad is pretty poor compared to PDF readers. The main one I had heard about, Stanza, is not only a defunct product but also internally converted djvu to huge pdfs. The app I'm using now is OK, but very slow with occasional crashes. That's not inspiring.
I see people on this board saying the compression gap between PDF and DJVU has largely closed for text due to PDF's support for jbig2, and a lot of the DJVU articles I had been so impressed with predated this development. So maybe instead of converting my old PDF scans, I should be optimizing them.

... and that last point means I really haven't wasted much time, since separating out text and images for later recombination is also what I need to do to get tiny PDF files (as far as I can tell anyway, barring Adobe's ClearScan which is expensive).

So, to that end, I have downloaded and compiled leptonica and jbig2enc on cygwin. That went flawlessly. So later I will try pulling out the Sjbz layer of a djvu file and re-encoding it as jbig2. I'll pull out the BG44 layer and encode it as jpeg2K. Then my question is how to combine these on a PDF page. Any clues there?

Posted: **03 May 2012, 17:25**

Follow-up. Oddly it wasn't immediately obvious which command I should use to split my djvu files into TIFs. So, I wrote a small script to do the work:

Code: Select all

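#!/bin/bash
# Splits a multi-page DjVu file (first argument) into one TIFF per page.

# count the page entries listed by djvm
numpages=$(djvm -l "$1" | grep -c 'PAGE #')
echo "$numpages pages in the DJVU..."

for x in $(seq 1 "$numpages")
do
  # zero-pad the page number so the files sort correctly
  pagename=$(printf "%03d" "$x")
  echo "On page-${pagename}.tif"
  ddjvu -format=tiff -page="${x}" "$1" "page-${pagename}.tif"
done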

Now, I run jbig2 on them:

Code: Select all

jbig2 -v -s -p page-*.tif

Then, to my disappointment, the 'pdf.py' that comes with jbig2 is a python2 file. I don't really want python2 on my machine. I wish the world would just move to python3. So, I updated the script to python3, mostly just needing to encode() the binary strings. Then I ran it:

Code: Select all

python pdf.py output > out.pdf

... and immediately I have two observations:

jbig2enc runs WAY FASTER than minidjvu
the resulting PDF file was smaller than the DJVU file.

This was on a file that was entirely black-and-white text. 404 pages of it to be exact. So color me impressed with jbig2enc and bullish on PDF going forward. I still need to track down tools to:

shove a jpeg2000 file into a pdf page, like djvumake will do with BG44 on djvu pages.
do the OCR... I found hocr2pdf... maybe if I use it with -no-image, I can then shove the jbig2/jpeg2k stuff on top of it later?
do the outline/metadata. I found pdftk, and I think it will work for that

Any ideas on that stuff? Is anyone even reading any of this?

(by the way, I tried to attach the updated python3 pdf.py file I made, but the board doesn't allow that extension. Oh well...)
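On the outline item, one possible route is to generate bookmark records in the style that pdftk's dump_data prints and feed them back in. This is a sketch under assumptions: it presumes a pdftk version whose update_info accepts bookmark records, and the `level|page|title` input format and the function name are invented for the example.

```bash
#!/bin/bash
# Sketch: turn "level|page|title" lines into bookmark records in the style
# that pdftk's dump_data prints. The input format and the function name are
# invented for this example.
outline_to_pdftk() {
  local level page title
  while IFS='|' read -r level page title; do
    # skip blank or incomplete lines
    [ -n "$title" ] || continue
    printf 'BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %s\nBookmarkPageNumber: %s\n' \
      "$title" "$level" "$page"
  done
}

printf '1|1|Preface\n2|5|Chapter 1\n' | outline_to_pdftk
# The result could be saved to a file and applied with something like:
#   pdftk in.pdf update_info bookmarks.txt output out.pdf
```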

Posted: **05 May 2012, 23:17**

Update: after a couple days, I now have a python3 script that takes jbig2 output as a mask, and can combine that with background and foreground files of a much lower DPI. The result, and the process that leads up to it, is very much like the DJVU workflow I had before.

A couple notes:

imagemagick will sometimes make a PaletteMatte JP2 file, depending on the input. Acrobat reader and sumatraPDF handle this fine, but Foxit and a few others display nothing on the page. I was so confused by this for hours--why were some images fine and others not? Eventually I hit on the PaletteMatte thing, and "mogrify -type TrueColor xx.jp2" saved the day.
When the foreground layer has less DPI than the mask (which is kinda the point!), every PDF reader I've tried handles it fine except ones that use the iOS PDF engine. On those, you get the lower resolution for everything, so colored text comes out blocky. So that means iBooks and Kindle and iOS Safari, and many free apps are out. Adobe's iOS app handles it fine, and I may see if GoodReader can handle it soon, since adobe's app doesn't have document organization tools yet.

So, my next steps are:

Code support for hidden text layer. I'll be looking at the code in hocr2pdf and pdfbeads for a model of how to do that.
Code support for outlines/bookmarks, so I can have a clickable table of contents.

With those two things, I will be able to easily convert the DJVU files I've made (only about 25 books, thank goodness), and continue forward in PDF-land.

Next longer term goal: write a program to split out text/background/text-color layers automatically. When there's a lot of colored text on the page, scantailor just doesn't quite cut it. I think with heuristics I can at least get something that's quick and easy to fix up by hand.

Posted: **05 May 2012, 23:29**

I just wanted to say: Keep up the good work!

I'm pretty clueless when it comes to these things and DJVU is not something I have an interest in but PDF I do. And I can see this being very useful down the road for a lot of people. So the least I can do is encourage you to go on.

Learning to Create Tiny DJVU files