Library Scanning - Start of new project

General discussion about software packages and releases, new software you've found, and threads by programmers and script writers.

Moderator: peterZ

fishent
Posts: 6
Joined: 26 Jan 2015, 16:23
Number of books owned: 0
Country: Thailand

Library Scanning - Start of new project

Post by fishent »

I am presently starting a new project scanning 15,000 older books. I wish I had found this forum sooner. For better or for worse, I presently own the Fujitsu SV600 scanner. I purchased it because we cannot destroy the books we are scanning. Now comes the fun part.

The software is lacking for our purposes. Our intention is to scan, save as searchable pdf (possibly doing OCR in the future) and make the books available for free public access. The project is huge and my resources are limited. I want to keep it simple. It's just like life: if you don't keep your eye on the ball, things can get complicated. For starters, here are some basic questions I have:
1.) Pages need to be flat and straight. If fingers cannot be removed by global means, then I would simply come up with a way to flatten the page before scanning. Is there such a thing? Non-reflective glass?
2.) I would also need to globally crop the pages, as the scans show darker page edges on the right and left border.

Any other advice would be helpful now that I am at the beginning stages of this project. I can see this forum has a lot of very good information, so I apologize in advance, but I would appreciate some tips from the experts pointing me in the simplest direction... I feel flooded with all that is available!
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Library Scanning - Start of new project

Post by BruceG »

I have used a scanner for most of my work, but recently I photographed a number of books on an interstate trip. I did this to save time. Camera on a tripod pointing down at the book on a table, turn page, click, etc. A poor man's Fujitsu SV600. I now have similar problems to yours. My end point is an OCR'd, searchable pdf: searchable in that I can search all my material at once. Acrobat can make an index of all the material, and Adobe Reader can then be used for searching. Great for researching.

I understand the software with the Fujitsu SV600 can remove fingers. Other options are during OCR, or cropping with Acrobat. Both methods will remove the 'darker pages on the right and left border'.
As for keeping pages flat, I failed here, except for using one finger (the other hand was used to take the picture). There is software that helps with this (see the articles by David Landin; he has also made some videos). But taking two pages at a time, as the Fujitsu SV600 does and as I did, does not help, and I would not do that again. Still, without it I would not have photographed as much as I did.

There is non-reflective material, lighter and less prone to breakage than glass. Again, I would refer you to David Landin's work.

15,000 is a lot of books. You may find that some of the out-of-copyright books have already been done. Books are easier than magazines, though.

trust this helps a little
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Library Scanning - Start of new project

Post by duerig »

For the laser scanning work, I've developed techniques for background removal, crop to page, and finger removal. They are pretty robust if you have the right calibration data (a suitable picture of your hands and the background in the same lighting conditions as the book). You could use those parts of the program without using the laser dewarping part (since you lack laser lines).

Physically holding books flat doesn't sound like an option for you. The scanner you are using would not be able to deal with an improvised v-shaped cradle. And a piece of glass or plastic on a flat book would likely damage the book. Holding it flat with your fingers at the edges will provide only limited help.

Dewarping pages in software is much trickier. You might have some luck passing the cropped versions of the pages through ScanTailor in command-line mode. It does a reasonable job on many pages, especially ones with a lot of text. But it has trouble frequently enough that you will have to review each page afterwards and fix some small percentage of them.
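A minimal sketch of such a command-line batch run (the per-image invocation and flag names follow the scantailor-cli script posted later in this thread; paths and DPI values are placeholders, and the loop skips quietly if scantailor-cli is not installed):

```shell
#!/bin/sh
# Sketch: batch ScanTailor over a directory of cropped page images.
# Flags are illustrative; check scantailor-cli --help for your version.
mkdir -p out
if command -v scantailor-cli >/dev/null 2>&1; then
  for img in *.jpg; do
    [ -f "$img" ] || continue    # no matches: the glob stays literal
    scantailor-cli --dpi=300 --output-dpi=600 \
        --color-mode=black_and_white --margins=5 "$img" out
  done
fi
```

Each page then lands in out/ as a processed TIFF, ready for review.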

Once you have pages that are flat and cropped, there are both open source and commercial packages that do OCR. The OCR will be necessary to make the book searchable, even if you are displaying photos of the pages for people to read. The open source program to do this is called PDFBeads, though I haven't had a lot of success with it myself.
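As one concrete open-source route (a sketch using Tesseract, which also turns up later in this thread; PDFBeads is the other option named above), each cleaned page image can be turned into a searchable PDF. The echo keeps this a dry run; remove it to actually invoke tesseract:

```shell
#!/bin/sh
# Sketch: per-page searchable PDF via Tesseract's "pdf" output mode.
# File names are hypothetical; "echo" makes this a dry run.
for img in page_001.tif page_002.tif; do
  out=${img%.tif}                     # strip extension for the output base name
  echo tesseract "$img" "$out" pdf    # would write $out.pdf with a text layer
done
```

The per-page PDFs can afterwards be combined with any PDF merge tool.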
fishent
Posts: 6
Joined: 26 Jan 2015, 16:23
Number of books owned: 0
Country: Thailand

Re: Library Scanning - Start of new project

Post by fishent »

Dear Everyone,

Thank you for your advice and suggestions. Since I am new to this forum I do not visit it that often. Please do not consider this a lack of interest, as my project spans 15,000 books, so I have been off doing some practical testing so that I can speak with a bit of intelligence. You guys are great and professional, and I respect your hard work; believe me, I know it is hard work. I wish I could have met you folks before I made my investment, sorry.

Here is what I have found so far. I am stuck with two Fujitsu ScanSnap SV600s. It has its own driver, so it is not compatible with any other scanners on the market, and it only scans to JPG and PDF format. Not sure which JPG format, as the menu says JPEG (jpg) and it saves as JPG, so I'm not sure if it is JPG or JPEG.

Here is the good side. I can set the scanner to scan every three seconds. I have also found through hours of testing that I can use a piece of glass to overlay the book. And yes, I can make a balancing device to keep the book flat. Here is the trick with the glass: you need to spray it with a special matt spray to avoid reflection, and you need to put black matt tape around the borders of the glass. This means that each scan is automatically cropped, without fingerprints anywhere on the scan. The software can also correct sloping text, but only up to 5%.

I have chosen to save the scans as 600 dpi, JPG format. At least, this is the extension the scanner software gives the files. The files are automatically cropped during this process and saved, one JPG file per single page.

So I end up with a JPG file where the text has good borders on all sides, but not even ones. I thought I would try the good ScanTailor software to do some final touch-up, but when I go to load the project, I find that ScanTailor does not recognize the JPG files.

Should anyone be kind enough to give me some advice, I have uploaded a few JPG files (3 MB) to MailBigFile; they can be downloaded by clicking the following zip file link. The file name is Scanning Forum Zip and the MailBigFile download link is http://mbf.cc/mo0ier

Can anyone give this amateur some help, please? I understand how difficult it can be discussing these things on a forum. I am glad to discuss by telephone or Skype.

Kind Regards - Myron
cday
Posts: 456
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Library Scanning - Start of new project

Post by cday »

fishent wrote:I am stuck with two Fujitsu Scansnap SV600. ... it only scans JPG and PDF format. Not sure what JPG format, as in menu it says JPEG (jpg) and it saves as JPG. So I’m not sure if it is JPG or JPEG.
JPEG is the 'official' name of the format, but .jpg is the usual file extension, although .jpeg is also sometimes used:

http://en.wikipedia.org/wiki/Jpeg
I can set the scanner to scan every three seconds. I have also found through hours of testing that I can use a piece of glass to overlay the book. And yes, I can make a balancing device to keep the book flat. Here is the trick with the glass. You need to spray it with a special matt spray to avoid reflection
Non-reflective glass (and acrylic) is available if you are able to access it and it is worth the effort...
... you need to put black matt tape around the borders of the glass. This means that each scan is automatically cropped, and without fingerprints anywhere on the scan.
That seems a novel idea...
So, I end up with a jpg file, the text has good borders on all sides, but not even. I thought I would try to use the good ScanTailor software to do some final touchup, but when I go and try to load the project, I find that ScanTailor does not recognize the jpg files.
Could that problem be caused by the issue described in this post?

http://www.diybookscanner.org/forum/vie ... lor#p17281
Can anyone give this amateur some help please. I understand how difficult it can be discussing these things on forum. I am glad to discuss on telephone or Skype.
Anyone else...
fishent
Posts: 6
Joined: 26 Jan 2015, 16:23
Number of books owned: 0
Country: Thailand

Re: Library Scanning - Start of new project

Post by fishent »

Thanks for the tip, I'll look into the antivirus software issue; I'm using Avast. Maybe I should uninstall it and see if this makes a difference? I'm new to ScanTailor, but the demo I saw was impressive.

Good to know that jpg and jpeg are the same format, do I understand right? It seems like I read somewhere that jpg had better results, i.e. higher resolution, than jpeg.
cday
Posts: 456
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Library Scanning - Start of new project

Post by cday »

fishent wrote:Thanks for the tip, I'll look into the anti virus software issue, I'm using Avast.
The tip was really more about the need to click on 'Select Folder' in order to see the images in the selected folder, which isn't entirely obvious...

I would try that first, as I think it is the likely explanation for the problem you described which also confused me.

Edit:

This post from the thread linked to in my first post, and the earlier posts in that thread, explain the issue in more detail:

http://www.diybookscanner.org/forum/vie ... 989#p17286
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Library Scanning - Start of new project

Post by BruceG »

fishent
I had a look at your scans. Are these the originals, or did you reduce the size? I would have expected 600 dpi to make larger files; perhaps the Fujitsu did some processing.
I used OmniPage to OCR them as-is and output PDF files. The first is without editing; for the second I used Infix to do some editing.
Scansnap scans no editing OmniPage.pdf
Scansnap scans OmniPage InFix.pdf
The biggest problem with the Fujitsu ScanSnap, judging from the file that was looked at, is the wavy lines and italic text.
I had a look at the Internet Archive for China Inland Mission material and came across the book 'The story of the China Inland Mission', published 1894.

This screenshot shows what formats it is available in. I downloaded the pdf file and extracted page 6 to OCR.
screenshot-archive org.png
China Inland Mission p6.pdf
China Inland Mission OmniPage p6.pdf
The pdf from the Internet Archive had already been OCR'd, perhaps with Acrobat, as there is a text layer on top.
The scanned book is a very good scan, something to aim for with our own scans; mine fall far short of this quality.

So 15,000 books at 300 pages a book is a lot of work. If others have already done the same books, don't waste your time. The Internet Archive is a great source of out-of-copyright books. Also, encourage others to take up the challenge and assist.

I would be interested to know how the material is to be used. If it is just to be read, pdf is smaller than jpg. If text is to be extracted (copy & paste), use pdf. If the 15,000 books are to be searched at once for a person, location, etc., pdf is the way to go. If they are only to be read, the quality of the scans needs to be good enough to enable that on the device being used.
fishent
Posts: 6
Joined: 26 Jan 2015, 16:23
Number of books owned: 0
Country: Thailand

Re: Library Scanning - Start of new project

Post by fishent »

Thank you for your patience. "Click on Select Folder" for the JPGs to show: good idea. I always was a slow learner... but determined. I'll try this. Too bad I have a full-time job to contend with, full days and nights, so be patient. Appreciate all your help. Thank you
murgen
Posts: 19
Joined: 22 Sep 2012, 03:45
E-book readers owned: Kindle
Number of books owned: 1000
Country: Belgium

Re: Library Scanning - Start of new project

Post by murgen »

I understand you are looking for an efficient workflow that will automate as much as possible of the process between the scan and the creation of the PDF/A.

I used ScanTailor, but now I am looking away from it. There are some drawbacks you cannot get around, so I started extensive research into alternatives. My final goal is producing PDFs of good quality: pages and margins must be the same size, and the file size reduced as much as possible.

Let me lay out the shortcomings of ST with respect to PDF/A production, and the alternative solutions. At this stage you are supposed to have a directory of all the scanned images, sorted but not yet cropped (I use Lupas Rename to merge the two streams of photos: http://rename.lupasfreeware.org).
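For anyone without a Windows renamer handy, the same two-stream merge can be sketched in plain shell (the left_NNN.jpg / right_NNN.jpg names are made up for illustration; adapt them to your camera's naming):

```shell
#!/bin/sh
# Sketch: interleave two renamed camera streams (left/right pages) into one
# page-numbered sequence. Creates a few empty demo files so the sketch runs.
set -e
mkdir -p demo merged
for i in 001 002 003; do : > "demo/left_$i.jpg"; : > "demo/right_$i.jpg"; done

n=1
for left in demo/left_*.jpg; do
  base=${left#demo/left_}                                       # e.g. 001.jpg
  cp "$left" "$(printf 'merged/page_%04d.jpg' "$n")"; n=$((n + 1))
  cp "demo/right_$base" "$(printf 'merged/page_%04d.jpg' "$n")"; n=$((n + 1))
done
```

The result is merged/page_0001.jpg onward, alternating left and right pages.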

My first batch used the ScanTailor client in batch processing mode and Tesseract 3.03 for the OCR (compiled under Cygwin), whose OCR is very good, nearly on a par with ABBYY 12 on standard fonts, less good on italic and irregular fonts.

But the ST crop function is not always accurate, and it may crop off text and page numbers. ST dewarping is not bad, but ImageMagick's 'convert -deskew 50%' is simply better.
The intermediate solution was running ST steps 1-5, having ST generate a project file, opening ST manually, reviewing the document content boxes and processing the output. But for 500 pages it takes 1 hour.
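The convert deskew step mentioned above can be sketched as a simple loop (it assumes ImageMagick is installed and skips quietly if it is not; file names are placeholders):

```shell
#!/bin/sh
# Sketch: deskew every page image with ImageMagick, writing into deskewed/.
# -deskew 50% is the threshold quoted above; +repage drops the old page offset.
mkdir -p deskewed
if command -v convert >/dev/null 2>&1; then
  for f in *.jpg; do
    [ -f "$f" ] || continue        # no jpgs present: the glob stays literal
    convert "$f" -deskew 50% +repage "deskewed/$f"
  done
fi
```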

But the really bad thing in ST is the margins. If you don't set 'adapt margins to all pages' you end up with pages of different sizes. If you do set it, first you need to track down which pages give you horribly large margins, and then you realize in the PDF that the font size has been enlarged for the smallest pages.

I found JPEGCrops (http://ekot.dk/programmer/JPEGCrops/), which allows me to do two things: set up different crops for left and right pages while setting a target output pixel size for the pages. This solves the page size difference without resorting to 'convert -resize', whose side effect is to alter the font size; in the PDF the effect is that you see the font size change on every page.
It takes less time to crop and fix 500 scans than to scroll through them in ST. Set up the crop on an odd page, assign a keyboard shortcut, do the same for an even page, and that's it. The trick is to manage to set the crop box size for odd and even scans to the same size. Stop sometimes to adapt if needed, but the flow is still very fast.
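The same odd/even fixed-box cropping can be approximated with ImageMagick instead of JPEGCrops. This is only a sketch: the geometries and file names are placeholders, and the echo keeps it a dry run (remove it to actually run convert):

```shell
#!/bin/sh
# Sketch: one fixed crop box for odd pages, another for even pages, both with
# the same output size, so page dimensions stay uniform across the book.
odd_box='1200x1900+180+120'      # placeholder geometry for odd pages
even_box='1200x1900+60+120'      # placeholder geometry for even pages
for f in page_001.jpg page_002.jpg; do
  num=${f#page_}; num=${num%.jpg}
  case $num in
    *[13579]) box=$odd_box ;;    # page number ends in an odd digit
    *)        box=$even_box ;;
  esac
  echo convert "$f" -crop "$box" +repage "cropped_$f"
done
```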

The next step is dewarping, bitonal conversion and file size reduction. ST's bitonal output is quite good but not perfect; it may even be very bad in black_and_white mode on low-quality or overexposed scans.
On this site I am exploring a GIMP script which allows batch processing in GIMP and is the best so far: bitonal-converter-v.0.4.zip : http://www.diybookscanner.org/forum/vie ... p&start=20
But it produces TIF files 15% bigger than the best ImageMagick output, whose quality is not too far from GIMP's on standard-quality scans, though I never could reproduce GIMP's quality on low-quality or overexposed images.

Having said that, if your image quality is stable, reasonably sharp and correctly exposed, the best combination I could never beat in terms of quality/PDF size is ST plus convert. On bad quality the combination collapses, because ST screws everything up, which is why I am still searching. I tried ST's mixed-colors mode, but the output size after the bitonal process is too big.
I give the source of the script here and will be happy if somebody can improve it. Note that I used convert's -deskew for the deskew step; it is the best so far.

Code: Select all

#!/usr/bin/sh
# you need to have Scantailor and convert (from ImageMagick) in the PATH
#set -x

#adapt 
ST=/cygdrive/c/app/ocr/st/scantailor-cli.exe

# --------------------------------------------------- 
function help {
cat <<EOF
  
   do_st_here
          
         -c4      : convert : compress group4
        -dpi      : ST      : set input dpi
        -dcd      : ST      : disable content detection. Default is enabled 
         -cm      : ST      : color_mixed             (default grayscale)
         -bw      : ST      : color black_and_white   (default grayscale)
        -zip      : convert :  --compress      zip
          -f      : add an image to process. Wildcards like '*.jpg', '*.tif' are valid
         -g       : convert : force grayscale and white background
         -ft      : convert : use a cubic filter
        -lzw      : convert --compress      lzw
       -size <n>  : convert         1 :  resize=" -resize 600x800" 
                                    2 :  resize=" -resize 720x1080" 
                                    3 :  resize=" -resize 900x1200" 
                                    4 :  resize=" -resize 1050x1400" 
                                    5 :  resize=" -resize 1200x1600" 
                                    6 :  resize=" -resize 1200x1900" 
       -odpi      : ST      : set output dpi
         -ni      : ST      : set --normalize-illumination to true
       -norm      : convert : normalize in black and white
         -mg <n>  : ST      : Set all margins to <n> millimetres; disables other margin settings
       -step      : ST      : 1=pg layout,      2=split pg, 3=deskew inclination, 
                              4=content detect, 5 =margin,  6=output
        -th <n>   : ST      : n<0 --> thinner; n>0 --> thicker
       -tfg       : ST      : force grayscale

       ST: -mright <n> -mtop <n> -mbot <n> -mleft <n>  : margins in millimetres, default is 0

example: 

 do_st_here.sh  -size 3 -step 4 -zip -bw -odpi 600 -dpi 300 -ft  -f 041IMG_7426.JPG
 do_st_here.sh  -size 3 -step 4 -zip -cm -norm -odpi 600 -dpi 300 -ft  -f 041IMG_7426.JPG

usually with -cm you want also -g -norm:

   do_st_here.sh  -size 6 -step 3  -zip  -dk -cm -g -norm -f 134IMG_0759.JPG

for simple usage or to start :
 
   do_st_here.sh -size 6 -ft -g -dcd -f <images>
   
for processing all images in a directory:   

do_st_here.sh -size 6 -ft -g -f '*.jpg'

EOF
exit
}
# --------------------------------------------------- 
despeckle=" --despeckle=off " 
unset threshold
color=black_and_white
start_step=1
odpi=" --output-dpi=600"
dpi=" --dpi=600"
mtop="--margins-top=10"
mbot="--margins-bottom=5"
mleft="--margins-left=5"
mright="--margins-right=5"

while [ -n "$1" ]
  do
    case $1 in
       -b ) black=$2 ; shift ;;
      -b1 ) black=2 ; white=2 ;;
      -bw ) color=black_and_white ;; 	
      -cm ) color=mixed ;; 	
      -c4 ) compression=" -compress group4 " ;;
      -dk ) dk=" -deskew 50%" ;;
     -dcd )  dcd="--disable-content-detection" ;;
     -dka ) despeckle=" --despeckle=aggressive " ;;
     -dpi ) dpi=" --dpi=$2" ; shift  ;;
     -dkc ) despeckle=" --despeckle=cautious " ;;
     -dkn ) despeckle=" --despeckle=normal " ;;
      -dw ) dw="--dewarping=on" ;;
       -f ) 
            for  i in  `ls $2`
                  do
                   LST="$LST $i"
                  done
            shift ;;
      -ft ) filter="-interlace Plane -filter Cubic -define filter:C=0.0" ;;
       -g ) g=" -type GrayScale -colorspace Gray -colors 3" ;;
     -lzw ) compression=" -compress lzw" ;;
      -mg ) mg="--margins=$2" ; shift ;;
    -mtop ) mtop="--margins-top=$2" ; shift ;;
    -mbot ) mbot="--margins-bottom=$2" ; shift ;;
   -mleft ) mleft="--margins-left=$2" ; shift ;;
  -mright ) mright="--margins-right=$2" ; shift ;;
    -norm ) norm="-normalize" ;;
      -ni ) norm_illu=" --normalize-illumination=true" ;;
    -odpi ) odpi=" --output-dpi=$2" ; shift  ;;
      -step ) start_step=$2 ; shift ;;
      -size ) fsize=$2 ; shift 
             case $fsize in
               1 )  resize=" -resize 600x800" ;;
               2 )  resize=" -resize 720x1080" ;;
               3 )  resize=" -resize 900x1200" ;;
               4 )  resize=" -resize 1050x1400" ;;
               5 )  resize=" -resize 1200x1600" ;;
               6 )  resize=" -resize 1200x1900" ;;
            esac
            ;;
      -th ) threshold=" --threshold=$2 " ; shift ;;
     -tfg ) tfg="--tiff-force-grayscale" ;; 
       -w ) white=$2 ; shift ;;
     -zip ) compression=" -compress zip" ;;
      -c  ) contrast="-linear-stretch 10%x10%" ;;
      -v  ) set -xv ;;
       * ) help ;;
    esac
    shift
  done
if [ ! -d out ] ;then
    mkdir out 
fi

black=${black:-0}
white=${white:-0}

# if general  margin is set, disable the other margin
if [ -n "$mg" ];then
    unset mtop mbot mleft mright
fi

# process scantailor part
for img in $LST    # unquoted so the list splits into individual file names
do
       if [ ! -f $img ];then
          echo "I do not find the file $img"
          exit
       fi
       $ST  -v      \
           --layout=1.5 --match-layout=false --alignment=center $odpi $dpi  $dcd $dw \
           $threshold --picture-shape=rectangular \
           --enable-fine-tuning  $mg $mtop $mbot $mright $mleft \
           --disable-content-text-mask  --color-mode=$color   $tfg \
           $despeckle  $norm_illu --start-filter=$start_step $img out
done

cd out
for f in $LST
do
  RAD=`echo $f | sed 's/\.jpg$//' | sed 's/\.JPG$//'`
  if [ -f  $RAD.tif ];then
           convert $RAD.tif  \( -clone 0 -type Grayscale   -alpha off \) \( -clone 0 -alpha extract -blur 1x65000 -level 50x100% \)  \
                     -delete 0 -compose copy_opacity -composite -quantize sRGB  \
                    $resize $dk $norm $filter $contrast $compression  $g ${RAD}_st.tif
           #rm -f  $RAD.tif
           # if you have tesseract setup:
           #tesseract ${RAD}_st.tif ${RAD} -l fra pdf
  fi
done
