Your Opinions For An Alternative Book Scanning System

cemagan · Post by **cemagan** » 17 Feb 2013, 08:18

dtic wrote: 3. A script loads images to Scan Tailor.
4. manual step: user sets DPI value (Or could this be automated reliably already?) Time: < 1 minute.

As far as I know, DPI or PPI expresses the number of dots or pixels captured digitally per inch of the actual printed material. So I think we need the dimensions of a flat page of the book being processed to determine the optimal value. We can then compute PPI by dividing, for example, the number of pixels across the width of a scanned flat page by the width of the actual flat page in inches. Or, for standard-sized books, an average pre-calculated value could be used...
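The division described above can be sketched in a few lines of Python. This is only an illustration: the function name and the example numbers (an 1800-pixel-wide page image of a 6-inch-wide page) are made up, not measurements from any real scan.

```python
def pixels_per_inch(pixel_width: int, page_width_inches: float) -> float:
    """Return PPI as image width in pixels divided by real page width in inches."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


# Hypothetical example: the flat page is 1800 px wide in the image
# and the physical page measures 6 inches across.
ppi = pixels_per_inch(1800, 6.0)
print(ppi)
```

The same function works for the vertical direction if you pass the pixel height and the physical page height instead.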

dtic wrote: 6. manual step: user readjusts selections. Time: this is the most time consuming step! The number of user actions can be decreased a bit using this script. But improving Scan Tailor could save a lot of time here. I think the main problems are when (1) Scan Tailor misses content in the header/footer of pages e.g. page numbers and (2) Scan Tailor positions content incorrectly, e.g. text on a title page starts further down from the top but will be top aligned by Scan Tailor in automatic mode. Scan Tailor enhanced has some tweaks for problem 2 but doesn't fix it completely.

I think the most practical way to determine both the page borders and the 3D structure of the pages is to use 4-5 laser line projectors and take additional photos of the pages with the laser lines superimposed on them. I think the secret to automating the post-processing lies in this application (determining page limits and 3D form properly).

Flat page images can then be produced from the page corners and the 3D form of the page, using a distortion correction algorithm that includes an interpolation technique.

dtic wrote: As long as step 10 is included in the workflow then, for some actual use cases, 100% accuracy isn't necessary since flaws in the finished document discovered later on can be handled by going back and manually redoing some step.

That is reasonable, you're right, dtic.

dtic · Post by **dtic** » 17 Feb 2013, 11:52

cemagan wrote: As far as I know, DPI or PPI expresses the number of dots or pixels captured digitally per inch of the actual printed material. So I think we need the dimensions of a flat page of the book being processed to determine the optimal value.

A good enough DPI approximation is the vertical pixel distance for 6 lines (from top of highest character to bottom of lowest) of text in a captured page image. Tulon, the creator of Scan Tailor, explains here. So the most straightforward solution would be a tool that automatically finds and measures 6 text lines and puts the value in the Scan Tailor DPI dialog window.
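A tool like that could start from a simple row-by-row ink count. The sketch below is not Tulon's method or anything that exists in Scan Tailor. It assumes the page has already been binarized and deskewed into a list of rows where 1 means ink and 0 means paper, and it treats every run of inked rows as one text line. The function names are hypothetical.

```python
def text_line_spans(rows, min_ink=1):
    """Return (top, bottom) row spans for each run of rows containing ink.

    rows: list of rows, each a list of 0/1 values (1 = ink).
    bottom is exclusive, so bottom - top is the line height in pixels.
    """
    spans = []
    start = None
    for y, row in enumerate(rows):
        inked = sum(row) >= min_ink
        if inked and start is None:
            start = y                      # a text line begins
        elif not inked and start is not None:
            spans.append((start, y))       # the text line just ended
            start = None
    if start is not None:
        spans.append((start, len(rows)))   # line runs to the bottom edge
    return spans


def estimate_dpi_from_lines(rows, line_count=6):
    """Approximate DPI as the pixel distance from the top of the first
    text line to the bottom of the sixth, as suggested in the post above."""
    spans = text_line_spans(rows)
    if len(spans) < line_count:
        raise ValueError("not enough text lines found")
    top = spans[0][0]
    bottom = spans[line_count - 1][1]
    return bottom - top


# Synthetic page: 10 blank rows, then 7 "lines" of 30 inked rows
# separated by 20 blank rows each.
page = [[0] * 8 for _ in range(10)]
for _ in range(7):
    page += [[1] * 8 for _ in range(30)] + [[0] * 8 for _ in range(20)]

print(estimate_dpi_from_lines(page))
```

On real photos the hard part is everything this sketch skips: binarization, skew, and telling body text apart from headings, footnotes, or illustrations. Raising `min_ink` is one crude way to ignore specks.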

cemagan wrote: [Re: selection adjustment] I think the most practical way to determine both the page borders and the 3D structure of the pages is to use 4-5 laser line projectors and take additional photos of the pages with the laser lines superimposed on them. I think the secret to automating the post-processing lies in this application (determining page limits and 3D form properly).

There is discussion on such things in this thread.
But note that if the book scanner has a platen (glass sheet pressing the page while capturing) then there is no page curvature that needs fixing in the first place. The 3D/laser stuff is out of my league so I'm mostly focusing on the workflow using Scan Tailor with "regular" images captured by a system with a platen. But on the hardware side making a prototype for a page flipper of the kind used by the Japanese researchers would be an important first step I think.

dpc · Post by **dpc** » 17 Feb 2013, 13:39

dtic,
That's a good summary of the problems we face today in scanning books comprised of mostly text. Thanks for posting. I agree that Step 6 is the one that takes the most intervention.

I think the goal should be to address all of the steps that require manual intervention first, then begin to look at performance improvements by looking at what work can be distributed across multiple cores/processors/machines.

It might be more useful if the post-processor just made some assumptions, went through the entire process without any manual intervention, and flagged pages that it thinks would require attention later. For example, it would help me a lot to get a list of pages that contain non-textual content so that I could quickly check those after post-processing had completed. If we had a "page validator" that spit out a per-page confidence value to a text file, that would probably help in the development of a completely hands-off post-processor.

dtic wrote: A good enough DPI approximation is the vertical pixel distance for 6 lines (from top of highest character to bottom of lowest) of text in a captured page image. Tulon, the creator of Scan Tailor, explains here. So the most straightforward solution would be a tool that automatically finds and measures 6 text lines and puts the value in the Scan Tailor DPI dialog window.

I have a 2"x2" black square on a white calibration page and always shoot these as the first two pages in the collection of images that comprise a scan. I wrote a simple program that scans this image for the black square and writes the horizontal and vertical DPI values to a dpi.txt file. That file is stored in a src folder alongside all the image files, and the folder is later archived.
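dpc's program was not posted, so the following is only a guess at the general idea, not a reconstruction of it. It assumes the calibration image is available as a grid of grayscale values (0 = black, 255 = white), that the square is the only dark object in the frame, and that it is not rotated. The function name and threshold are hypothetical.

```python
def square_dpi(gray, square_inches=2.0, dark_threshold=64):
    """Return (horizontal_dpi, vertical_dpi) from a dark calibration square.

    gray: list of rows of grayscale values, 0 (black) to 255 (white).
    The bounding box of all pixels darker than dark_threshold is taken
    to be the square, whose real side length is square_inches.
    """
    xs, ys = [], []
    for y, row in enumerate(gray):
        for x, value in enumerate(row):
            if value < dark_threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        raise ValueError("no dark pixels found")
    width_px = max(xs) - min(xs) + 1
    height_px = max(ys) - min(ys) + 1
    return width_px / square_inches, height_px / square_inches


# Synthetic 100 x 120 white image with a 60 x 60 px black square.
image = [[255] * 100 for _ in range(120)]
for y in range(30, 90):
    for x in range(20, 80):
        image[y][x] = 0

print(square_dpi(image))
```

Any stray dark pixel (a shadow, the edge of the cradle) would inflate the bounding box, so a real tool would need to isolate the square first, for example by keeping only the largest dark blob.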

dtic · Post by **dtic** » 17 Feb 2013, 17:32

dpc wrote: It might be more useful if the post-processor just made some assumptions, went through the entire process without any manual intervention, and flagged pages that it thinks would require attention later. For example, it would help me a lot to get a list of pages that contain non-textual content so that I could quickly check those after post-processing had completed. If we had a "page validator" that spit out a per-page confidence value to a text file, that would probably help in the development of a completely hands-off post-processor.

If you run a Scan Tailor step 6 (output) batch job in "mixed" mode, it will detect pages with images. If you save the project after the process is done, you can parse the xml file for the phrase mixed. That only gives you a list of page id numbers, which then have to be matched to filenames in a second pass. Still, it would be useful if, after processing in step 6, there was a way to sort pages with images to the top of the list of thumbnails. I think the first and biggest hurdle here is finding a programmer who has the skills to add such improvements to Scan Tailor. I don't.
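A rough sketch of that first parsing pass in Python follows. The exact layout of a Scan Tailor project file is not described in this thread, so the code deliberately avoids relying on particular tag names: it looks for any attribute whose value is mixed and reports the nearest enclosing `id` attribute. The sample XML, including every tag and attribute name in it, is invented for the demonstration and should be checked against a real project file.

```python
import xml.etree.ElementTree as ET


def mixed_page_ids(xml_text, needle="mixed"):
    """Return ids of elements that contain an attribute equal to needle.

    The id is taken from the element itself or its nearest ancestor
    that carries an "id" attribute (assumed here to be the page id).
    """
    root = ET.fromstring(xml_text)
    found = []

    def walk(elem, page_id):
        page_id = elem.get("id", page_id)          # inherit id from ancestors
        if needle in elem.attrib.values() and page_id is not None:
            if page_id not in found:
                found.append(page_id)
        for child in elem:
            walk(child, page_id)

    walk(root, None)
    return found


# Made-up sample; real project files may be structured differently.
SAMPLE = """
<project>
  <output>
    <page id="1"><params mode="bw"/></page>
    <page id="2"><params mode="mixed"/></page>
    <page id="5"><params><color mode="mixed"/></params></page>
  </output>
</project>
"""

print(mixed_page_ids(SAMPLE))
```

Matching the ids back to image filenames would need a second lookup in the same file, which is left out because it depends on the real schema.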

That calibration method is useful. Have you posted the code somewhere?

dpc · Post by **dpc** » 18 Feb 2013, 13:38

dtic wrote: That calibration method is useful. Have you posted the code somewhere?

Sorry, no. My employer doesn't allow me to share code unless I get approval from their lawyers beforehand. They basically own anything I write inside or outside of work unless they review my code and give their prior approval. While I'd be surprised if my bookscanning utilities would cause any problems, I'd rather not get involved in any of that.

dtic · Post by **dtic** » 06 Mar 2013, 08:19

Here is another workflow idea. It bypasses the biggest hassle in ScanTailor - adjustments in the selection step.

hardware steps: the book scanner must have
1. a way that prevents the book from sliding up or down in the tray during capture. A thin piece of wood/cardboard/plastic pushing the book spine from the top and bottom would work.
2. some form of easily repositionable markers on top of the platen glass for the two outer corners of the right and left pages. The markers could perhaps be small squares of brightly colored peel-off window stickers. Or something else that is small and easy to programmatically detect.

post processing software steps:
1. a script collects images from plugged in camera SD cards, sorts them in L/R folders, rotates and renames them in order 1,2,3,4...
2. a script uses OpenCV to, for each image, detect markers and the platen centre line and crop and deskew the rectangle they together make up.
Code for detecting markers could likely be reused from https://github.com/ytsutano/bookscan . That code in action is shown here http://www.youtube.com/watch?v=rjzxlA9RWio
3. Since we now have images of whole book pages (and nothing outside of those pages) we can bypass selection of subparts of the images. We instruct Scan Tailor to select the whole of all pages, set zero margins and do output (scripted to use multiple CPU cores for speed).
4. A script feeds the finished output to a djvu or pdf tool, including OCR. There are several working solutions here already.
5. manual step: set filename
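Software step 1 (ordering and renaming) is the easiest part to sketch. The Python below only builds a rename plan and does not touch any files. It assumes both cameras took the same number of shots, that sorting each folder's filenames gives capture order, and that the left-hand page comes first in each spread. The zero-padded naming scheme is an arbitrary choice.

```python
def interleave_pages(left_files, right_files, ext=".jpg"):
    """Return a list of (source_name, new_name) pairs in reading order.

    left_files / right_files: filenames from the left and right camera.
    Assumes spread i consists of left_files[i] followed by right_files[i].
    """
    if len(left_files) != len(right_files):
        raise ValueError("left and right image counts differ")
    plan = []
    pairs = zip(sorted(left_files), sorted(right_files))
    for i, (left, right) in enumerate(pairs):
        plan.append((left, f"{2 * i + 1:04d}{ext}"))   # odd numbers: left page
        plan.append((right, f"{2 * i + 2:04d}{ext}"))  # even numbers: right page
    return plan


plan = interleave_pages(
    ["L/IMG_2.jpg", "L/IMG_1.jpg"],
    ["R/IMG_1.jpg", "R/IMG_2.jpg"],
)
for source, target in plan:
    print(source, "->", target)
```

Refusing to continue when the counts differ is deliberate: a missed shot on one camera would otherwise shift every later page by one.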

Shaknum · Post by **Shaknum** » 06 Mar 2013, 13:47

dtic wrote: Here is another workflow idea. It bypasses the biggest hassle in ScanTailor - adjustments in the selection step.

hardware steps: the book scanner must have
1. a way that prevents the book from sliding up or down in the tray during capture. A thin piece of wood/cardboard/plastic pushing the book spine from the top and bottom would work.
2. some form of easily repositionable markers on top of the platen glass for the two outer corners of the right and left pages. The markers could perhaps be small squares of brightly colored peel-off window stickers. Or something else that is small and easy to programmatically detect.

post processing software steps:
1. a script collects images from plugged in camera SD cards, sorts them in L/R folders, rotates and renames them in order 1,2,3,4...
2. a script uses OpenCV to, for each image, detect markers and the platen centre line and crop and deskew the rectangle they together make up.
Code for detecting markers could likely be reused from https://github.com/ytsutano/bookscan . That code in action is shown here http://www.youtube.com/watch?v=rjzxlA9RWio
3. Since we now have images of whole book pages (and nothing outside of those pages) we can bypass selection of subparts of the images. We instruct Scan Tailor to select the whole of all pages, set zero margins and do output (scripted to use multiple CPU cores for speed).
4. A script feeds the finished output to a djvu or pdf tool, including OCR. There are several working solutions here already.
5. manual step: set filename

Good news! I already wrote the software to do this: http://www.diybookscanner.org/forum/vie ... num#p15272

The problem I ran into was keeping the book in the right place during scanning. I would think you can mount the QR-Codes on thin rulers or something similar, so they can be slid up and down easily and kept more or less in line with each other. The code is freely available for editing and such (but you really must make any modification available to the community). It uses OpenCV and Leptonica. I was playing around with Leptonica for cleaning up the page and binarization, but Scan Tailor is really still much better.

dtic · Post by **dtic** » 06 Mar 2013, 14:16

Hi Shaknum. I hadn't seen that, neat!

Though your tool, like ytsutano's, has the drawback of using pretty large figures (QR codes) for matching. That steals valuable platen space. I have no experience with OpenCV but my hunch was that there'd be a way to match a much smaller brightly colored dot or square. Do you think that is feasible (for scanning black and white text only books)? I'm thinking that since we know that there will always be four dots in a certain pattern, they could all be alike (no unique QR codes needed) and then identified based on their relative position in the image.
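The "identify by relative position" part is simple once the four dot centres are known. Here is a sketch in Python that leaves the actual dot detection aside and only labels four (x, y) centres. It relies on a common sum/difference trick and assumes image coordinates with y increasing downward and a skew well below 45 degrees. The function name is made up.

```python
def label_corners(points):
    """Label four (x, y) points as corners based on relative position.

    top-left has the smallest x + y, bottom-right the largest x + y,
    top-right the largest x - y, bottom-left the smallest x - y.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four points")
    return {
        "top_left": min(points, key=lambda p: p[0] + p[1]),
        "top_right": max(points, key=lambda p: p[0] - p[1]),
        "bottom_right": max(points, key=lambda p: p[0] + p[1]),
        "bottom_left": min(points, key=lambda p: p[0] - p[1]),
    }


# Four identical dots, found in arbitrary order on a slightly skewed page.
dots = [(905, 48), (52, 1210), (40, 55), (915, 1198)]
corners = label_corners(dots)
print(corners)
```

If a fifth blob is detected by mistake (a coloured illustration, say), this labelling has no way to tell, so the detector would have to guarantee exactly four candidates first.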

For an automated workflow a tool of this kind would have to work from the commandline (input: captured images, output: cropped and deskewed pages). All other postprocessing can be done well by ScanTailor already.

If a tool like this was narrowed for use as a "ScanTailor preprocessor" then it should ideally also work on all platforms where Scan Tailor works.

The above is not meant as complaints against your tool; I'm only thinking out loud from the POV of the workflow I was sketching.

Shaknum · Post by **Shaknum** » 07 Mar 2013, 09:14

I do wonder about using colored dots or something like that with OpenCV. You can easily use my software to try that out. One thing I did to really speed things up and get better accuracy was to crop the picture into four quadrants: top-left, top-right, bottom-right, and bottom-left. You could certainly do something like that so OpenCV only has to look for one color at a time in each quadrant. You can even find the best color and use it in every corner. The QR-Codes have the benefit of providing DPI data, which is very nice to have and might be more difficult to get with just dots and OpenCV. Also, having worked previously with checkerboard detection in OpenCV, I haven't found it terribly reliable, though that may be my fault more than the software's. Keep us posted regarding whatever you come up with, and feel free to use my software to play around with OpenCV. I use it in ImageFileProcessor.mm.
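The quadrant split itself needs no OpenCV. The Python sketch below is not taken from ImageFileProcessor.mm. It simply computes four crop boxes as (left, top, right, bottom) and applies one to an image stored as a list of rows, so that a marker search can run on each quarter separately. Both function names are hypothetical.

```python
def quadrant_boxes(width, height):
    """Return crop boxes (left, top, right, bottom) for the four quadrants."""
    cx, cy = width // 2, height // 2
    return {
        "top_left": (0, 0, cx, cy),
        "top_right": (cx, 0, width, cy),
        "bottom_right": (cx, cy, width, height),
        "bottom_left": (0, cy, cx, height),
    }


def crop(rows, box):
    """Crop a list-of-rows image to box; right and bottom are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in rows[top:bottom]]


# Tiny 4 x 4 stand-in for a photo, with pixel values 0..15.
img = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
boxes = quadrant_boxes(4, 4)
print(crop(img, boxes["bottom_right"]))
```

A marker found at (x, y) inside a quadrant has to be shifted by that quadrant's (left, top) offset to get its position in the full image.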

abmartin · Post by **abmartin** » 09 Apr 2013, 18:35

Great stuff here!

If someone is interested in working on a Scan Tailor preprocessor, I'd love to know what you think of my basic approach. There are two steps I do not know how to automate. First, I'd love to be able to automagically use a gray card without the few mouse clicks I do now. Second, using the excellent ppmunwarp tool to fix image distortions, my approach does spit out a great grid. I'm just too inept to figure out a way of using that grid to automatically generate dpi and have to use gimp's measuring tool... http://www.diybookscanner.org/forum/vie ... =19&t=2795

The only thing really missing from this is that I don't rotate the images until I get them in scantailor. However, that would be really easy to add into the script since there is a final imagemagick step.

Using what I have, perhaps one of you might be able to take out the manual steps, add in a renaming script (I really don't care, because thunar's batch renaming works perfectly fine for me and takes all of 10 seconds per directory), and some other goodies.

That script is the sum total of my computer abilities. Please tear it apart and build something better, because I'm out of my depth!

DIY Book Scanner
