Scan Tailor

Scan Tailor specific announcements, releases, workflows, tips, etc. NO FEATURE REQUESTS IN THIS FORUM, please.

Moderator: peterZ

daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Scan Tailor

Post by daniel_reetz »

holy CRAP Rob, these images look AMAZING. I am basically writing this comment only to express TOTAL AMAZEMENT at how awesome this is!!!
cidrolin
Posts: 5
Joined: 04 Mar 2014, 00:52

Re: Scan Tailor

Post by cidrolin »

Tulon wrote: Well, the book scanning community in Russia is quite prolific. I would say no fewer than 10 scanned books get published daily, although some of them are beautification efforts on scans released earlier. The largest and oldest legal site is http://www.lib.ru. It has existed since 1994 and hasn't changed its design since then :)
Then we have less legal sites, which don't try to hide themselves though.
A little bit off topic, but...

I'm very interested in this book scanning community.

That's not really legal: in a five-second search I found some texts by Emile Ajar (1979), and one book carrying this notice:
Автор НЕ разрешает публикацию данного произведения в других сетевых библиотеках. Not authorised publication
with the full text following!

But that's not the point. These books are in text format, which means scanning, OCR, proofreading and so on. It's a lot of work. Who is actually doing this? Members? That seems hard to believe. Maybe professional scanners paid by the ads?

Do you know?
Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Scan Tailor

Post by Tulon »

Hi guys, I am back. It's not that I've been away, it just proved to be very easy to stop being sent notification messages from this forum. All you need to do is ignore one, even one that notifies you of a post in another thread, which I am not that interested in. I'll try not to fall into this trap again.

Rob, you are doing a great job! I am amazed at how much you have done while I was away. I myself have been working on manual picture zones. I expect a new release of Scan Tailor in 2-3 weeks.

cidrolin,
Actually the Russian text you quoted says more than the English version. Namely, it says that publication on other online resources is prohibited. That means that particular site got a personalized permission from the copyright holder to host that material.
As for who does the scanning, etc., the situation seems to be as follows:
* Legal sites mostly host textual versions of books. I am not sure who actually does all the scanning/OCRing work, but in some cases it's authors/publishers themselves who provide the text. Many authors actually understand that being unknown is much worse than being pirated. For example, this page lists authors and publishers who gave explicit permission to lib.ru to host their material. BTW, AFAIK (I may be out of date on that subject), books published before 1974 in Russia are not protected by copyright.
* Less legal sites mostly host books in DJVU format, sometimes with an embedded OCR layer. Such a layer is used only for searching / copy-pasting, so it doesn't need to be really accurate. Making this kind of book is much easier, and that's what Scan Tailor is targeted at.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.
cidrolin
Posts: 5
Joined: 04 Mar 2014, 00:52

Re: Scan Tailor

Post by cidrolin »

Tulon wrote: Namely, it says that publication on other online resources is prohibited. That means that particular site got a personalized permission from the copyright holder to host that material.
As for who does the scanning, etc, the situation seems to be as follows:
* Legal sites mostly host textual versions of books. I am not sure who actually does all the scanning/OCRing work, but in some cases it's authors/publishers themselves who provide the text. Many authors actually understand that being unknown is much worse than being pirated. For example, this page lists authors and publishers who gave explicit permission to lib.ru to host their material.
OK. In this case, the publishers provided the texts and the permission (but in the case of Ajar, I'm pretty sure the French publisher didn't give its permission; French publishers are really suspicious of digital books. That's not the point).
Tulon wrote: BTW, AFAIK (I may be out of date on that subject), books published before 1974 in Russia are not protected by copyright.
Heh... In my childhood, I could read in every French book "Tous droits de reproduction réservés, y compris pour l'URSS" ("All reproduction rights reserved, including for the USSR"). I learn one thing (or more) every day!
Tulon wrote: Less legal sites mostly host books in DJVU format, sometimes with an embedded OCR layer. Such a layer is used only for searching / copy-pasting, so it doesn't need to be really accurate. Making this kind of book is much easier, and that's what Scan Tailor is targeted at.
Yes. That's the kind of community site that interests me. Not for the illegal aspect, but for the sharing. I would like to know who does the scanning. Is it like MP3s on P2P networks, where a small number of people actually make the rips? But in the case of books, it's a lot of work: are there really many people who scan books at home and post them on the internet? Altruism...
Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Scan Tailor

Post by Tulon »

cidrolin wrote: Yes. That's the kind of community site that interests me. Not for the illegal aspect, but for the sharing. I would like to know who does the scanning. Is it like MP3s on P2P networks, where a small number of people actually make the rips? But in the case of books, it's a lot of work: are there really many people who scan books at home and post them on the internet? Altruism...
I actually have no idea how many people scan stuff. A few dozen, maybe. Actually, Scan Tailor was created to make it easy for the average Joe. Before that, only very dedicated people could manage it. Well, there were dedicated people who used Scan Kromsator and produced quality scans, and there were casual people who used FineReader or Acrobat or whatever. Their scans weren't good at all. In fact, it seems that most Scan Tailor users work on those shitty scans made by others, rather than on their own scans. Meanwhile the dedicated people continue to use Scan Kromsator. The alternative explanation would be that Scan Tailor is so good at processing raw scans that people working on them never feel the need to write to Scan Tailor's forum.
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Scan Tailor

Post by daniel_reetz »

Tulon wrote: The alternative explanation would be that Scan Tailor is so good at processing raw scans that people working on them never feel the need to write to Scan Tailor's forum.
I favor this explanation. ;)
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Scan Tailor

Post by daniel_reetz »

rob wrote: Sadly, I'm going to have to develop my own undistortion step, since the algorithm described in the CTM paper is highly inscrutable. It makes very little sense, and indeed it is not clear how it is supposed to handle areas where there are no detected lines (as in the great blank area in the 2nd image above). Further, the algorithm describes a way to transform the image x,y into an undistorted transformed image x',y', which is usually the wrong way to do transformation, since it tends to leave gaps where some x',y' coordinates don't get mapped from an integral untransformed x,y.

But, I have half a plan. I'm just waiting for inspiration to strike before I get the other half of the plan :)
There's this cool guy named Tom Sharpless who recently mentored a GSoC project that was all about straight-line calibration of cameras. It used just two lines as input, but I can't remember all the other details. I'm going to email him and see if he thinks it would be applicable to our project. It's in the same software package that bkrpr mentioned in another thread.
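
To illustrate the gap problem rob describes in the quote above, here is a minimal Python sketch. It is not the CTM algorithm or anything from Scan Tailor; it assumes a plain 2x upscale as a stand-in for the real undistortion, and the function names are made up for the example. Forward mapping pushes each source x,y to x',y' and leaves holes, while inverse mapping visits every destination pixel and pulls its value from the source.

```python
def forward_map(src, scale):
    """Push each source pixel to its destination; cells never hit stay None."""
    h, w = len(src), len(src[0])
    out_h, out_w = int(h * scale), int(w * scale)
    dst = [[None] * out_w for _ in range(out_h)]
    for y in range(h):
        for x in range(w):
            xd, yd = int(round(x * scale)), int(round(y * scale))
            if 0 <= xd < out_w and 0 <= yd < out_h:
                dst[yd][xd] = src[y][x]
    return dst


def inverse_map(src, scale):
    """Visit every destination pixel and pull its value from the source."""
    h, w = len(src), len(src[0])
    out_h, out_w = int(h * scale), int(w * scale)
    dst = [[None] * out_w for _ in range(out_h)]
    for yd in range(out_h):
        for xd in range(out_w):
            # Invert the transform, then clamp to the source bounds.
            xs = min(w - 1, int(xd / scale))
            ys = min(h - 1, int(yd / scale))
            dst[yd][xd] = src[ys][xs]
    return dst


def count_gaps(img):
    """Number of destination cells that never received a value."""
    return sum(cell is None for row in img for cell in row)


src = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
forward = forward_map(src, 2.0)
inverse = inverse_map(src, 2.0)
print("gaps with forward mapping:", count_gaps(forward))
print("gaps with inverse mapping:", count_gaps(inverse))
```

With forward mapping only the nine source pixels land somewhere in the 6x6 output, so most cells stay empty; with inverse mapping every output cell gets a value. A real dewarping step would replace the division by `scale` with the inverse of the distortion model and would normally interpolate instead of taking a single source pixel.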
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Scan Tailor

Post by rob »

Well, I think it's done. Tulon -- welcome back, just in time!

So I've got all the math worked out, and tested it on a few images in Java. It seems to work well, but I'm sure there will be other images where it doesn't work so well -- such as double-column pages, and pages with lots and lots of randomly placed graphics. It will probably work well on pages of equations, as long as there is some explanatory text that stretches across the page somewhere. It will NOT work on pages where the text is too badly skewed, or where the text runs up and down the page rather than across it.

Now what I need to know is where in the code to put the algorithm, and how to get the pixel data out of the current final step (which has to be 1bpp) and pass the result back to Scan Tailor.

The GUI part is totally beyond me. I was hoping that Tulon could wrap something nice around it... and also mark it as an experimental step!

--Rob
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Scan Tailor

Post by Tulon »

Great to hear that, Rob! I really appreciate your effort.

I think for now the best place to put this functionality is directly in the Output stage, with a checkbox to enable it, just like despeckling. The code would be called from somewhere in OutputGenerator::processImpl(), probably next to the calls to despeckleInPlace(). If we do it like that, it will only really work in B/W mode, but we'll fix that later. If you prefer me to port your Java code to C++, I am willing to do that, but give me 2-3 weeks first to make the next release happen.
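
The arrangement described above (an optional step switched on by a checkbox, sitting next to despeckling, and only active in B/W mode) could look roughly like this. It is a Python sketch for illustration only, not Scan Tailor's C++ code: `OutputOptions`, `remove_isolated_pixels`, `process_output` and the `dewarp_step` callback are all hypothetical names, the despeckler is a toy, and running the new step after despeckling is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

Bitmap = List[List[int]]  # 1 = black, 0 = white; stand-in for a 1bpp image


@dataclass
class OutputOptions:
    """Hypothetical per-page settings, one flag per checkbox."""
    black_and_white: bool = True
    despeckle: bool = True
    dewarp: bool = False  # experimental step, off by default


def remove_isolated_pixels(img: Bitmap) -> Bitmap:
    """Toy despeckler: clear black pixels with no black 8-neighbour."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out


def process_output(img: Bitmap, opts: OutputOptions,
                   dewarp_step: Callable[[Bitmap], Bitmap]) -> Bitmap:
    """Run the optional B/W-only steps; other modes pass through untouched."""
    if not opts.black_and_white:
        return img
    if opts.despeckle:
        img = remove_isolated_pixels(img)
    if opts.dewarp:
        # Assumed order: the experimental step runs after despeckling.
        img = dewarp_step(img)
    return img


page = [[1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]]


def flip_rows(img: Bitmap) -> Bitmap:
    """Placeholder transform so the effect of the flag is visible."""
    return img[::-1]


plain = process_output(page, OutputOptions(), flip_rows)
dewarped = process_output(page, OutputOptions(dewarp=True), flip_rows)
```

Passing the step in as a callback keeps the sketch independent of any particular dewarping algorithm; the real integration would call the ported code directly.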
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Scan Tailor

Post by rob »

I can take a stab at working on the C++ code. At the very least, it will give you a head start if I can't seem to integrate it. Let me take a look at the part of the code you indicated, and I hope you don't mind any questions!

--Rob