I'm building a scanner to help digitize some books in K'ichee', one of the modern Mayan languages of Guatemala. Until recently I'd been working with a flatbed scanner, but my patience is starting to run out...
Here's one of the successes --- this is from a history of the Catholic church in Guatemala, scanned on a flatbed, with very nice despeckling and pretty okay deskewing from a recent beta version of Scan Tailor.
And here's a tougher case. This is from Tzonob'al Tziij, a collection of speeches that used to be given during the preparations for a K'ichee' wedding. That fancy background image is based on a traditional weaving pattern. It's pretty, but man, it confuses the bejeezus out of Scan Tailor. I took this with an overhead camera but no platen. I'm hoping that adding a platen will give me a straighter image and let me skip Scan Tailor altogether (though it may be that OCRing these pages will be just as hopeless).
The three dots in the top right corner are the page number in hieroglyphic numerals. It's a base-20 system: two dots and then one dot means (2x20)+1=41. Nobody does everyday math with the old numerals anymore, but they're catching on for Serious Use in Documents Of Cultural Significance --- rather like Roman numerals in Europe.
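For anyone who wants to double-check page numbers like this by machine, here's a minimal sketch of the place-value arithmetic in Python. It assumes a pure base-20 system as described above, with the digits already read off the page by hand (most significant first). The function names `maya_to_int` and `int_to_maya` are made up for illustration.

```python
def maya_to_int(digits):
    """Convert base-20 digits (most significant first) to an integer."""
    value = 0
    for d in digits:
        if not 0 <= d <= 19:
            raise ValueError(f"digit out of range for base 20: {d}")
        value = value * 20 + d
    return value


def int_to_maya(n):
    """Convert a non-negative integer to base-20 digits, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, d = divmod(n, 20)  # peel off the least significant digit
        digits.append(d)
        if n == 0:
            break
    return digits[::-1]


# Two dots, then one dot: (2 x 20) + 1
print(maya_to_int([2, 1]))  # 41
print(int_to_maya(41))      # [2, 1]
```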
This is the colophon from the same collection of speeches. It's the date when the print run ended, put into the Maya calendar and written out in hieroglyphs. (The last page of the book has the date the print run began, written out the same way.) The calendar is still in occasional use --- not for everyday timekeeping, but in traditional astrology and medicine, which never entirely died out. But really, again, this is more like a Latin inscription in an English book: whether or not you can read it yourself, it gives the whole project an air of gravitas.
K'ichee' book digitization
Moderator: peterZ
Re: K'ichee' book digitization
since that background is a repeating pattern, couldn't you do some kind of 2D blind deconvolution to subtract it out?
- daniel_reetz
Re: K'ichee' book digitization
If you can get sharper images - more like the first page with the hatched pattern - we can help you design a thresholding operation to extract the hatches. The basic formula would be to say "all pixels below X should be black" and "all pixels above X should be white" and go with that. Steve also has a point that it can be treated in frequency space, though I don't think that's necessary just yet - I think if you get reasonably "sharp" images with your camera, we can threshold this stuff out.
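The basic formula above can be sketched in a few lines of Python. This is only a toy version under stated assumptions: the page is a grayscale image held as rows of 0-255 values (0 = black), the cutoff X is picked by hand, and the name `threshold` and the sample pixel values are made up for illustration rather than taken from the actual scans.

```python
def threshold(gray, cutoff):
    """Binarize a grayscale image: pixels below cutoff become black (0),
    everything else becomes white (255)."""
    return [[0 if px < cutoff else 255 for px in row] for row in gray]


# Toy 3x4 "page" (assumed values): dark text (~30),
# mid-gray background pattern (~170), bare paper (~240).
page = [
    [240, 170, 30, 240],
    [170, 30, 30, 170],
    [240, 170, 240, 30],
]

clean = threshold(page, cutoff=100)
for row in clean:
    print(row)
```

With a cutoff of 100, only the text pixels stay black and the pattern drops out with the paper. On a blurry photo the edges of the letters drift toward the pattern's gray level, which is why no single cutoff separates them cleanly.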
I'll return to the US in two weeks and can help - you might also have a look at the thresholding work going on here - there's a thread about red text/colored text that has a number of fairly advanced techniques for getting the work done.
- daniel_reetz
Re: K'ichee' book digitization
very quick example - color problems, but approaching something sane. in your case the number one thing is to get sharp images, because blur causes the borders of the letters to blend a little with the background. fortunately, it's not all that difficult.
- Attachment: tzonobal_quickly_done_with_levels_tool_in_photoshop.png
Re: K'ichee' book digitization
Using Photoshop's threshold tool, you can get pretty good results. This is the output converted to pdf. This was just a quick demo, so I'm sure you could get better results.
Acrobat's OCR isn't too bad, but it sees "lo" as "10":
20 K'a te b'aa 10 k'uuta chi ri q'ani ama' ak' saqi ama' ak', mi xoto ta chi 10 riib' upa ri utolok' umaske'l, xuya chi k'u 10 jun roq'iyaal keb' roq'iyaal.
21 Te k'u ri' 10, ri q'ani atun ch'ok saqi atun ch'ok, mi xpe chi 10 sin ranima; xtz'itz'ot chi k'u 10 pa ri upurnum k'isiis. xtz'itz'ot chi k'u 10 pa ri upurnum paarki, karaj ne' k'oo 10 pa ri upurnum pa'chaj, k'oo 10 pa ri usook pa ri upache' k .
edit: This is what I get for taking my sweet time--Daniel beat me to it!
- Attachment: tzonobal copy.pdf
- daniel_reetz
Re: K'ichee' book digitization
yeah, but yours is much better.
Re: K'ichee' book digitization
Those threshold examples look fantastic. Yes, I'll definitely give that a shot.
Re: K'ichee' book digitization
Mayan cocktail napkins. Who'd've believed it...