Creating smaller PDFs

Oliver
Posts: 7
Joined: 19 Dec 2022, 14:14
E-book readers owned: Kindle
Number of books owned: 300
Country: Deutschland

Creating smaller PDFs

Post by Oliver »

Hello everyone,

I am quite new to book scanning and I wasn't expecting that there would be a whole community about that topic. It's gorgeous!
I want to digitise a few books and used a scanner in my university's library. The scans were pretty good, but they were just scans, not digitised books. So I researched a bit and found this forum and ScanTailor, and I have to say this program is fascinating. It does what I want, and it does it with great accuracy.
I have just one problem with ScanTailor: the PDFs are way too large for my use.
A scan of a book with about 100 pages, 20 of them with pictures, comes to about 250 MB. That's huge!
I have already tried to limit the file size by setting the DPI to 300x300, and I am using greyscale for all pages except those with images. I want the images in colour.

So I am wondering whether you can recommend a program or technique (ideally a program that is free to use) to make those files smaller. Could this be done with OCR? It is not important to preserve the exact font of the book.
But I worry that this won't be easy. The books are in Italian and Spanish, mixed with a bit of Latin and a lot of biological terms.

Greetings
Oliver
cday
Posts: 445
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Creating smaller PDFs

Post by cday »

The file size of a PDF depends on the image DPI, the colour mode and the compression mode used when, as is typically the case, the page images are bitmaps (pixel arrays). If the page images can be vectorised, file sizes can often be much smaller. In addition, some more sophisticated software, as far as I know only available commercially, can, depending on the nature of the page content, produce pages with mixed bitmap and vector content, which can reduce file size when the page is saved in a format that supports this, such as PDF or DjVu. For present purposes, we are probably, in the first instance at least, considering your bitmap images output from ScanTailor.

Usually grayscale images have file sizes only slightly smaller than colour images, but file sizes for black and white images when optimum compression is used are typically much smaller. However, when an image is converted to black and white the quality of text may be significantly degraded, due to the loss of 'anti-aliasing' which can subjectively enhance the quality of text rendered in grayscale or colour.
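
To put rough numbers on the dependence on DPI and colour mode, here is a minimal Python sketch of the uncompressed bitmap size of a single page. The 6 x 9 inch page and the function name `raw_page_bytes` are assumptions chosen for illustration, not values from this thread. Compression such as JPEG or CCITT G4 changes the final file size considerably, so the output is only a starting point.

```python
import math


def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one page bitmap, rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px


# Bit depths of the three usual colour modes.
MODES = {"colour (24-bit)": 24, "grayscale (8-bit)": 8, "black and white (1-bit)": 1}

if __name__ == "__main__":
    # Hypothetical 6 x 9 inch page, not taken from the thread.
    for dpi in (150, 300, 600):
        for name, bits in MODES.items():
            size_mib = raw_page_bytes(6, 9, dpi, bits) / 1024 ** 2
            print(f"{dpi:>3} DPI  {name:<24} {size_mib:8.2f} MiB")
```

Doubling the DPI quadruples the pixel count, and before compression a 1-bit page is roughly 1/24 the size of a 24-bit one. After compression the ratios between the modes can look quite different.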

You are presumably using Windows? And do you, maybe at university, have access to any commercial software that could be useful, particularly Adobe Acrobat?

If the images you have from ScanTailor are satisfactory other than for the file size, there should be various ways to reduce the file size reasonably easily with little or no loss of image quality, but understand that files with many pages of images scanned at a reasonable DPI are more or less inevitably fairly large unless very sophisticated processing can be used.

It might be helpful if you are able to upload some representative sample pages, preferably in PDF format, either as attachments, or if too large, via a download link.
cday

Re: Creating smaller PDFs

Post by cday »

cday wrote: 20 Dec 2022, 17:34 It might be helpful if you are able to upload some representative sample pages, preferably in PDF format, either as attachments, or if too large, via a download link.
What might be most useful would be a PDF file with maybe just ten or so pages that represent the typical content of a small book: that would be easy to examine, and to possibly test using alternative compression or other settings. Maybe you could easily create a file like that from some of your existing scans?
Oliver

Re: Creating smaller PDFs

Post by Oliver »

Hi,

Sorry for my late reply. I put some pages together. I forgot to say that this book also has a lot of black and white drawings, and that I actually processed it with the Black/White setting in ScanTailor, not greyscale. I just mixed them up...
Some pages don't look that good; I have already made new and better scans of the book, but I haven't processed them yet.
FND excerpt.pdf
(2.38 MiB)
Furthermore, I am just a student at my university and I don't have access to any programs besides Windows Office.
If you can recommend a program that I can buy, I probably would, provided it is not too expensive; if there isn't one, I would just subscribe to Adobe for a month or so to get these books done.

Greetings
Oliver
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Creating smaller PDFs

Post by BruceG »

Oliver
You do not mention how you want to use the books, i.e. on a computer screen or some other device. The smallest file size comes from keeping the text as text and using images only for the photos.
I do work for an archive where the image of a page is important, so the files are big to very big. 100 years of weekly newspapers with 1000 pages per year make for large files. This is not a problem for modern computers with extra storage.
I use Abbyy Finereader, which recently moved to the subscription model. A free trial is available, and the same is true for Acrobat, which is good for creating an index of all PDF material if you wish to search all the PDFs at the same time.

I have used Abbyy Finereader 15 to process your file. No editing has been done.
Here are the results.
FND excerptAbbyy.pdf
OCR with image on top
(1.37 MiB)
FND excerptAbbyy text.pdf
OCR with text on top
(1.18 MiB)
FND excerptAbbyy.docx
OCR saved to Word
(496.33 KiB)
FND excerptwordtopdf.pdf
Same Word file saved as pdf
(874.62 KiB)
Doing OCR reduces file size.
Converting to Word produced the smallest file; it may also be the best format if you wish to edit the file. I do this in Abbyy if required.
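
For a side-by-side view of these results, a small Python sketch can express each output as a share of the original 2.38 MiB sample. The sizes are the attachment sizes listed in this thread; the helper name `parse_size` and the labels are made up for illustration.

```python
UNITS = {"KiB": 1024, "MiB": 1024 ** 2}


def parse_size(text):
    """Convert a size string such as '2.38 MiB' or '496.33 KiB' to bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


# Attachment sizes as listed in the thread.
ORIGINAL = "2.38 MiB"
OUTPUTS = {
    "OCR, image on top (PDF)": "1.37 MiB",
    "OCR, text on top (PDF)": "1.18 MiB",
    "OCR saved to Word (DOCX)": "496.33 KiB",
    "Word file saved as PDF": "874.62 KiB",
}

if __name__ == "__main__":
    base = parse_size(ORIGINAL)
    # Smallest output first.
    for label, size in sorted(OUTPUTS.items(), key=lambda kv: parse_size(kv[1])):
        share = parse_size(size) / base * 100
        print(f"{label:<26} {size:>11}  {share:5.1f}% of original")
```

This only compares file sizes; it says nothing about how well each format preserves the page images or how accurate the recognised text is.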

At the archive I use an Avision FB6280E book scanner, which can do OCR as part of saving a PDF file. It may be worthwhile to check out what the scanners at uni are able to do. Have you had a look at 'archive.org'? They have many out-of-copyright books, in different formats, which would save you scanning.
Oliver

Re: Creating smaller PDFs

Post by Oliver »

Hi Bruce,

Thank you very much! The result of the text.pdf file is amazing. It's also very convenient that the Word file can be edited. I think I will subscribe to this program once I have scanned all the books I need. Do you know whether the trial version can also do this?

Greetings
Oliver
BruceG

Re: Creating smaller PDFs

Post by BruceG »

Hi Oliver
I had a look at the Abbyy website and there are limitations:
7 days and 100 pages saved.

"7 days full functionality for working with PDF documents like editing, commenting, and document comparison."
"Saving conversion results after applying OCR (including automated conversion in Hot Folder) for 100 pages total."

So you can do everything, but output is limited to 100 pages. My reading is that you could edit and process more and save it as a project file, but you can save only 100 pages as OCR output files such as PDF, Word or Excel.
If you plan to edit with Abbyy, it has dictionaries for the languages of the sample pages, and new words can be added to a user dictionary, as I expect Word allows too.
If you have a scanned book that you are happy with, I could process it for you into different formats so you can see what Abbyy can do.
cday

Re: Creating smaller PDFs

Post by cday »

Oliver, I have had a quick look on my Linux laptop at the file that you uploaded, since most of my PDF software is on my now little-used Windows 7 computer, which I would have to set up. My initial impression is that, as you have used ScanTailor, which is now quite sophisticated software, you have probably used near-optimum compression settings: JPEG for colour page images and, I suspect, CCITT G4 for black and white pages. If so, there is probably little scope for substantially reducing document file sizes when saving the scans as page images.

As I indicated originally, bitmap page images at reasonable resolution are inevitably quite large. However, very tentatively, I think that you might be able to save the colour page images with slightly more compression without losing significant quality when the images are viewed on a screen, and if the black and white pages are not saved as TIFFs with CCITT G4 compression, there should certainly be something to gain there. On the black and white pages I thought that I could see noticeable loss of text quality due to the absence of antialiasing when the colour was removed, but if the text quality meets your needs that should not be an issue.

Saving optimised page images is certainly the direct way to create a viewable copy of a book, and has the advantage of preserving an image of each page that can be viewed later if any query arises about the content of the page. However, if it is desired to make the text searchable as it often is, the process becomes more complicated, and when using camera images the quality of the text characters may mean that substantial time is required to edit the output of the recognition results to obtain acceptable accuracy.

If images of the original pages are not required, then very much smaller files can be created as indicated in BruceG's post above. But I would proceed carefully and be sure to evaluate how much effort would be required to complete your project before committing to any expensive software. As the cost of memory has historically continued to fall steadily, you might find that if searchable text is not a high priority, storing optimised but still large PDF files on a suitable drive is overall the most practical solution. Maybe backed up to the cloud.
cday

Re: Creating smaller PDFs

Post by cday »

Second post today:

I have just read your first post again and noted that you scanned on a flatbed scanner at 300 DPI, which should produce much better images than camera scans. Sorry, I am a bit overloaded at the moment, but will set up my Windows 7 computer when I can clear a space!

With that information, the alternative output formats indicated above should work at least reasonably well, I think, but depending on how much content you have to scan, you should perhaps still think carefully about the time required for proof-reading, if text [largely] free of errors is important.

The scans you are producing should alternatively provide good quality input for the Adobe Acrobat searchable text output option that used to be referred to as 'ClearScan'; I understand that name is no longer used in the current interface. You will also have an image of the original page, and should also get a large reduction in the size of pages containing text. I'll try to test that for you when I have time. But be aware that Adobe Acrobat is expensive, that the minimum subscription period is, I think, one year, and that if there are currently two versions, the lower priced version will probably not support the 'ClearScan' output format.
cday

Re: Creating smaller PDFs

Post by cday »

Third post today:

Showing my enthusiasm for Adobe ClearScan, I have now set up my Windows 7 computer and converted Oliver's test file to ClearScan vectorised text.

Oliver's immediate need is to reduce the file size of his PDFs, and for this sample file containing a number of pages of text plus a smaller number of pages with medium-size colour images, the reduction was from 2.5 MB to 2.1 MB, so going in the right direction but not large. ClearScan works best on documents containing many pages of good quality text, where it can produce dramatic file size reductions.

I also tested the ClearScan option to downsample images from the starting value of 300 DPI to 150 DPI, and as expected there was a further filesize reduction, a more useful one from 2.5 MB to 1.4 MB. Interestingly, the reduction in image quality at 150 DPI was not as great as might be expected, and I think that given the importance of overall filesize, the slight quality loss could be considered acceptable. The loss was most noticeable as would be expected on the image annotated with text, but the text remained readable. If you have a tabbed PDF viewer it should be easy to compare zoomed-in images of both PDF versions.
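
As a rough check on these figures, the Python sketch below compares the change in pixel count from 300 DPI to 150 DPI with the file size reductions quoted in this post. The function names are made up for illustration; the sizes are the rounded MB values given above.

```python
def pixel_fraction(old_dpi, new_dpi):
    """Fraction of pixels left after resampling a bitmap from old_dpi to new_dpi."""
    return (new_dpi / old_dpi) ** 2


def percent_reduction(before, after):
    """Size reduction as a percentage of the starting size."""
    return (before - after) / before * 100


if __name__ == "__main__":
    print(f"Pixels kept at 150 DPI: {pixel_fraction(300, 150):.0%}")
    # File sizes in MB as quoted in this post.
    print(f"ClearScan at 300 DPI: {percent_reduction(2.5, 2.1):.0f}% smaller")
    print(f"ClearScan at 150 DPI: {percent_reduction(2.5, 1.4):.0f}% smaller")
```

Halving the DPI leaves a quarter of the pixels, yet the file shrinks by much less than three quarters. Presumably that is because only the image portions are downsampled while the vectorised text is unaffected; this explanation is an assumption, not something measured here.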

I don't know how important searchability is for Oliver's needs, but checking one small sample of the text output by dragging the 'Select' tool across it, all text in the sample was converted to vector text; when the quality of text is considered insufficient to be confident of recognition accuracy, a small bitmap of that area of the source image is substituted. I haven't checked the actual accuracy of the output due to the foreign language used, but that can be easily done by pasting a sample into a word processor or text editor. Interestingly, as I have seen before, ClearScan sometimes straightens areas of text that are not quite level, or in this case actually slightly curved.

My software is Adobe Acrobat XI Standard, and ClearScan under its new designation has reportedly been further refined in later versions, but importantly the option is now probably only available in the Pro version.

If Oliver finds this approach preferable to the above 'extracted text' options, does not have too many files to convert, and would not require any recognition errors to be detected and corrected, I could possibly do the conversions for him, and if necessary maybe some other forum members with Acrobat might be prepared to assist; it doesn't take long to run a PDF file straight through.

FND excerpt.pdf
(2.38 MiB)
FND excerpt_ClearScan_300DPI.pdf
(2.03 MiB)
FND excerpt_ClearScan_150DPI.pdf
(1.31 MiB)