Creating smaller PDFs


cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Creating smaller PDFs [Post edited several times]

Post by cday »

I think that this is basically a question of terminology, with different software packages using different terms to express the same ideas, ideas that may not be easy to express clearly in the few words used to name an option. I have in the past used Abbyy FineReader, Nuance OmniPage and more recently Adobe Acrobat's 'ClearScan', but not recent software versions.

My concept of the available searchable text options is that, when a saved searchable PDF file is open on the screen, normally either the original scanned bitmap image is displayed, or the image on the screen is the recognised text in the bitmap image displayed using a vector font or multiple fonts, in the same way that text in a word processor document is displayed.

When the original bitmap image is displayed, the option is referred to as 'text under the image' or something similar, because the recognised text is not visible. Thinking of the recognised text as being literally 'under' the image is perhaps not helpful, but one way or another when the file is searched identified text is highlighted in the displayed bitmap image. The recognition accuracy of the text layer cannot be seen directly, but can be determined by selecting an area of text and then pasting it into a text editor or word processor. Given a good quality scan, modern software can be expected to have very good recognition accuracy, but on a lower quality image the accuracy might be much lower.

When 'text over image' or a similarly named option is selected, the image displayed on the screen when the file is viewed is, I believe, the recognised text rendered using vector fonts, plus any correctly identified images such as photographs or graphics in the original scans. It will therefore display any text recognition errors. Be aware that the font or fonts seen may not be the fonts used in the original document, as other fonts that match as closely as possible may have been substituted when the original fonts have not, possibly for licensing reasons, been embedded. If the text in the original bitmap scans has been discarded, a substantial reduction in file size can be expected, but those original bitmap images will only be available in future if the original scans are archived for possible reuse later. If the original full scans have not been discarded, the searchable text version will have a larger file size than the original file.

In the case of Adobe Acrobat 'ClearScan', pages are displayed using vector fonts synthesised to closely match the bitmap fonts in the original scans, with the clever twist that when recognition is not considered sufficiently confident, uncertain text is displayed as an area of bitmap image. The full text can therefore be viewed, unlike in the above description of viewing a 'text over image' PDF file. The original scans are discarded, so minimal file sizes can be obtained with generally excellent quality when good quality scans are used as input.

The above is my basic understanding. Perhaps BruceG can, if necessary, elaborate on it based on his more recent use of the software, and confirm the actual terms used for the various output options in Abbyy FineReader and Nuance OmniPage.

Written with a heavy cold, and edited several times without yet being able to post definitive information... ;)
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Creating smaller PDFs

Post by BruceG »

ABBYY FineReader uses the terms Text on Top or Image on Top. It has been a long time since I used OmniPage Ultimate; it uses terms like front or behind, but I think they refer to overlaying different elements on the main layer.
I did some tests to see what fonts ABBYY, OmniPage and ClearScan produced in the PDF outputs.
The original document is a double page of text from a 1912 book about early gold findings in Victoria. The output files are as the programs produced them, with settings from previous work. No editing.
5KB 7KB - Some of the format is also lost. 345KB
The story of Browns and Scarsdale 1912 ClearScan.pdf (344.33 KiB)
5,641KB 115KB
Details of fonts used:
Fonts in processed doc.docx (13.43 KiB)
ClearScan fonts were embedded, which would make the file larger. OmniPage and ABBYY used TrueType fonts.
I knew OmniPage produced small files but was surprised by the 5KB, which included a background colour to match the original scan.

At different times I do both Image on Top and Text on Top. The difference in file size is not always the same; sometimes the sizes are more similar.
cday

Re: Creating smaller PDFs

Post by cday »

Thank you for your detailed tests performed with your usual diligence. I am still suffering from a heavy cold, or very probably actually a Covid infection, but you can be sure that with my usual diligence I am well vaccinated, and actually delayed my most recent booster to better correspond with the period of peak risk during the northern hemisphere winter.

First, the page you used for your comparative tests is with hindsight in one respect an unfortunate choice, since it consists solely of text, and does not contain any images. That seems important in the context of these tests, where the overall file sizes of alternative saved versions of searchable documents, such as Oliver's multipage files of mixed content, are concerned.

Regarding the naming of the alternative software output modes, I feel that the name of the Abbyy output option that you named 'Image On Top' could be considered misleading, as my feeling, which is confirmed by the much smaller size of the file relative to the size of the alternative Abbyy output file you named 'Text on Top', is that the original scanned bitmap image is *not* present in the file. That of course significantly reduces the file size, but also means that unless the text that is displayed has been thoroughly proof-read, the displayed text may not be entirely accurate: it could contain typos or, in the case of a poor scan, even some areas of near gibberish. No doubt the names you used closely reflect Abbyy's names.

The OmniPage 19 output file is interesting and, from the very small file size that you commented on yourself, the original scanned bitmap image again cannot be included. It is also interesting in that a seemingly sensible decision has been made to represent the varying coloured background of the original page, due, I think, to ageing, as a more pleasing uniform colour, which should certainly to some extent have further reduced the file size. I don't know if that is a selectable option. On the other hand, since the displayed text is the text as recognised by the OCR process, any recognition errors that have not been picked up, if the output has actually been proof-read, will also be displayed.

With respect to fonts in the Adobe Acrobat 'ClearScan' file, my understanding is that they are 'fonts' created as needed as the original bitmap image is analysed, and may possibly simply relate to identified 'glyphs' rather than always conventional text characters, if that is a suitable term. As far as I am aware the fonts, of which there may be a large number listed in a longer document, may simply enable the original bitmap image to be displayed in a vectorised, scalable form. If you zoom in on an open 'ClearScan' PDF page, the characters rescale perfectly rather than revealing individual pixels, as would be seen when zooming in on a bitmap image. The fonts listed cannot, as far as I know, be extracted from the file for examination.

The 'ClearScan' file posted preserves the background colouration of the original, which might not be wanted; possibly there is an option to suppress that, if not in your version then in a later Acrobat version? That should, again, slightly reduce the file size. The file properties shown for your example 'ClearScan' file seem to indicate that it was created using Adobe Acrobat 9, the first version, I think, that provided a 'ClearScan' output option. My Acrobat version is Acrobat Standard XI, and I believe that successive versions refined the 'ClearScan' output produced, the current version attempting to identify the original fonts used and presumably embed them, which could potentially result in a further useful reduction in file size. I put 'ClearScan' in quotes because use of that name was reportedly discontinued in a later Acrobat version.

My advice for someone in Oliver's position would be, when searchable text is needed and good quality flatbed scans are available, to consider how important identification of any text recognition errors is, and how much time can be spent on identifying and correcting any errors that occur. In favourable circumstances I would think that only minimal checking might be needed, enabling good quality output to be obtained with possibly substantially reduced effort. In those circumstances, use of a form of output that does not retain the original bitmap scan could substantially reduce the resulting file sizes obtained when a file contains a significant number of images. On the other hand, if there could for example later be a need to check the spelling of an unusual word or name, an image of the original page could be desirable.

It would also probably be best, when pages of an original document have no background colour, to start text recognition with images that have a clean white background, either as output from ScanTailor if used, or alternatively using an image enhancement option in the software used for text recognition.
BruceG

Re: Creating smaller PDFs

Post by BruceG »

As regards my last post, Acrobat ClearScan has 4 different settings for what it calls Downsample (72, 150, 300 & 600 dpi).
Here is an example of each:
Pages from ScarsdaleClearScan600.pdf (550.52 KiB)
Pages from ScarsdaleClearScan300.pdf (323.1 KiB)
Pages from ScarsdaleClearScan150.pdf (154.39 KiB)
Pages from ScarsdaleClearScan72.pdf (103.94 KiB)
My version of Acrobat is 9 Pro so things may have changed since then.

I also removed most of the text from the ABBYY Text on Top file, leaving only the page numbers, which saved 9KB.
So I think it does sit on a bitmap or similar layer.
BruceG

Re: Creating smaller PDFs

Post by BruceG »

As I like real life examples, I have chosen a book that is still in copyright and is here only for the purpose of showing how different programs process PDFs. It is another book about what was a mining area, and it also has white pages, which I do not often work with.
Besides ClearScan, Acrobat can also reduce the size of PDFs; one file each from Adobe ClearScan, OmniPage and ABBYY FineReader has been through this process.
The original file was 74,654KB.
The ABBYY file was edited (as part of my Archive work); no editing was done in the OmniPage or ClearScan files. I do not know if Acrobat allows editing in newer versions.
File order is by file size.
Staghorn Flat ClearScansDownsample600.pdf (28.74 MiB)
Staghorn Flat ABBYY Image on Top.pdf (23.85 MiB)
Staghorn Flat ABBYY Text on Top.pdf (23.75 MiB)
Staghorn Flat ClearScansDownsample300.pdf (17.6 MiB)
Staghorn Flat ClearScansDownsample150.pdf (7.25 MiB)
Staghorn Flat ABBYY Text on TopSizeReduction.pdf (5.21 MiB)
Staghorn Flat Omnipage.pdf (5.1 MiB)
Staghorn Flat ClearScansDownsample72.pdf (3.8 MiB)
Staghorn Flat OmnipageSizeReduction.pdf (2.45 MiB)
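
For anyone wanting a rough comparison, the sizes listed above can be expressed as a percentage of the 74,654KB original. The short Python sketch below does that. It assumes that 'KB' for the original means KiB (1024 bytes), which is a guess, so the percentages are approximate. The names `ORIGINAL_KIB`, `SIZES_MIB` and `percent_of_original` are made up for illustration.

```python
# Sizes copied from the attachment list above (in MiB).
ORIGINAL_KIB = 74654  # original file; assumes "KB" means KiB

SIZES_MIB = {
    "ClearScan downsample 600": 28.74,
    "ABBYY Image on Top": 23.85,
    "ABBYY Text on Top": 23.75,
    "ClearScan downsample 300": 17.6,
    "ClearScan downsample 150": 7.25,
    "ABBYY Text on Top, size reduction": 5.21,
    "OmniPage": 5.1,
    "ClearScan downsample 72": 3.8,
    "OmniPage, size reduction": 2.45,
}


def percent_of_original(size_mib, original_kib=ORIGINAL_KIB):
    """Return a size in MiB as a percentage of the original size in KiB."""
    return 100.0 * (size_mib * 1024) / original_kib


for name, size in SIZES_MIB.items():
    print(f"{name:36s} {size:6.2f} MiB  {percent_of_original(size):5.1f}%")
```

On these figures the outputs range from roughly 39% of the original down to a little over 3%, although the files differ in more than size, so the percentages say nothing about quality.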
I am no expert, so there may well be other ways to reduce file size if that is important or required for different purposes. The end purpose needs to be determined first, and then you can work out how to achieve it. Quality vs size is one of those things that changes depending on how the material is to be used.

A worthwhile exercise.
dpc
Posts: 379
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: Creating smaller PDFs

Post by dpc »

BruceG wrote: 28 Dec 2022, 23:12 A worthwhile exercise.
Storage is pretty cheap these days, as is Internet bandwidth. Smaller files are certainly better, but you have to weigh the cost in time and effort to produce them.
cday

Re: Creating smaller PDFs

Post by cday »

dpc wrote: 29 Dec 2022, 04:24
BruceG wrote: 28 Dec 2022, 23:12 A worthwhile exercise.
Storage is pretty cheap these days, as is Internet bandwidth. Smaller files are certainly better, but you have to weigh the cost in time and effort to produce them.
Yes, but I think anyone contemplating a reasonably large scanning project, and particularly one where the scanned material will contain many images, could benefit from gaining some insight into the often rather opaquely-named software output options. That insight could make it clearer how searchable text will be displayed, including the possibility of it containing text recognition errors, and how alternative output options will affect overall file size. And time invested at an early opportunity could also benefit any future projects that might be undertaken.

Actually, I have long felt that there is a need for a tutorial that comprehensively sets out these matters once and for all. This and similar threads could be of some help to others facing the same issues and uncertainties in the future, but we are still really not in a position without further tests and analysis to provide that information, even in this increasingly lengthy thread which is unlikely to be able to set out definitive guidance in a clear and easily read form.
dpc

Re: Creating smaller PDFs

Post by dpc »

Sure. I should clarify that I was speaking strictly from a file size standpoint.
cday

Re: Creating smaller PDFs

Post by cday »

BruceG wrote: 28 Dec 2022, 22:37 Acrobat ClearScan has 4 different settings for what it calls Downsample (options 72, 150, 300 & 600 dpi)
When a bitmap image is downsampled, the number of pixels used to display the image is reduced without changing the displayed size of the image on the screen. The detail contained in the image is therefore reduced, which may or may not actually reduce the perceived quality of the displayed image on the screen. To as far as possible minimise any loss of perceived quality, images are typically downsampled by a factor that is a power of two, the image pixel dimensions being divided for example by 2 or 4. Downsampling has no relevance to the perceived quality of vector text or graphics, as character outlines automatically resize smoothly.

Downsampling an image can substantially reduce the image's contribution to the total size of the file, and so can be particularly beneficial when a file contains multiple images. But note that when a page has been scanned at 300 DPI, for example, enabling a downsampling option of 600 DPI, if that value is available, should have no effect on images that are lower resolution.
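
As a rough illustration of the arithmetic described above, the Python sketch below works out the pixel dimensions of a page image after downsampling. The function name `downsampled_size` and the A4 page dimensions are hypothetical, chosen only for the example, and the figures are pixel counts rather than file sizes, since the actual saving also depends on the compression used.

```python
def downsampled_size(width_px, height_px, source_dpi, target_dpi):
    """Return the pixel dimensions after downsampling to target_dpi.

    An image whose resolution is already at or below the target is
    returned unchanged, since downsampling never adds pixels.
    """
    if target_dpi >= source_dpi:
        return width_px, height_px
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)


# Hypothetical A4 page scanned at 300 DPI (about 8.27 x 11.69 inches).
page = (2480, 3508)
for target in (600, 300, 150, 72):
    w, h = downsampled_size(*page, source_dpi=300, target_dpi=target)
    share = w * h / (page[0] * page[1])
    print(f"{target:3d} DPI: {w} x {h} px, {share:.0%} of the original pixels")
```

Halving the resolution leaves only a quarter of the pixels, because both dimensions are halved, which is why downsampling can have such a large effect on image-heavy files.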

My experience when converting good quality scanned PDF files to 'ClearScan', for which the option is most suited, is that when a file is downsampled, the text in the file (which in a scanned original is also before conversion stored as a bitmap) often shows little or no obvious loss of quality when downsampled to 150 DPI. That is perhaps surprising in comparison with text when displayed as a bitmap. Equally, when scanning good quality pages that have no images, scanning at 300 DPI, which will generally take longer, may not result in higher quality 'ClearScan' text.

BruceG wrote: I also removed most of the text from the ABBYY text on top only leaving page numbers which saved 9KB. So I think it does sit on a bitmap or similar layer.
The story of Browns and Scarsdale 1912TexrPageNo only.pdf

The PDF format is a 'page description language' that allows the layout and content of a page to be described in a way that enables the page to be displayed as a bitmap on a screen, or understood by a printer. Text on the page, for example, can be specified by the position, size, colour and other attributes of the text. Images to be included on the page are specified by the location and size of a rectangle in which they are to be displayed, plus the image content itself, in the form of either a bitmap matrix or shapes defined by vector graphic descriptions.
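
To make that description a little more concrete, the Python sketch below builds the kind of page content stream commonly used for a searchable scanned page: a full-page image, plus a line of text drawn with text rendering mode 3, which the PDF specification defines as neither filled nor stroked, so the text is invisible but can still be searched and selected. This is only an illustration of the general technique; I do not know whether ABBYY, OmniPage or Acrobat lay out their files this way. The resource names `/Im0` and `/F1`, the function names and the page size are all hypothetical.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def page_content(text, x, y, font_size, page_w, page_h, invisible=True):
    """Build a content stream: a full-page image plus one line of text.

    /Im0 and /F1 are hypothetical resource names that a real file would
    have to define in the page's resource dictionary.
    """
    render_mode = 3 if invisible else 0  # 3 = neither fill nor stroke
    lines = [
        "q",                                # save the graphics state
        f"{page_w} 0 0 {page_h} 0 0 cm",    # scale the unit square to the page
        "/Im0 Do",                          # paint the scanned image
        "Q",                                # restore the graphics state
        "BT",                               # begin a text object
        f"/F1 {font_size} Tf",              # select font and size
        f"{render_mode} Tr",                # set the text rendering mode
        f"{x} {y} Td",                      # position the text
        f"({escape_pdf_string(text)}) Tj",  # show the text
        "ET",                               # end the text object
    ]
    return "\n".join(lines)


print(page_content("The story of Browns and Scarsdale", 72, 720, 12, 595, 842))
```

Calling the function with `invisible=False` gives the ordinary fill mode instead, which is closer to the case where the recognised text itself is what is seen on the page.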

When a software output option is selected that produces a very small file size, I think that you can be sure that if the file contains any bitmap content, it is confined to representing only any very small elements on the page, such as possibly in this case the remaining page numbers.
cday

Re: Creating smaller PDFs

Post by cday »

@Oliver: Have you had any more thoughts on your proposed project?

It seems that both Adobe Acrobat and Abbyy FineReader are currently available on monthly plans as an alternative to the standard 12-month plans, needless to say at higher monthly rates. In addition, Adobe Acrobat seems to be available as a 'Student and Teacher' version at a reduced price, although probably only on an annual plan.

Someone who is organised and well-prepared might therefore be able to complete a project, when they have time available, using optimal tools at a reasonable cost, or possibly even within the Adobe one-week free trial period! The software documentation is all available online, although it is possibly not the easiest read. My earlier Acrobat XI Standard version, I think, supports batch conversion of source PDF files, so if the source files are of sufficient quality to avoid the need in most cases to correct OCR errors, multiple files could be loaded and left to run to completion before final checking.