Reading/Writing Same PDF Increases Size

Post by risotoh985 » Sun Dec 01, 2019 12:28 am


I've noticed a very strange behavior:
When I load a PDF file that was previously saved by GdPicturePDF, remove all hidden text, run a new OCR pass, and save it again, the file gets bigger every time I repeat this.

Here is a short sample code that reproduces this problem:

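            for (var i = 0; i <= 10; i++)
            {
                var gdPicturePdf = new GdPicturePDF();

                // The calls that load the previously saved file and remove its
                // hidden text belong here; they are missing from this listing.

                // Run OCR on all pages (English + German dictionaries, 300 DPI).
                gdPicturePdf.OcrPages("*", 0, "eng+deu", @"C:\GdPicture.NET 14\Redist\OCR", string.Empty, 300, OCRMode.FavorAccuracy, int.MaxValue, true);
                gdPicturePdf.SaveToFile($"sample{i + 1}.pdf", true, false);
            }
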
I've also attached a full sample project to make this easy to reproduce.

As you will see when running the sample project, "sample1.pdf" (after the first save with GdPicturePDF) is 231 KB, and after the 10th iteration the file "sample11.pdf" has grown to 289 KB!
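
To make the growth easier to see round by round, a minimal sketch like the following could print the size of each output file and the change from the previous one. It assumes the files are named sample1.pdf to sample11.pdf in the working directory, as in the loop above; the class and method names (SizeReport, SizeOf) are made up for this illustration and are not part of GdPicture.

```csharp
using System;
using System.IO;

static class SizeReport
{
    // Size of one file in bytes (hypothetical helper, plain System.IO).
    public static long SizeOf(string path) => new FileInfo(path).Length;

    static void Main()
    {
        long? previous = null;
        for (var i = 1; i <= 11; i++)
        {
            var path = $"sample{i}.pdf";
            if (!File.Exists(path)) continue; // skip rounds that were not produced

            long size = SizeOf(path);
            // Signed difference to the previous existing file, if there is one.
            string delta = previous.HasValue
                ? $"{size - previous.Value:+#;-#;0} bytes"
                : "n/a";
            Console.WriteLine($"{path}: {size / 1024.0:F1} KB (change: {delta})");
            previous = size;
        }
    }
}
```

If the increase per round is roughly constant, that points to a fixed-size object being added on each save rather than to the page content itself growing.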

What's the reason for this?
As the hidden text is always cleared before the next OCR round, I would expect the file size to stay the same.
Why does it increase more and more every time?



Re: Reading/Writing Same PDF Increases Size

Post by risotoh985 » Mon Dec 02, 2019 6:21 pm

Some additional information on this:
In the meantime I've found out that the size increases because every round embeds one additional font into the PDF file.

In the resulting "sample1.pdf" (after the first round) there is only a single font in the PDF (checked with the free tool PDF-Analyzer):
[Image: GdPictureFontsEmbedded1.PNG]
But in "sample11.pdf" (after the 10th round) there are already 10 fonts embedded in the PDF:
[Image: GdPictureFontsEmbedded2.PNG]
I hope this helps to better understand this issue!
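
For anyone without PDF-Analyzer, a rough cross-check is possible with a few lines of code. In a PDF, an embedded font program is referenced from its font descriptor through a /FontFile, /FontFile2 or /FontFile3 key, so counting those keys in the raw file gives an approximate number of embedded fonts. This is only a heuristic sketch: it assumes the font descriptors are stored uncompressed (it will undercount if they sit inside compressed object streams), and the names FontCounter and CountFontFileKeys are invented for this example.

```csharp
using System;
using System.IO;
using System.Text;

static class FontCounter
{
    // Counts occurrences of "/FontFile" in the raw bytes; this prefix also
    // matches the /FontFile2 and /FontFile3 keys.
    public static int CountFontFileKeys(byte[] pdfBytes)
    {
        // ISO-8859-1 maps every byte to exactly one char, so no data is lost.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes);
        const string key = "/FontFile";
        int count = 0, index = 0;
        while ((index = text.IndexOf(key, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += key.Length;
        }
        return count;
    }

    static void Main(string[] args)
    {
        // Usage: pass one or more PDF paths, e.g. sample1.pdf sample11.pdf
        foreach (var path in args)
        {
            int n = CountFontFileKeys(File.ReadAllBytes(path));
            Console.WriteLine($"{path}: {n} embedded font reference(s)");
        }
    }
}
```

Running it on "sample1.pdf" and "sample11.pdf" should show whether the count rises with each round, which can then be compared with what PDF-Analyzer reports.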

Maybe the function that removes the hidden text could also remove the embedded font that belongs to the removed text?
Or is there another function to remove all embedded fonts from the document?
