Reading/Writing Same PDF Increases Size

risotoh985
Posts: 7
Joined: Mon Nov 11, 2019 8:57 pm

Post by risotoh985 » Sun Dec 01, 2019 12:28 am

Hi,

I've noticed some strange behavior:
When I load a PDF file that was previously saved by GdPicturePDF, remove all hidden text, run a new OCR pass and save it again, the file gets bigger every time I repeat this.

Here is a short sample code that reproduces this problem:

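    // Each round loads the previous output, strips the hidden text layer,
    // runs OCR again and saves the result under the next file name.
    for (var i = 0; i <= 10; i++)
    {
        // Dispose the instance after every round so no state carries over.
        using (var gdPicturePdf = new GdPicturePDF())
        {
            gdPicturePdf.LoadFromFile($"sample{i}.pdf");
            gdPicturePdf.RemoveHiddenText();

            gdPicturePdf.OcrPages("*", 0, "eng+deu", @"C:\GdPicture.NET 14\Redist\OCR", string.Empty, 300, OCRMode.FavorAccuracy, int.MaxValue, true);
            gdPicturePdf.SaveToFile($"sample{i + 1}.pdf", true, false);
        }
    }
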
I've also attached a full sample project to make this easy to reproduce.

As you will see when running the sample project, "sample1.pdf" (after the first save with GdPicturePDF) is 231 KB, and ten rounds later "sample11.pdf" has grown to 289 KB!
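
For anyone who wants to check the growth per round: 289 KB - 231 KB is 58 KB over ten rounds, so roughly 5.8 KB per round on average. The following is only a minimal sketch for printing the exact per-round difference. It uses nothing but System.IO, assumes the files are named sample0.pdf, sample1.pdf, ... as in the loop above, and the names SizeReport and SizeDeltas are made up for this illustration (they are not part of GdPicture).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SizeReport
{
    // Returns the size change in bytes between each consecutive pair of entries.
    // (Hypothetical helper for this illustration only.)
    public static List<long> SizeDeltas(IReadOnlyList<long> sizes)
    {
        var deltas = new List<long>();
        for (var i = 1; i < sizes.Count; i++)
        {
            deltas.Add(sizes[i] - sizes[i - 1]);
        }
        return deltas;
    }

    public static void Main()
    {
        // Collect the sizes of sample0.pdf, sample1.pdf, ... until a file is missing.
        var sizes = new List<long>();
        for (var i = 0; i <= 11; i++)
        {
            var path = $"sample{i}.pdf";
            if (!File.Exists(path))
            {
                break;
            }
            sizes.Add(new FileInfo(path).Length);
        }

        // Print how much each round added compared to the previous file.
        var deltas = SizeDeltas(sizes);
        for (var i = 0; i < deltas.Count; i++)
        {
            Console.WriteLine($"sample{i}.pdf -> sample{i + 1}.pdf: {deltas[i]} bytes");
        }
    }
}
```

If the per-round difference comes out roughly constant, that would suggest the same kind of object is being added on every save rather than the OCR text itself changing.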

What's the reason for this?
As the hidden text is always cleared before the next OCR round, I would expect the file size to stay the same.
Why does it increase more and more every time?

Thanks
Riso
Attachments:
sample0.pdf (223.59 KiB)
GdPictureTest3.zip (118.65 KiB)

Re: Reading/Writing Same PDF Increases Size

Post by risotoh985 » Mon Dec 02, 2019 6:21 pm

Some additional information on this:
In the meantime I've found out that the size increases because every round embeds one additional font into the PDF file.

In the resulting "sample1.pdf" (after the first round) there is only a single font in the PDF (checked with the free tool PDF-Analyzer):
[Screenshot: GdPictureFontsEmbedded1.PNG]
But in "sample11.pdf" (after the 10th round) there are 10 fonts embedded in the PDF:
[Screenshot: GdPictureFontsEmbedded2.PNG]
I hope this helps to better understand this issue!

Maybe the RemoveHiddenText() function could also remove the embedded font that belongs to the removed hidden text?
Or is there another function to remove all embedded fonts from the document?

Thanks
Rizo
