Document sizes after PdfAddGdPictureImageToPdfOCR

Support for the GdPicture Tesseract Plugin.

Post by lbleicher » Thu Dec 22, 2011 7:13 pm

Hi-

My application runs OCR on scanned-image PDFs to create searchable PDF/A output. However, I have noticed that the output of the code below takes up as much as 10x the disk space of the original. Can anyone explain why? Am I missing a step somewhere?

Attached is a sample PDF that goes from 12k before the process to 700k after.

Thanks,
Leo

Code:

        Dict = "eng"

        PdfID = oGdPictureImaging.PdfOCRStart(OutputFilePath, True, "", "", "", "", "DocDigester")
        oGdPictureImaging.OCRTesseractSetPassCount(2)

        If InputPDF.LoadFromFile(pdfPath, False) = GdPicture.GdPictureStatus.OK Then
            For i As Integer = 1 To InputPDF.GetPageCount()
                InputPDF.SelectPage(i)
                ImageID = InputPDF.RenderPageToGdPictureImage(200, True)

                curPageImage = InputPDF.ExtractPageImage(i)
                inPgPD = myPage.GetBitDepth(curPageImage)
                Select Case inPgPD
                    Case 1
                        oGdPictureImaging.ConvertTo1Bpp(ImageID) 'B/W
                    Case 8
                        oGdPictureImaging.ConvertTo8BppGrayScale(ImageID) 'grayscale
                    Case 24
                        'do nothing default is 3x8bit color
                    Case Else
                        oGdPictureImaging.ConvertTo1Bpp(ImageID) 'B/W
                End Select

                Dim pgText As String = oGdPictureImaging.PdfAddGdPictureImageToPdfOCR(PdfID, ImageID, Dict, sciroot & "docdigester\bin\win", "")
                oGdPictureImaging.ReleaseGdPictureImage(ImageID)
                oGdPictureImaging.ReleaseGdPictureImage(curPageImage)
            Next i
        Else
            'report out reason for problem.
            Dim errCode As Integer = InputPDF.GetStat()
        End If
        InputPDF.CloseDocument()
        oGdPictureImaging.PdfOCRStop(PdfID)

Attachment: 123456A.zip (5.96 KiB)

Re: Document sizes after PdfAddGdPictureImageToPdfOCR

Post by Loïc » Thu Dec 22, 2011 7:22 pm

Hello Leo,

If your input PDF is image-based, you should consider replacing:

Code:
ImageID = InputPDF.RenderPageToGdPictureImage(200, True)

with:

Code:
ImageID = InputPDF.RenderPageToGdPictureImageEx(200, True)

Let me know if this is better.

Kind regards,

Loïc
Loïc Carrère, support team.
www.orpalis.com

Re: Document sizes after PdfAddGdPictureImageToPdfOCR

Post by lbleicher » Fri Jan 13, 2012 7:58 pm

Hi Loic-

Thanks for the suggestion, but that does not help. I already had a Select Case statement to convert each page back to its original bit depth (though the RenderPageToGdPictureImageEx method is a better way).

I still have this 11k input PDF coming out as 1148k!!!

Is it possible that the JPEG compression is not being applied? Could this be a result of generating the output as a PDF/A?

How could I make sure compression is being applied to the PDF created by the PdfOCRStart statement?

Thanks,
Leo
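
A rough size estimate may help put the numbers in this thread in context. The sketch below computes the uncompressed size of a page rasterized at 200 DPI (the resolution passed to RenderPageToGdPictureImage above) for the three bit depths handled in the Select Case block. It assumes a US Letter page (8.5 x 11 in), which the thread does not state, and the names RasterSizeEstimate and RawImageBytes are illustrative only; they are not part of GdPicture.

```vbnet
Imports System

Module RasterSizeEstimate
    ' Uncompressed size in bytes of a page rendered to a raster image.
    ' Each row is padded up to a whole byte, which only matters at 1 bpp.
    Function RawImageBytes(widthInches As Double, heightInches As Double,
                           dpi As Integer, bitsPerPixel As Integer) As Long
        Dim widthPx As Long = CLng(Math.Round(widthInches * dpi))
        Dim heightPx As Long = CLng(Math.Round(heightInches * dpi))
        Dim bytesPerRow As Long = (widthPx * bitsPerPixel + 7) \ 8
        Return bytesPerRow * heightPx
    End Function

    Sub Main()
        ' Assumption: US Letter page; the real page size is not given in the thread.
        For Each bpp As Integer In {1, 8, 24}
            Dim bytes As Long = RawImageBytes(8.5, 11.0, 200, bpp)
            Console.WriteLine("{0,2} bpp: {1} bytes", bpp, bytes)
        Next
    End Sub
End Module
```

Under these assumptions a 1700 x 2200 pixel page comes to 468,600 bytes at 1 bpp, 3,740,000 bytes at 8 bpp, and 11,220,000 bytes at 24 bpp before any compression. An output of 700k to 1148k for a single page is therefore in the range one might see if the page image were written with little or no bitonal compression, whereas scanned black-and-white PDFs are often stored with a much tighter codec such as CCITT Group 4 or JBIG2. This is only a plausibility check; it does not show which compression the OCR output actually uses.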