OCR PDF performance

Support for GdPicture Tesseract Plugin.

Post by mirkop » Mon Nov 07, 2011 6:43 pm

Hi,
I have installed the latest version of GdPicture, 8.4.
I read the text from PDF files, but it is too slow.
I have some files (85 MB in total), and reading the text of all the PDFs takes 2 hours.
For each file I use this code:
Code:
GdPicturePDF oPDF = new GdPicturePDF();
if (oPDF.LoadFromFile(path, false) == GdPictureStatus.OK)
{
    int dimCount = oPDF.GetPageCount();

    for (int i = 1; i <= dimCount; i++)
    {
        // Move to page i (the first page is not selected explicitly).
        if (i > 1)
        {
            Debug("SELECTING PAGE " + i);
            oPDF.SelectPage(i);
        }

        // Render the current page to an image at 200 DPI.
        Debug("RenderPageToGdPictureImage");
        m_ImageID = oPDF.RenderPageToGdPictureImage(200, true);

        Debug("OCRTesseractReinit");
        oGdPictureImaging.OCRTesseractReinit();

        // OCR the rendered page with the Italian dictionary and append the text.
        Debug("OCRTesseractDoOCR");
        s += oGdPictureImaging.OCRTesseractDoOCR(m_ImageID, "ita", _dirOCR, "");
        if (oGdPictureImaging.GetStat() != GdPictureStatus.OK)
            Debug("[" + path + "] Error on page " + i + ": " + oGdPictureImaging.GetStat().ToString());

        Debug("OCRTesseractClear");
        oGdPictureImaging.OCRTesseractClear();
    }
    oPDF.CloseDocument();
}


How can I improve the performance?

Thank you

Mirko
mirkop
 
Posts: 37
Joined: Wed Jun 24, 2009 5:38 pm

Re: OCR PDF performance

Post by Loïc » Tue Nov 08, 2011 5:36 pm

Hi Mirkop,

You can lower the DPI used to render the PDF pages, or OCR your document in a multi-threaded application.

The steps are:

- Split the document (one new document per page)
- In X threads, create one PDF/OCR per page
- At the end, compose a new PDF by merging all the documents produced

We should deliver such a demo application in the next release.
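
In the meantime, the threading part of these steps could be sketched roughly as below. This is only a minimal sketch, not the demo application: `PerPageWorkSketch`, `ProcessPages` and the `processPage` callback are hypothetical names, and the callback stands in for whatever per-page work you do. It assumes each call creates and releases its own OCR objects, so that nothing is shared between threads.

```csharp
using System;
using System.Threading.Tasks;

public static class PerPageWorkSketch
{
    // processPage is a hypothetical placeholder for the real per-page work
    // (for example: render one page, OCR it, return the result).
    // It receives a 1-based page number.
    // Assumption: each call uses its own OCR objects, so no engine state
    // is shared between threads.
    public static T[] ProcessPages<T>(int pageCount, int threadCount, Func<int, T> processPage)
    {
        var results = new T[pageCount];
        var options = new ParallelOptions { MaxDegreeOfParallelism = threadCount };

        // The upper bound is exclusive, so this visits pages 1..pageCount.
        Parallel.For(1, pageCount + 1, options, page =>
        {
            // Each page writes to its own slot: no locking is needed and
            // the results stay in page order for the final merge step.
            results[page - 1] = processPage(page);
        });

        return results;
    }
}
```

A call such as `PerPageWorkSketch.ProcessPages(pageCount, 4, page => DoOnePage(page))`, where `DoOnePage` is your own method, returns one result per page in page order, ready for the merge step. `threadCount` must be a positive number (or -1 for no limit).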

Kind regards,

Loïc
Loïc Carrère, support team.
www.orpalis.com
Loïc
Site Admin
 
Posts: 4445
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: OCR PDF performance

Post by mirkop » Tue Nov 08, 2011 5:57 pm

Hi Loïc,

Thank you for your reply.
I don't need to create a new file, only to extract the OCR text.

Is there a way to improve the performance?

Mirko

Re: OCR PDF performance

Post by mirkop » Thu Nov 10, 2011 4:33 pm

Any suggestions?

Mirko

Re: OCR PDF performance

Post by Loïc » Thu Nov 10, 2011 5:11 pm

Mirko, I already provided two suggestions:

1- Decrease your PDF rendering resolution.

2- Use multiple threads, splitting your input document.

I don't see any other way. I will try to provide a demo for (2) as soon as I can.
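
For (1), a small calculation shows why the resolution matters: the number of pixels the OCR engine has to process grows with the square of the DPI. The sketch below is only illustrative. `RenderSizeSketch` and `PixelCount` are hypothetical names, and it assumes the page size is given in PDF points (1 point = 1/72 inch) and that the rendered size is rounded up to whole pixels, which may differ from what the renderer actually does.

```csharp
using System;

public static class RenderSizeSketch
{
    // PDF page sizes are expressed in points; 1 point = 1/72 inch.
    // Assumption: the rendered bitmap is rounded up to whole pixels.
    public static long PixelCount(double widthPoints, double heightPoints, int dpi)
    {
        long widthPixels = (long)Math.Ceiling(widthPoints / 72.0 * dpi);
        long heightPixels = (long)Math.Ceiling(heightPoints / 72.0 * dpi);
        return widthPixels * heightPixels;
    }
}
```

For an A4 page (595 x 842 points), going from 200 to 150 DPI leaves roughly 56% of the pixels. OCR time will not necessarily drop by the same ratio, and recognition quality may suffer if the resolution is too low, so it is worth testing on a few representative pages.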
Loïc Carrère, support team.
www.orpalis.com