NullReferenceException when doing PDF OCR

Discussions about Tesseract OCR integration in GdPicture.
attila1977
Posts: 5
Joined: Fri Jan 20, 2017 4:54 am

NullReferenceException when doing PDF OCR

Post by attila1977 » Mon Feb 13, 2017 5:23 am

I'm doing bulk PDF-to-PDF OCR conversion using GdPicture.NET SDK v12.0.57.

My application runs OCR on image PDF files one by one, following a predefined file list.
It runs on Windows Server 2008 R2 with two quad-core processors and 15 OCR threads.
The application crashed after OCRing 40,000 pages, and I saw this error message in Event Viewer:

Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException
Stack:
at gdpicture_᠗.gdpicture_ᜀ(GdPicture12.Internal.Imaging.GdPictureBitmap ByRef, Int32, Int32, GdPictureRAWColorPalette, Byte[], Byte[])
at gdpicture_ហ.gdpicture_ᜀ(Byte[], gdpicture_᠖)
at gdpicture_ហ.gdpicture_ᜂ()
at gdpicture_ហ.gdpicture_ᜀ(gdpicture_ឧ, Boolean ByRef)
at gdpicture_ស.gdpicture_ᜀ(Int32, gdpicture_ᠯ ByRef)
at GdPicture12.GdPicturePDF.ExtractPageImage(Int32)
at GdPicture12.GdPicturePDF.gdpicture_ᜀ(Int32 ByRef, Boolean, Boolean, Boolean, Boolean, Boolean)
at GdPicture12.GdPicturePDF.gdpicture_ᜀ(Int32, System.String, System.String, System.String, Single, Boolean, gdpicture_២, Boolean, Int32)
at GdPicture12.GdPicturePDF.gdpicture_ᜀ(System.String, System.String, System.String, System.String, Single, gdpicture_២, Int32)
at GdPicture12.GdPicturePDF+ᜁ.gdpicture_()
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.ThreadHelper.ThreadStart()

David
Posts: 66
Joined: Mon Feb 08, 2016 3:12 pm

Re: NullReferenceException when doing PDF OCR

Post by David » Tue Feb 14, 2017 3:03 pm

Hi,

Thank you for contacting us.

May I ask you to share the required material, including the input PDFs and a code snippet, so I can reproduce the issue on my end?

Thank you

David

attila1977
Posts: 5
Joined: Fri Jan 20, 2017 4:54 am

Re: NullReferenceException when doing PDF OCR

Post by attila1977 » Wed Feb 15, 2017 5:00 am

Hi David,
Thank you for your reply.
I'm sorry, but I can't provide the image PDFs because they are confidential documents.
Basically, these PDF files contain 300-800 pages of scanned images, with page sizes from A4 to A3.
Most of them are black and white, but some PDFs may contain 10% 24-bit color images.
We use our application to run a bulk OCR process on these image PDFs; there are around 1000 PDF files in total. We created a PDF list, and our application runs OCR according to this list.
We noticed that an error occurred after the application had been running for more than 12 hours. GdPicture did not throw an exception but crashed the process directly; we can only see the error message in Windows Event Viewer. When we ran the application again, it continued to do OCR without problems and lasted for another 10 hours or even 2 days. The error is random and I don't know how to prevent it.

Below is my code snippet:

Code: Select all

public void Start() {
            lock (this) {
                CanStartOCR = false;
                if (_OCREntity != null)
                {
                    Console.WriteLine(DateTime.Now + " OCR Start");
                    _nativePdf.OcrPagesProgress += _nativePdf_OcrPagesProgress;
                    _nativePdf.OcrPagesDone += _nativePdf_OcrPagesDone;
                    if (_nativePdf.LoadFromFile(_OCREntity.OCRFilePath, false) == GdPictureStatus.OK)
                    {
                        string ocrlanguage = "eng";
                        if (_OCREntity.OCRLanguage != null && !_OCREntity.OCRLanguage.Equals(""))
                            ocrlanguage = _OCREntity.OCRLanguage;

                        if (OCRLanguageCheck) {
                            ocrlanguage = OCRLanguageText;
                        }
                        var status = _nativePdf.OcrPages("*", _OCREntity.OCRThreadMaxCount, ocrlanguage, _OCREntity.OCRPath, "", 300);
                        
                   }
                }
                else {
                    Console.WriteLine(DateTime.Now + " OCR No Start");
                    CanStartOCR = true;
                }                
            }
        }
        
        private void _nativePdf_OcrPagesDone(GdPictureStatus Status)
        {
            _nativePdf.OcrPagesProgress -= _nativePdf_OcrPagesProgress;
            _nativePdf.OcrPagesDone -= _nativePdf_OcrPagesDone;
            Console.WriteLine(DateTime.Now+" Page Done" + Status.ToString()+"{"+ _OCREntity._fileName+ "}");
            if (Status == GdPictureStatus.OK)
            {
                Status = _nativePdf.SaveToFileInc(_OCREntity.OCROutputPath);
                if (Status == GdPictureStatus.OK)
                {
                    _nativePdf.CloseDocument();
                    _nativePdf.ClosePath();
                    CanStartOCR = true;

                    Document doc = _Job.Document;
                    doc.OCRPDF = true;
                    doc.OCRPDFName = _OCREntity.OCROutputFileName;
                    doc.OCRPDFPath = _OCREntity.OCROutputPath;
                    doc.OCRPDFRootPath = _OCREntity.OCROutputRootPath;
                    doc.Status = Constant.BatchStatus.Completed;
                    doc.Station = Constant.BatchStation.PDF;

                    _OCRBuilder.OCRCompleted(_Job, doc);
                    _OCRBuilder.ExportWaiting(_Job);

                    OnOCRPagesDoneRequest(Status.ToString(), _PageNo, _Processed, _Count);
                }
            }            
                  
        }
        private int _PageNo ;
        private int _Processed;
        private int _Count;
        private void _nativePdf_OcrPagesProgress(GdPictureStatus Status, int PageNo, int Processed, int Count, ref bool Cancel)
        {
            _PageNo = PageNo;
            _Processed = Processed;
            _Count = Count;
            OnOCRPagesProgressRequest(Status.ToString(), PageNo, Processed, Count);
        }

David
Posts: 66
Joined: Mon Feb 08, 2016 3:12 pm

Re: NullReferenceException when doing PDF OCR

Post by David » Thu Feb 16, 2017 4:40 pm

Hi,

Looking at your code, I can see a resource leak.

Please have a look at the _nativePdf_OcrPagesDone method. If the character recognition engine fails to read the document for some reason (document too large, not enough memory, etc.), the Status parameter may be different from GdPictureStatus.OK. In that case the software never calls CloseDocument and therefore never releases the memory used by the object.
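
A minimal sketch of that fix: close the document in a finally block so it is released on every path. FakePdf and the Status enum below are hypothetical stand-ins so the example runs without the SDK; only the method names SaveToFileInc and CloseDocument are taken from the code posted above, and the real handler would call them on _nativePdf.

```csharp
using System;

// Stand-in for the SDK's status enum; only the values needed here are modeled.
public enum Status { OK, GenericError }

// Hypothetical stand-in for the PDF object (not the GdPicture API).
public sealed class FakePdf
{
    public bool IsOpen { get; private set; } = true;
    public Status SaveResult { get; set; } = Status.OK;

    public Status SaveToFileInc(string path) => SaveResult;
    public void CloseDocument() => IsOpen = false;
}

public static class OcrDoneHandler
{
    // Same shape as _nativePdf_OcrPagesDone: save only on success,
    // but close the document whether OCR succeeded, failed, or threw.
    public static bool Handle(FakePdf pdf, Status ocrStatus, string outputPath)
    {
        try
        {
            if (ocrStatus != Status.OK)
            {
                Console.WriteLine("OCR failed: " + ocrStatus);
                return false;
            }

            Status saveStatus = pdf.SaveToFileInc(outputPath);
            if (saveStatus != Status.OK)
            {
                Console.WriteLine("Save failed: " + saveStatus);
                return false;
            }

            return true;
        }
        finally
        {
            // Always runs, so the document is never left open.
            pdf.CloseDocument();
        }
    }
}
```

In the posted handler, CanStartOCR = true also sits inside the success branch, so it would probably need the same treatment for the next file to start after a failure.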

We are used to dealing with confidential information. If you wish, we can sign an NDA so that you can provide your document and we can reproduce the issue on our end.

Regards,

David

attila1977
Posts: 5
Joined: Fri Jan 20, 2017 4:54 am

Re: NullReferenceException when doing PDF OCR

Post by attila1977 » Tue Feb 21, 2017 4:57 am

Hi David,

Thank you for your reply.

I created a dummy image PDF to reproduce the error in my test environment. The same error occurred again.
I made 50 copies of this image PDF (https://drive.google.com/file/d/0BxI_4n ... sp=sharing), put them into a folder, and let my application process them one by one.
The error occurred while the application was processing the 27th PDF.
My test environment: Windows Server 2008 R2, 1 x E5410 quad-core CPU, 32 GB RAM, GdPicture.NET SDK v12.0.57, 64-bit platform, 3 OCR threads.

I hope it can help you to reproduce the error on your end.

Thanks.

David
Posts: 66
Joined: Mon Feb 08, 2016 3:12 pm

Re: NullReferenceException when doing PDF OCR

Post by David » Thu Feb 23, 2017 10:59 am

Hi,

I'm sorry but I'm not able to reproduce the issue with the latest GdPicture.NET 12.

May I ask you to update and confirm the latest GdPicture.NET 12 solves the issue?

I'm looking forward to hearing from you.

David

attila1977
Posts: 5
Joined: Fri Jan 20, 2017 4:54 am

Re: NullReferenceException when doing PDF OCR

Post by attila1977 » Thu Mar 02, 2017 4:31 am

Hi David,

I've updated to the latest GdPicture.NET 12.
The issue still remains.

Cedric
Posts: 263
Joined: Sun Sep 02, 2012 7:30 pm

Re: NullReferenceException when doing PDF OCR

Post by Cedric » Fri Mar 03, 2017 11:35 am

Hello,

We are still trying to reproduce the issue but without success for the moment.
We are going to let the process run over the weekend to see if it happens on a long run.
In any case we will let you know the result.

Cedric
Posts: 263
Joined: Sun Sep 02, 2012 7:30 pm

Re: NullReferenceException when doing PDF OCR

Post by Cedric » Mon Mar 06, 2017 10:42 am

Hi,

We are still unable to reproduce the issue even with very long runs.
Could you please share a reduced application that we can run as-is?

attila1977
Posts: 5
Joined: Fri Jan 20, 2017 4:54 am

Re: NullReferenceException when doing PDF OCR

Post by attila1977 » Tue Mar 07, 2017 6:17 am

Hi Cedric,
Thanks for your help, I will prepare a reduced application.

benedikt
Posts: 1
Joined: Wed Aug 08, 2018 11:57 am

Re: NullReferenceException when doing PDF OCR

Post by benedikt » Wed Aug 08, 2018 12:01 pm

I got this issue when disposing the imaging and PDF instances before the OCR process had finished. My solution for now is to set the sync option to true.

It is the last parameter here:

Code: Select all

pdfInstance.OcrPages("*", 0, language, GdPictureHelper.OCRDirectory, "", resolution, 0, true);
Complete code, which causes the error:

Code: Select all

        public byte[] Convert(byte[] data, bool embeddOCRText = true, string language = "deu")
        {
            byte[] pdf = null;

            using (var pdfInstance = GdPictureHelper.GetPDFInstance())
            {
                using (var gdPictureImaging = GdPictureHelper.GetImagingInstance())
                {
                    int imageId = gdPictureImaging.CreateGdPictureImageFromByteArray(data);
                    if (gdPictureImaging.GetStat() == GdPictureStatus.OK)
                    {
                        float resolution = System.Math.Max(200, gdPictureImaging.GetVerticalResolution(imageId));
                        var state = pdfInstance.NewPDF(embeddOCRText);

                        if (state == GdPictureStatus.OK)
                        {
                            for (int i = 1; i <= gdPictureImaging.GetPageCount(imageId); i++)
                            {
                                if (gdPictureImaging.SelectPage(imageId, i) == GdPictureStatus.OK)
                                {
                                    var addImageResult = pdfInstance.AddImageFromGdPictureImage(imageId, false, true);
                                }
                            }

                            pdfInstance.OcrPages("*", 0, language, GdPictureHelper.OCRDirectory, "", resolution, 0, true);

                            using (var stream = new MemoryStream())
                            {
                                pdfInstance.SaveToStream(stream);
                                stream.Position = 0;
                                pdf = stream.ToArray();
                            }
                        }
                        else
                        {
                            throw new Exception($"Could not convert document. State: {state}");
                        }
                    }
                    else
                    {
                        throw new Exception("Could not create gdpicture imaging instance");
                    }

                    // Close pdf document
                    pdfInstance?.CloseDocument();

                    // Release gdpicture image
                    gdPictureImaging.ReleaseGdPictureImage(imageId);
                }
            }

            return pdf;
        }
The last two parts (CloseDocument and ReleaseGdPictureImage) can be skipped as far as I know.
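
If the asynchronous mode is still wanted, another option would be to block until the completion event has fired before the using block ends. Below is a minimal sketch of that idea. FakeOcrEngine and its OcrDone event are hypothetical stand-ins, not the GdPicture API; in the SDK, the completion event used earlier in this thread is OcrPagesDone.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an engine that runs OCR on a background thread
// and raises an event when it finishes (not the GdPicture API).
public sealed class FakeOcrEngine : IDisposable
{
    private volatile bool _disposed;

    // Argument is true if the engine was still alive when the work finished.
    public event Action<bool> OcrDone;

    public void StartOcr()
    {
        var worker = new Thread(() =>
        {
            Thread.Sleep(50);             // simulated OCR work
            OcrDone?.Invoke(!_disposed);  // false: the engine was disposed mid-run
        });
        worker.IsBackground = true;
        worker.Start();
    }

    public void Dispose() => _disposed = true;
}

public static class OcrRunner
{
    // Starts the background OCR and waits for the completion event,
    // so Dispose only runs after the worker is done with the engine.
    public static bool RunAndWait()
    {
        var done = new TaskCompletionSource<bool>();
        using (var engine = new FakeOcrEngine())
        {
            engine.OcrDone += ok => done.TrySetResult(ok);
            engine.StartOcr();
            return done.Task.Result;      // blocks here until OcrDone has fired
        }
    }
}
```

Without the blocking wait, the using block would dispose the engine while the worker is still running, which is the situation described above.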
