Result of OCRTesseractDoOCR doesn't match OCRTesseractGetCharCount

Discussions about Tesseract OCR integration in GdPicture.
Slava
Posts: 66
Joined: Fri Jun 22, 2007 4:43 pm

Post by Slava » Fri Oct 02, 2009 11:48 am

Hi,

I suspect a bug somewhere in the Tesseract OCR plugin, introduced around GdPicture version 5.12.0/5.12.1 (?).

For some documents, OCRTesseractGetCharCount returns a larger number than the length of the string (without spaces and newlines) returned by the OCRTesseractDoOCR function. If I run the following loop:

for i := 1 to oGdPicture.OCRTesseractGetCharCount do
  teststring := teststring + Char(oGdPicture.OCRTesseractGetCharCode(i));

Then I compare this teststring with the result of OCRTesseractDoOCR, and there are 2 extra characters in the middle of the text! In my current test document they appear at positions 46 and 51: #25 and ¬
They do not exist in the string returned by the OCRTesseractDoOCR function.

Please respond as soon as possible.

Kind regards,
Slava

Loïc
Site Admin
Posts: 5584
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Post by Loïc » Fri Oct 02, 2009 12:10 pm

Hi,

Please send us a code snippet and the original document so we can reproduce this error.

Kind regards,

Loïc

Post by Slava » Fri Oct 02, 2009 1:42 pm

Hi Loic,

I've created a code snippet for you:

var
  i, j: Integer;
  c: Char;
  OCRResultNoSpaces, OCRResult, teststring: String;
begin
  oGdPicture.SetNativeImage(viewer.GetNativeImage);
  OCRResult := oGdPicture.OCRTesseractDoOCR(TesseractDictionaryDutch, ocrDictionaryPATH, '');
  // Strip spaces and line breaks so the string can be compared character by character
  OCRResultNoSpaces := StringReplace(OCRResult, ' ', '', [rfReplaceAll]);
  OCRResultNoSpaces := StringReplace(OCRResultNoSpaces, #13, '', [rfReplaceAll]);
  OCRResultNoSpaces := StringReplace(OCRResultNoSpaces, #10, '', [rfReplaceAll]);

  teststring := '';
  j := 1;
  for i := 1 to oGdPicture.OCRTesseractGetCharCount do
    begin
      c := Char(oGdPicture.OCRTesseractGetCharCode(i));
      teststring := teststring + c;
      // Guard j so that a trailing extra character cannot index past the end of the string
      if (j > Length(OCRResultNoSpaces)) or (c <> OCRResultNoSpaces[j]) then
        ShowMessage('!Foreign Char @ pos ' + IntToStr(i) + ': ' + c)
      else
        Inc(j);
    end;
  // Set a breakpoint here and compare OCRResultNoSpaces with teststring
  ShowMessage('Same results: ' + BoolToStr(teststring = OCRResultNoSpaces, True));
end;

About 50% of all invoices produce this behavior. I will send a few to the esupport address as soon as I can.

Kind regards,
Slava

Post by Loïc » Fri Oct 02, 2009 3:13 pm

Hi Slava,

Problem solved (it was a Unicode issue).

I will publish a fixed edition on Monday.

Kind regards,

Loïc

Post by Slava » Tue Oct 06, 2009 9:32 pm

Hi Loic,

When do you plan to release the new version? It's Tuesday now and there's still no fix released, at least not for the ActiveX version.

Kind regards,
Slava

Post by Loïc » Tue Oct 06, 2009 9:34 pm

Hi,

I will only be able to publish the new edition at the end of the week because of H5N1 (I must stay at home)...

Thank you for your understanding,

Loïc

Post by Slava » Tue Oct 06, 2009 10:15 pm

Loic,

I apologize for rushing you; I didn't know that. It is a dangerous type. Are you going to be OK?

Good luck with it.

Slava

Post by Loïc » Tue Oct 06, 2009 10:25 pm

Thank you, Slava! :D

It is not dangerous at all. I am just a bit tired. Everything will be OK within a day or two, with a new release for the .NET and ActiveX editions!

Cheers,

Loïc

JRaboin
Posts: 2
Joined: Fri Oct 24, 2014 6:13 pm

Post by JRaboin » Fri Oct 24, 2014 6:19 pm

Good day,

It seems that this problem exists in version 10.2.0.30 of GdPicture.

From what I can see, spaces are not included in the count returned by OCRTesseractGetCharCount, while \r and/or \n are sometimes included. The behavior seems random, and it is difficult to get an accurate count/confidence per character when reading through the string returned from the OCRTesseractDoOCR call.

int[] coords = StringToInts(fldData[2]);
m_GdPictureImaging.SetROI(coords[0] + offset[1], coords[1] + offset[2], coords[2], coords[3]);
string tmpRetval = m_GdPictureImaging.OCRTesseractDoOCR(imageID, _dictionary, _dictionaryPath, _charWhiteList);

// Get the number of characters read (needed to look up each character's confidence)
_OCRcharCount = m_GdPictureImaging.OCRTesseractGetCharCount();

When recognizing a column of information, the length of tmpRetval is always larger than the value of _OCRcharCount.

Please advise.

Post by Loïc » Fri Oct 24, 2014 6:31 pm

Hello,

We are already handling your issue in the helpdesk system, so please do not cross-post here.

We will give you feedback through the helpdesk as soon as we have something.

With best regards,

Loïc

Post by Loïc » Fri Oct 24, 2014 6:35 pm

By the way, the answer was:
Spaces are not considered characters by the OCR engine.
If you want the number of spaces between two recognized glyphs, use the OCRTesseractGetCharSpaces() method.

Please let me know if you need further information.
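
The answer above explains why the string returned by OCRTesseractDoOCR can be longer than OCRTesseractGetCharCount: the string contains whitespace, the per-character list does not. As a rough illustration only, here is a minimal Python sketch of how the two could be lined up. It does not call GdPicture at all; `align_glyphs`, `text` and `glyphs` are hypothetical names, and it assumes that whitespace never appears in the per-character list (in real code that list would be filled from OCRTesseractGetCharCode).

```python
def align_glyphs(text, glyphs):
    """Map each glyph index to its offset in `text`, skipping whitespace.

    Returns (offsets, extras): offsets[k] is the index in `text` matched by
    glyphs[k], or None when glyphs[k] has no counterpart there; extras lists
    the indices of those unmatched glyphs.
    """
    offsets, extras = [], []
    pos = 0
    for k, glyph in enumerate(glyphs):
        # Assumption: whitespace is never reported as a glyph, so step over it.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos < len(text) and text[pos] == glyph:
            offsets.append(pos)
            pos += 1
        else:
            # Glyph without a counterpart in the string (the case from the first post).
            offsets.append(None)
            extras.append(k)
    return offsets, extras


# Made-up data standing in for the OCR string and the per-character list.
text = "AB 12\r\nCD"
glyphs = list("AB12CD")
offsets, extras = align_glyphs(text, glyphs)
print(len(text), len(glyphs))  # 9 6
print(offsets, extras)         # [0, 1, 3, 4, 7, 8] []
```

With `offsets` in hand, the confidence of glyph `k` can be attached to position `offsets[k]` of the string. A non-empty `extras` list points at glyphs that are missing from the string, which is the symptom described at the top of this thread.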

Post by JRaboin » Fri Oct 24, 2014 6:37 pm

Loïc,

Please accept my apology. I do not recall logging an issue on this particular problem. A quick Google search brought this topic up, and I thought a quick post here would help not only me but also others who experience the same issue.

Best regards,
James
