Character After String of Numbers Is OCRed as a Number

Discussions about Tesseract OCR integration in GdPicture.
DocIt
Posts: 3
Joined: Fri Jan 23, 2015 7:36 pm

Character After String of Numbers Is OCRed as a Number

Post by DocIt » Fri Jan 23, 2015 7:45 pm

Hello GdPicture Team,

I have a clean document that is being scanned and OCRed. It contains only a few lines of text, all of which are clear, and the resulting scan is of good quality as well. However, the string "TITLE 1120S K-1" is consistently OCRed as "TITLE 11208 K-1". We are wondering what to do about this. Is the engine seeing a string of digits and assuming the next character is also a digit? Is there anything we can do to fix this?

Thanks

Doc.It Development Team

SamiKharma
Posts: 352
Joined: Tue Sep 27, 2011 11:47 am

Re: Character After String of Numbers Is OCRed as a Number

Post by SamiKharma » Tue Jan 27, 2015 9:56 am

Hi,

Could you please attach the image you are talking about?

Best,
Sami

DocIt
Posts: 3
Joined: Fri Jan 23, 2015 7:36 pm

Re: Character After String of Numbers Is OCRed as a Number

Post by DocIt » Wed Jan 28, 2015 3:34 am

Please take a look at the attached file. As soon as you OCR the documents in it, "TITLE 1120S K-1" consistently comes out as "TITLE 11208 K-1".
Attachment: Documents.zip (images, 70.32 KiB)

DocIt
Posts: 3
Joined: Fri Jan 23, 2015 7:36 pm

Re: Character After String of Numbers Is OCRed as a Number

Post by DocIt » Wed Jan 28, 2015 5:32 am

We need to get this working as soon as possible.

The zip file attached to my previous post contains one PDF file and one TIFF image file. Both of them produce the same result. Please let us know how we can make it work.

Thanks,

Doc.It Development

SamiKharma
Posts: 352
Joined: Tue Sep 27, 2011 11:47 am

Re: Character After String of Numbers Is OCRed as a Number

Post by SamiKharma » Wed Jan 28, 2015 11:01 am

Hi,

Unfortunately, there is nothing we can do. The Tesseract engine has classified this as a single word of digits, and therefore assumes the 'S' is a segmented '8', especially since the line endings of the 'S' are closer to the middle than in most fonts.

Best,
Sami
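
Since the misread happens inside the engine, one possible workaround is to correct the recognized text afterwards. The sketch below is only an illustration in Python, not part of GdPicture or Tesseract: it assumes the application knows in advance which identifiers can appear (the `KNOWN_TOKENS` whitelist is hypothetical), and it only handles the 8/S confusion reported in this thread. The names `correct_token` and `correct_text` are made up for this example.

```python
import re

# Hypothetical whitelist of identifiers expected in these documents.
KNOWN_TOKENS = {"1120S"}

# OCR confusions to undo (misread character -> intended character).
# Only the 8/S pair comes from this thread.
CONFUSIONS = {"8": "S"}


def correct_token(token, known=KNOWN_TOKENS):
    """Return a known token if `token` differs from it only by listed confusions."""
    if token in known:
        return token
    for candidate in known:
        if len(candidate) != len(token):
            continue
        # Every character must match exactly or be a listed misreading.
        if all(a == b or CONFUSIONS.get(a) == b for a, b in zip(token, candidate)):
            return candidate
    return token


def correct_text(text, known=KNOWN_TOKENS):
    """Correct each alphanumeric run on its own, keeping spacing and punctuation."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: correct_token(m.group(0), known), text)


print(correct_text("TITLE 11208 K-1"))
```

Note that this approach would also rewrite a genuine "11208" into "1120S", so it only makes sense when the whitelist is narrow and the plain-digit form is not expected in the same documents.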
