I have a clear document that is being scanned and OCRed. The document contains only a few lines of text, all of which are clear, and the resulting scan is of good quality as well. However, the string "TITLE 1120S K-1" is consistently OCRed as "TITLE 11208 K-1". We are wondering what to do about this. Is the engine seeing a string of digits and assuming the next character is also a digit? Is there anything we can do to fix this?
Doc.It Development Team
Could you please attach the image you are talking about?
The zip file attached to my previous post contains one PDF file and one TIF image file. You will see that both of them produce the same result. Please let us know how we can make it work.
Unfortunately, there is nothing we can do. The Tesseract engine has classified this as a single word made up of digits, and therefore assumes the 'S' is a broken '8', especially since the stroke ends of this 'S' sit closer to the middle of the character than they do in most fonts.
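That said, if the misread is as consistent as described, one possible workaround is to correct the text after OCR rather than in the engine. The following is only a minimal sketch in Python, assuming the OCR output is available as a plain string; `fix_form_ids` and `KNOWN_FORM_IDS` are hypothetical names made up for illustration and are not part of any OCR library. Only the "11208" to "1120S" mapping comes from this thread.

```python
import re

# Hypothetical correction table: identifiers the OCR is known to misread
# as all digits. Only this single entry is taken from the thread.
KNOWN_FORM_IDS = {"11208": "1120S"}

# Match a misread identifier only when it is directly followed by "K-1",
# so that a genuine number such as a quantity of 11208 is left untouched.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, KNOWN_FORM_IDS)) + r")(?=\s+K-1\b)"
)

def fix_form_ids(text: str) -> str:
    """Replace known misread form identifiers in OCR output."""
    return _PATTERN.sub(lambda m: KNOWN_FORM_IDS[m.group(1)], text)

if __name__ == "__main__":
    print(fix_form_ids("TITLE 11208 K-1"))  # prints "TITLE 1120S K-1"
```

Tying the replacement to the following "K-1" keeps the correction narrow; a blanket replacement of every "11208" could corrupt real numbers elsewhere on the page.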