Post by IXDev2 » Mon Nov 03, 2014

Problem Statement:
- On some PDF files, the OCR engine fails to recognize certain digits (for example, 0 (zero) is recognized as the letter "O", and 1 (one) is recognized as the letter "I").

Proposed Solution:
To resolve such problems, we are trying to create a 'Trainable OCR Tool'. In this tool, the user can manually select the characters that the OCR engine did not recognize properly and build a custom dictionary (.traineddata file) from the selected characters.

Query on GDPicture:
We would like to know whether GDPicture provides any mechanism to create a .traineddata file from text input.
For Google's tesseract-ocr, for example, we found the following information on creating a .traineddata file:
- Using a third-party tool (such as txt2image or ghostscript), we can create a .tif image file from either a text file or a PDF file, and then create a .box file by passing the .tif file to tesseract.exe.
- Using the .tif and .box files and following the procedure given at ... Tesseract3, we can generate the .traineddata file.
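For reference, the Tesseract steps described above can be sketched roughly as follows. This is only a sketch, assuming the Tesseract 3.x training tools (tesseract, unicharset_extractor, mftraining, cntraining, combine_tessdata) are installed and on the PATH. The function name, the language code "eng", and the font name "digits" are hypothetical placeholders, and a font_properties file is assumed to exist already. The script only builds and prints the command lines; it does not run them, and none of this applies to GDPicture itself.

```python
from typing import List


def tesseract3_training_commands(lang: str, font: str, exp: int = 0) -> List[List[str]]:
    """Return the Tesseract 3.x training commands, in order, as argv lists.

    Hypothetical helper for illustration; it runs nothing by itself.
    """
    # Tesseract 3 expects training files named <lang>.<font>.exp<N>.<ext>.
    base = f"{lang}.{font}.exp{exp}"
    tif, box, tr = f"{base}.tif", f"{base}.box", f"{base}.tr"
    return [
        # 1. Generate a .box file from the image (correct it by hand afterwards).
        ["tesseract", tif, base, "batch.nochop", "makebox"],
        # 2. Use the image and the corrected .box file to write the .tr feature file.
        ["tesseract", tif, base, "box.train"],
        # 3. Collect the set of characters that appear in the box file.
        ["unicharset_extractor", box],
        # 4. Shape training; assumes a font_properties file in the working directory.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", tr],
        # 5. Character normalization training.
        ["cntraining", tr],
        # 6. After renaming inttemp, normproto, pffmtable and shapetable with the
        #    "<lang>." prefix, combine everything into <lang>.traineddata.
        ["combine_tessdata", f"{lang}."],
    ]


if __name__ == "__main__":
    for cmd in tesseract3_training_commands("eng", "digits"):
        print(" ".join(cmd))
```

The box file produced in step 1 normally has to be reviewed and corrected manually before step 2, which is where a 'Trainable OCR Tool' of the kind described above would fit in.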

Does GDPicture provide such a capability?

Re: Support To Create .traineddata files.

Post by David » Mon Feb 08, 2016


GdPicture doesn't expose this feature at the moment.

Please note that training an engine may be quite challenging and requires considerable knowledge of computer vision. Could you share some images on which these digits are misread?

We would like to have a look at them so we can identify the reason for the lack of accuracy. Please note that we can offer fine-tuning and training of the character recognition engine.



