The OCRTesseractDoOCR method returns a single string including all words but no coordinates.
Using methods such as OCRTesseractGetCharCount and OCRTesseractGetCharLeft, I can cycle through the individual characters.
Is there a method to extract the words and coordinates from an image?
Clearly Tesseract applies some logic to determine what a word is, since that logic must be used when building the string returned by the DoOCR method.
Thanks for any help.
I think the functionality to get words from an OCR engine is very important, too!
Every other OCR engine delivers words, which can then be searched for patterns, keywords, etc.
As far as I understand, the current Tesseract engine does not.
Strangely, the SaveAsPDFOCR method creates words in the final PDF that can be searched.
Is there no other way than to cycle through the results and split the characters at the detected spaces to create words?
Or are we doing something wrong? Is there a method or a parameter that changes this behavior and that we have not found yet?
The process is quite simple. Just retrieve all recognized characters of the document using the appropriate methods. During your iteration, if the method OCRTesseractGetCharSpaces() returns a value other than 0, you are at the beginning of a new word.
Let me know if you need further information.
Thank you for your reply.
Sure, you're right: on the one hand it is easy, and I understand that the Tesseract engine comes from a Google project and is not one of your major programming tasks. But you are the ones providing an SDK with it. So if everyone has to do this "simple programming", why not offer it as a simpler method or class in the next update, such as ".OCRTesseractGetWordCount"?
Here is my little piece of code:
Before GdPicture, I used to store the OCR results in an array of this simple structure:
Code: Select all
Public Structure OCRDataStruct
    Public Coord As RectangleF
    Public Text As String
    Public Confidence As Double
End Structure
And there is the first problem: how do I get the bounding box?
So, many months ago, I wrote this little helper routine to build the words from the characters and their spaces, as you described in your post, and to derive the bounding box from the existing data:
Code: Select all
' Build the word list with coordinates
Dim wordList As New List(Of OCRDataStruct), word As String = "", newWord As OCRDataStruct
Dim maxBottom As Long, maxRight As Long
For i = 1 To tOCRGdPictureImaging.OCRTesseractGetCharCount
    If i = 1 Then
        ' First character: start the first word at its top-left corner
        newWord.Text = ""
        newWord.Coord = New RectangleF(tOCRGdPictureImaging.OCRTesseractGetCharLeft(i), tOCRGdPictureImaging.OCRTesseractGetCharTop(i), 0, 0)
    Else
        ' A non-zero space count marks the beginning of a new word
        If tOCRGdPictureImaging.OCRTesseractGetCharSpaces(i) > 0 Then
            ' Close the previous word; its top is taken from its first character
            newWord.Text = word
            newWord.Coord = New RectangleF(newWord.Coord.Left, newWord.Coord.Top, maxRight - newWord.Coord.Left, maxBottom - newWord.Coord.Top)
            wordList.Add(newWord)
            ' Start the next word at the current character
            newWord.Text = ""
            newWord.Coord = New RectangleF(tOCRGdPictureImaging.OCRTesseractGetCharLeft(i), tOCRGdPictureImaging.OCRTesseractGetCharTop(i), 0, 0)
            word = ""
            maxBottom = 0
            maxRight = 0
        End If
    End If
    ' Append the character and grow the right and bottom edges of the current word
    word += ChrW(tOCRGdPictureImaging.OCRTesseractGetCharCode(i))
    maxBottom = Math.Max(maxBottom, tOCRGdPictureImaging.OCRTesseractGetCharBottom(i))
    maxRight = Math.Max(maxRight, tOCRGdPictureImaging.OCRTesseractGetCharRight(i))
Next
' Close the last word
newWord.Text = word
newWord.Coord = New RectangleF(newWord.Coord.Left, newWord.Coord.Top, maxRight - newWord.Coord.Left, maxBottom - newWord.Coord.Top)
wordList.Add(newWord)
I see a disadvantage in the missing confidence of the words. Since I have only skimmed the Tesseract article, I cannot tell where the Tesseract engine gets its confidence for a certain character.
For example, take the simple word "Look". The uppercase "L" and the lowercase "k" are characters that are recognized quite easily.
But the two lowercase "o" characters can also be interpreted as two small zeros ("0"). At the character level there is a real chance of deciding that these are zeros instead of the letter "o", with a confidence of e.g. 60:40, but "L00k" gets a much lower confidence than "Look" if a dictionary is applied on a word basis instead of single-character recognition.
I thought a dictionary was used for the OCR recognition on a word basis. And if that is so, why not deliver these results, too?
By the way, if a word is split because it is too long for the rest of the line, e.g.
"........ swinging his long-
sword over his.......", splitting on white space will not do the job. The only solution for these cases is a dictionary.
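The dictionary idea above can be sketched independently of the SDK. The following Python example is only an illustration under stated assumptions: the words are assumed to be already grouped per line, and `join_hyphenated` and the `dictionary` set are hypothetical names, not part of GdPicture or Tesseract.

```python
from typing import List, Optional, Set


def join_hyphenated(lines: List[List[str]], dictionary: Set[str]) -> List[str]:
    """Flatten lines of words, merging a word that was split by a trailing hyphen."""
    result: List[str] = []
    pending: Optional[str] = None  # first half of a split word, hyphen removed
    for line in lines:
        for word in line:
            if pending is not None:
                joined = pending + word
                # Use the unhyphenated form only if the dictionary knows it;
                # otherwise keep the hyphen, since it may belong to the word.
                word = joined if joined.lower() in dictionary else pending + "-" + word
                pending = None
            result.append(word)
        # A line ending in "xyz-" may continue on the next line.
        if line and result and len(result[-1]) > 1 and result[-1].endswith("-"):
            pending = result.pop()[:-1]
    if pending is not None:
        result.append(pending + "-")  # nothing followed; restore the word as read
    return result


lines = [["swinging", "his", "long-"], ["sword", "over", "his"]]
print(join_hyphenated(lines, {"longsword"}))
print(join_hyphenated(lines, set()))
```

With "longsword" in the dictionary the two halves are merged into one word; without it the hyphen is kept and the result is "long-sword".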
Every professional programmer who is not only trying to make searchable PDF files will need this functionality, because the decision whether keywords are found at defined positions is made on a word basis.
Thank you very much for your patience and for the update for the PDF and MRC JPGs.
We are also not satisfied with the missing possibility to get words. In our application we use OCR for paperless booking. We scan the OCR result for invoice numbers, dates, amounts and so on to book the invoice automatically. For this we need a good OCR result, or the possibility to find out the confidence of the words so that we can show the user that automatic booking is not possible.
It would be a nice improvement if the Tesseract API returned words, coordinates and the confidence.
The confidence of the engine is word based. Just get the confidence of the first char to get the confidence of the word. You will see that all chars in the word have the same confidence.
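The rule described here (a word takes the confidence of its first character) can be sketched without the SDK. The Python example below uses plain records instead of GdPicture calls. The `Char` and `Word` types, their field names and the threshold of 80 are assumptions made for illustration: `spaces` stands in for the value returned by OCRTesseractGetCharSpaces(), and `confidence` for whatever per-character confidence the engine reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str
    spaces: int        # spaces before this character; 0 means same word
    confidence: float  # per-character confidence as reported by the engine


@dataclass
class Word:
    text: str
    confidence: float


def words_with_confidence(chars: List[Char]) -> List[Word]:
    """Group characters into words; each word keeps its first character's confidence."""
    words: List[Word] = []
    for ch in chars:
        if not words or ch.spaces > 0:
            words.append(Word(text=ch.text, confidence=ch.confidence))
        else:
            words[-1].text += ch.text
    return words


def low_confidence(words: List[Word], threshold: float = 80.0) -> List[Word]:
    """Return the words that should be shown to the user for manual review."""
    return [w for w in words if w.confidence < threshold]


chars = [Char("N", 0, 95), Char("o", 0, 95), Char("6", 1, 60), Char("6", 0, 60)]
for w in low_confidence(words_with_confidence(chars)):
    print(w.text, w.confidence)
```

In a booking scenario like the one mentioned in this thread, a non-empty `low_confidence` result could be the signal to skip automatic processing for that document.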
@Roland, please check EFernkaes's snippet; it shows a simple way to extract words. I don't know what more I can say.
We have added a new method to simplify word association. It is: OCRTesseractGetCharWord()
I've looked for a function to build this but I cannot find this anywhere.
This is quite obvious when looking at the overall results returned from 'OCRTesseractDoOCR' when splitting each line and removing empty ones using something similar to this:
"' 5th August 2017 Statement No. 66"
As you can see the 66Company is a problem as Company is actually located on line 2.
I'll try to build the word list differently using the method Loic mentioned but it would be handy to be able to use something like this:
if (OCRTesseractGetCharSpaces(i) > 0 || OCRTesseractEndOfLine(i))
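Without the wished-for end-of-line check, a line break can be approximated from the character coordinates. The Python sketch below is a heuristic under assumptions, not SDK behavior: it assumes left-to-right, single-column text and treats a character as starting a new line when it jumps back to the left of the previous character or lies entirely below it. The `Char` record and both function names are hypothetical. The coordinates would come from the per-character position methods.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str
    left: int
    top: int
    right: int
    bottom: int
    spaces: int  # spaces before this character; 0 means none were reported


def starts_new_line(prev: Char, cur: Char) -> bool:
    """Heuristic line-break test for left-to-right, single-column text."""
    return cur.left < prev.left or cur.top >= prev.bottom


def split_words(chars: List[Char]) -> List[str]:
    """Start a new word at a reported space or at a detected line break."""
    words: List[str] = []
    for i, ch in enumerate(chars):
        if i == 0 or ch.spaces > 0 or starts_new_line(chars[i - 1], ch):
            words.append(ch.text)
        else:
            words[-1] += ch.text
    return words


# "66" ends the first line; "Co" starts the second line with no space reported.
chars = [
    Char("6", 500, 10, 508, 30, 1),
    Char("6", 510, 10, 518, 30, 0),
    Char("C", 20, 40, 30, 60, 0),
    Char("o", 32, 46, 40, 60, 0),
]
print(split_words(chars))
```

Multi-column layouts or skewed scans would need a more careful test, for example comparing the vertical overlap of neighboring characters.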
The new class GdPictureOCR is now available for such purposes:
https://guides.gdpicture.com/content/we ... reOCR.html