OCR results file

Discussions about Tesseract OCR integration in GdPicture.
misterT
Posts: 2
Joined: Wed Aug 17, 2016 6:26 am

OCR results file

Post by misterT » Wed Aug 17, 2016 6:32 am

Hi,

I've built an invoice recognition/learning product that relies on the Recostar OCR engine. In particular, it processes the XML OCR results file created by Recostar. Because I can't find any way to license Recostar, and because I want to build a web-based point-and-click indexing solution, I'm considering GdPicture.

What I'd like to know is whether Tesseract produces a similar kind of output file, giving all characters and words along with their locations. Or do you have to use this method to get that information: PdfReaderGetPageTextWithCoords?

Note that I searched for that method in the online documentation and got no hits, which is a bit of a worry.

Thanks, Turhan

misterT
Posts: 2
Joined: Wed Aug 17, 2016 6:26 am

Re: OCR results file

Post by misterT » Wed Aug 24, 2016 11:46 pm

Do GdPicture staff monitor this forum at all? I've seen many very sensible questions in the forum go unanswered, and I've not had any response in over five days! To me that almost rules this product out, because a product is really only as good as the support provided. Add to that the fact that this new method, "PdfReaderGetPageTextWithCoords", was released back in 2010:

post9116.html?hilit=PdfReaderGetPageTex ... ords#p9116

So why can't I find any mention of it in the documentation six years later? That's pretty much inexcusable from my perspective.

All of this is such a huge shame because the product actually looks really good. But there is no way I can risk launching a commercial product without quality support for the underlying engine driving it.

Cedric
Posts: 259
Joined: Sun Sep 02, 2012 7:30 pm

Re: OCR results file

Post by Cedric » Mon Aug 29, 2016 11:08 am

PdfReaderGetPageTextWithCoords is a method that was introduced in GdPicture.NET 7, a long-discontinued version, and it no longer exists in the product.
The reason is simple: since GdPicture.NET 8, the PDF features have grown considerably, and a separate PDF plugin now handles everything PDF-related, including text extraction.

In the current GdPicture.NET release (GdPicture.NET 12) the method you are looking for is in the GdPicturePDF class and is called GetPageTextWithCoords.
Here is a link to the corresponding documentation: http://guides.gdpicture.com/content/web ... oords.html
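
For the kind of downstream processing described in the first post, the string returned by GetPageTextWithCoords could be parsed along the following lines. This is only a rough sketch in Python, not GdPicture code. It assumes that each line of the result holds eight numeric fields (the x/y coordinates of four corner points) followed by the text, all joined by the separator passed to the method; that layout, the "~" separator, and the names TextItem and parse_text_with_coords are assumptions made for illustration and should be checked against the linked documentation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextItem:
    """One piece of text plus its four corner points (hypothetical type)."""
    corners: Tuple[Point, ...]
    text: str

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """Axis-aligned box as (min_x, min_y, max_x, max_y)."""
        xs = [x for x, _ in self.corners]
        ys = [y for _, y in self.corners]
        return (min(xs), min(ys), max(xs), max(ys))


def parse_text_with_coords(raw: str, sep: str = "~") -> List[TextItem]:
    """Parse lines of the ASSUMED form x1~y1~x2~y2~x3~y3~x4~y4~text."""
    items: List[TextItem] = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split off eight coordinate fields; the remainder is the text,
        # which may itself contain the separator character.
        fields = line.split(sep, 8)
        if len(fields) < 9:
            continue  # line does not match the assumed layout
        try:
            nums = [float(f) for f in fields[:8]]
        except ValueError:
            continue  # non-numeric coordinate field, skip the line
        corners = tuple((nums[i], nums[i + 1]) for i in range(0, 8, 2))
        items.append(TextItem(corners, fields[8]))
    return items
```

Lines that do not fit the assumed layout are skipped rather than raising an error, so a mismatch with the real format shows up as an empty or short result instead of a crash.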
