I've built an invoice recognition/learning product that relies on the Recostar OCR engine. In particular, it processes the XML OCR results file created by Recostar. As I can't seem to find any way to license Recostar, and because I want to build a web-based point-and-click indexing solution, I'm considering GdPicture.
What I'd like to know is: does Tesseract produce a similar kind of output file, giving all characters and words along with their locations? Or do you have to use this command to get that information: PdfReaderGetPageTextWithCoords?
Note that I searched for that command in the online documentation and got no hits, which is a bit of a worry.
post9116.html?hilit=PdfReaderGetPageTex ... ords#p9116
So why can't I find any mention of it in documentation six years later? That's pretty much inexcusable from my perspective.
All of this is such a huge shame because the product actually looks really good. But there is no way I can risk launching a commercial product without quality support for the underlying engine driving it.
The reason is simple: since GdPicture.NET 8, the PDF features have grown considerably, and a separate PDF plugin now handles everything PDF-related, including text extraction.
In the current GdPicture.NET release (GdPicture.NET 12), the method you are looking for is in the GdPicturePDF class and is called GetPageTextWithCoords.
Here is a link to the corresponding documentation: http://guides.gdpicture.com/content/web ... oords.html
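For the indexing use case, the string returned by GetPageTextWithCoords would still need to be split into text fragments and their boxes. The following is only a rough sketch of that step, written in Python for brevity. The record layout it assumes (one record per line, eight corner coordinates followed by the text, all joined by a "~" separator) is an assumption and should be checked against the documentation linked above. The names `TextBox` and `parse_text_with_coords` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    """One text fragment plus the corner points of its bounding box."""
    corners: List[float]  # x1, y1, x2, y2, ... in whatever unit the engine reports
    text: str


def parse_text_with_coords(raw: str, sep: str = "~", n_coords: int = 8) -> List[TextBox]:
    """Parse an ASSUMED 'coords + text' record layout, one record per line.

    The separator and the number of coordinate fields are parameters because
    the real layout has not been confirmed here.
    """
    boxes: List[TextBox] = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split at most n_coords times so the text itself may contain `sep`.
        fields = line.split(sep, n_coords)
        if len(fields) <= n_coords:
            continue  # too few fields: not a complete record
        try:
            corners = [float(value) for value in fields[:n_coords]]
        except ValueError:
            continue  # non-numeric coordinate: skip the malformed record
        boxes.append(TextBox(corners, fields[n_coords]))
    return boxes
```

Each resulting `TextBox` could then be mapped to a clickable region in a point-and-click indexing page, after converting the coordinates to the resolution of the displayed page image.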