Is there any plan to allow HOCR file creation with Tesseract? I've tried gdp.OCRTesseractSetVariable("tessedit_create_hocr", "1"); but it doesn't seem to save the file anywhere.
This functionality might not be too hard to implement, considering Tesseract already does it.
Thank you and have a nice weekend!
The feature is on our wish list, but we have not set a high priority on it, so I cannot give a release date at the moment.
Could you describe what you wish to do with the HOCR output?
Please note that you can access the full text by means of GetPageText:
http://guides.gdpicture.com/content/web ... eText.html
GetPageText will not retrieve all of the details that HOCR may contain, but the plain text may be sufficient for your needs. Please note that GetPageText works with searchable PDFs and also with text PDFs.
- Extract the text layer in HOCR format.
- Manipulate the results.
- Create a searchable PDF from the HOCR file. There would be no need to run OCR again at that point, so it's a performance gain.
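To illustrate the second step, here is a minimal Python sketch of what manipulating HOCR output could look like outside of GdPicture. It assumes the usual Tesseract HOCR layout, where each word is an element with class ocrx_word whose title attribute carries "bbox x0 y0 x1 y1". The names HocrWordParser, parse_bbox, extract_words and the SAMPLE markup are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical HOCR fragment, shaped like typical Tesseract output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 210 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an HOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every element with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, bbox) pairs
        self._depth = 0    # > 0 while inside an ocrx_word element
        self._bbox = None  # bbox of the word currently being read
        self._chunks = []  # text fragments of that word

    def handle_starttag(self, tag, attrs):
        if self._depth > 0:
            self._depth += 1  # nested markup inside a word, e.g. <strong>
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._chunks = []

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:  # the word element itself just closed
            self.words.append(("".join(self._chunks).strip(), self._bbox))


def extract_words(hocr_text):
    """Parse HOCR markup and return the list of (text, bbox) pairs."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

With the word boxes in hand, the text can be filtered, corrected, or re-positioned before the PDF is built, which is the kind of manipulation plain text alone does not allow.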
Also, none of the providers that I know of really support this, or support it well. That's a sweet spot for GdPicture to exploit.
I'm confident you could get that done with a minimum of effort as Tesseract already supports it and you offer a wrapper around their libraries.
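For reference, this is roughly how HOCR output is requested from the stock Tesseract command-line tool, sketched in Python. It assumes a tesseract executable on the PATH and the "hocr" config file that ships with Tesseract. The helper names build_hocr_command and run_hocr are hypothetical, and the ".hocr" extension is an assumption that holds for recent versions (older releases wrote ".html"). It says nothing about how GdPicture wraps the engine internally.

```python
import subprocess


def build_hocr_command(image_path, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang> hocr."""
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


def run_hocr(image_path, output_base, lang="eng"):
    """Run Tesseract and return the path of the HOCR file it should produce.

    Assumption: recent Tesseract versions write <output_base>.hocr.
    """
    subprocess.run(build_hocr_command(image_path, output_base, lang), check=True)
    return output_base + ".hocr"
```

A call such as run_hocr("scan.png", "scan") would run OCR once and leave the HOCR file on disk for later steps to reuse.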