HOCR

Feature Requests for GdPicture.NET.
SarrasiM
Posts: 22
Joined: Thu Dec 17, 2015 6:20 pm

Post by SarrasiM » Fri Mar 04, 2016 10:07 pm

Good day,

Is there any plan to allow HOCR file creation with Tesseract? I've tried to use gdp.OCRTesseractSetVariable("tessedit_create_hocr", "1"); but it doesn't seem to save the file anywhere.

Maybe this functionality would not be too hard to implement, considering Tesseract already does it.

Thank you and have a nice weekend!

David
Posts: 66
Joined: Mon Feb 08, 2016 3:12 pm

Re: HOCR

Post by David » Mon Mar 07, 2016 3:51 pm

Hi,

The feature is on the wish list, but we have not given it a high priority. I therefore cannot give a release date at the moment.

Would it be possible for you to describe to us what you wish to do with the HOCR output?

Thank you

David

SarrasiM
Posts: 22
Joined: Thu Dec 17, 2015 6:20 pm

Re: HOCR

Post by SarrasiM » Mon Mar 07, 2016 4:02 pm

Hello David,

We'd like to explore text-based document classification using HOCR.

Thanks!

David
Posts: 66
Joined: Mon Feb 08, 2016 3:12 pm

Re: HOCR

Post by David » Tue Mar 08, 2016 9:57 am

Hi,

Please note that you can access the full text by means of GetPageText:
http://guides.gdpicture.com/content/web ... eText.html

GetPageText will not retrieve all of the details that HOCR may contain, but the raw text may be sufficient for your needs. Please note that GetPageText works with searchable PDFs as well as text PDFs.

Regards,

David

SarrasiM
Posts: 22
Joined: Thu Dec 17, 2015 6:20 pm

Re: HOCR

Post by SarrasiM » Thu Mar 10, 2016 8:04 pm

Hello David,

Thanks for the information. Unfortunately, we also need positional information for the text blocks :)
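
To make the request concrete, here is a rough sketch of the positional data we mean. hOCR stores a bounding box in each element's title attribute (for example `bbox 36 92 96 116`, in pixels from the top-left corner of the page image). The Python below reads the word boxes from a small fragment using only the standard library. It assumes the hOCR was produced outside GdPicture (for example by the Tesseract command line); the sample markup and the names `parse_title`, `HocrWordParser`, and `extract_words` are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample shaped like Tesseract's hOCR output (not real OCR results).
SAMPLE_HOCR = """
<div class='ocr_page' id='page_1' title='bbox 0 0 1000 1400; ppageno 0'>
 <span class='ocr_line' id='line_1_1' title='bbox 36 90 300 120'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Invoice</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 110 92 180 116; x_wconf 91'>Number</span>
 </span>
</div>
"""


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            props = parse_title(attrs.get("title") or "")
            bbox = tuple(int(v) for v in props.get("bbox", []))
            self._current = {"text": "", "bbox": bbox}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Simplification: assumes word spans do not contain nested spans.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    """Return a list of {'text': str, 'bbox': (x0, y0, x1, y1)} dicts."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_HOCR):
        print(word["text"], word["bbox"])
```

With boxes like these we could, for instance, classify a document by which words appear in its header area rather than by the text alone.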

SarrasiM
Posts: 22
Joined: Thu Dec 17, 2015 6:20 pm

Re: HOCR

Post by SarrasiM » Wed Apr 06, 2016 5:24 pm

It would also be a good addition for scenarios like this one:

- Extract the text layer in HOCR format.
- Manipulate the results.
- Create a searchable PDF from the HOCR file. You wouldn't need to run OCR again at this point, so it's a performance gain.
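
As a rough illustration of why the last step needs no second OCR pass: the HOCR boxes already say where each word goes, so placing the text layer is mostly a coordinate conversion. hOCR boxes are in image pixels with the origin at the top-left, while PDF's default user space uses points (1/72 inch) with the origin at the bottom-left. The Python sketch below shows only that conversion. It is not how GdPicture or Tesseract do it internally, the function name `bbox_to_pdf_points` is hypothetical, and it assumes the scan resolution (dpi) is known.

```python
def bbox_to_pdf_points(bbox, page_height_px, dpi):
    """Convert an hOCR pixel box to a PDF rectangle in points.

    bbox is (x0, y0, x1, y1) in pixels, origin at the top-left (hOCR).
    The result is (left, bottom, right, top) in points, origin at the
    bottom-left (PDF default user space). dpi is an assumed, known
    scan resolution.
    """
    x0, y0, x1, y1 = bbox
    left = x0 * 72.0 / dpi
    right = x1 * 72.0 / dpi
    # Flip the vertical axis: the lower edge of the box in the image
    # (y1) becomes the bottom of the rectangle in PDF space.
    bottom = (page_height_px - y1) * 72.0 / dpi
    top = (page_height_px - y0) * 72.0 / dpi
    return (left, bottom, right, top)


if __name__ == "__main__":
    # A word box on a letter-size page scanned at 300 dpi (2550 x 3300 pixels).
    print(bbox_to_pdf_points((300, 600, 900, 750), page_height_px=3300, dpi=300))
```

A PDF writer would then draw the word as invisible text inside that rectangle, on top of the page image.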

Also, none of the providers that I know of really support this, or have great support for it. That's a sweet spot for GdPicture to exploit :)

I'm confident you could get that done with a minimum of effort as Tesseract already supports it and you offer a wrapper around their libraries.

Thank you!
