Duplicate OCR Text

Discussions about Tesseract OCR integration in GdPicture.
jdaniels57
Posts: 2
Joined: Wed Mar 16, 2016 11:32 pm

Duplicate OCR Text

Post by jdaniels57 » Wed Mar 16, 2016 11:53 pm

We have been testing the GdPicture OCR functionality (version 11) and have run into some issues we cannot explain. When selecting text, some words are returned twice, and some selections do not accurately match the specific words selected. I have attached examples of what we are working with so you can determine what is happening and why. We call viewer.GetSelectedText() to get the user-selected text. Could you please advise on the best way to handle this issue?
Attachments
DuplicateOCR.zip
Examples of duplicate OCR selection
(726.36 KiB) Downloaded 63 times

Cedric
Posts: 263
Joined: Sun Sep 02, 2012 7:30 pm

Re: Duplicate OCR Text

Post by Cedric » Thu Mar 17, 2016 3:30 pm

I'm not sure I understand what you did there. What I gather from your explanation is that you created a Word document, saved it as a PDF, and then ran an OCR process over it. Is that correct?
If so, the duplicate text comes from the fact that you are extracting both the original text and the OCR text (which is written transparently on top of the image).
Running OCR on a document that is not an image is not really useful, since you already have extractable text in the first place.

To avoid text duplication, you can either convert your source PDF to an image (this is called rasterization) and then run OCR on it, or extract the text directly without running OCR.

jdaniels57
Posts: 2
Joined: Wed Mar 16, 2016 11:32 pm

Re: Duplicate OCR Text

Post by jdaniels57 » Tue Mar 22, 2016 12:47 am

Thank you for that explanation. We were just using test files for our system and attempting to account for what a user might do. We can modify our code to handle the issue.

Cedric
Posts: 263
Joined: Sun Sep 02, 2012 7:30 pm

Re: Duplicate OCR Text

Post by Cedric » Tue Mar 22, 2016 10:35 am

I see. In that case, you might want to detect whether the PDF already has text before running an OCR process. That can be done with the PageHasText method, documented here: http://guides.gdpicture.com/content/web ... sText.html
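
The check-before-OCR idea can be sketched roughly as follows. This is a language-neutral illustration in Python, not GdPicture code: the Page class, page_has_text, and the run_ocr callback are hypothetical stand-ins, and page_has_text only mimics what a PageHasText-style check is assumed to return. Refer to the linked documentation for the real method and its signature.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for one PDF page; not a GdPicture type."""
    number: int
    embedded_text: str  # text already stored in the PDF, "" for a pure scan


def page_has_text(page: Page) -> bool:
    """Assumed semantics of a PageHasText-style check: any non-blank text."""
    return bool(page.embedded_text.strip())


def text_for_pages(pages: List[Page], run_ocr: Callable[[Page], str]) -> List[str]:
    """Return one string per page, running OCR only on pages without text."""
    results = []
    for page in pages:
        if page_has_text(page):
            # The page already carries extractable text; running OCR on top
            # of it would add a second, invisible copy of the same words.
            results.append(page.embedded_text)
        else:
            # Image-only page: OCR is the only source of text.
            results.append(run_ocr(page))
    return results
```

With this shape, a PDF exported from Word is never passed to the OCR step, so a later text selection sees only one copy of each word, while scanned pages are still recognized.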
