How to know whether a PDF page has text (invisible / searchable), from OCR or created by Word, Excel or any other tool

Example requests & Code samples for GdPicture Toolkits.
reisrf
Posts: 23
Joined: Thu May 28, 2015 4:30 pm

Post by reisrf » Thu Jun 07, 2018 11:38 pm

I will receive PDFs in which some pages have invisible, searchable text (for example, PDFs created by OCR tools, Office tools, or others), while other pages are scanned images without OCR content, so they can't be searched. For the pages without OCR content, I need to apply OCR and create the hidden text at the specific locations (this I know how to do). My question is: how do I detect whether a page has invisible text?

Thanks in advance

Robson Reis



reisrf
Posts: 23
Joined: Thu May 28, 2015 4:30 pm

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Post by reisrf » Tue Jun 19, 2018 8:52 pm

The PageHasText method returns True even if the page contains only special characters like \r, \n, \l, .... I have created my own PageHasText, using GetPageText:

// Strip everything except ASCII letters and digits from the extracted page text.
string pageText = Regex.Replace(_gdPDF.GetPageText(), "[^0-9a-zA-Z]+", string.Empty);
// The page counts as having text only if something is left.
return pageText.Length > 0;

The snippet above returns True if the page text contains at least one digit or one letter (lowercase or uppercase, A-Z only) and False if there are only spaces or special characters.
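One thing to watch with the regex above is that it keeps only ASCII letters and digits, so a page whose text is made up entirely of accented or non-Latin characters would be reported as empty. A minimal sketch of a Unicode-aware variant follows. The class and method names (PageTextCheck, HasMeaningfulText) are hypothetical and not part of GdPicture; the method takes the already extracted page text as a plain string.

```csharp
using System;
using System.Linq;

public static class PageTextCheck
{
    // Hypothetical helper, not a GdPicture API.
    // Returns true when the extracted page text contains at least one
    // letter or digit in any script; whitespace, control characters and
    // punctuation alone do not count as text.
    public static bool HasMeaningfulText(string pageText)
    {
        return !string.IsNullOrEmpty(pageText)
            && pageText.Any(char.IsLetterOrDigit);
    }
}
```

Assuming the same _gdPDF object as in the snippet above, it could be called as PageTextCheck.HasMeaningfulText(_gdPDF.GetPageText()), and a False result would mark the page as needing OCR.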

Gabriela
Posts: 244
Joined: Wed Nov 22, 2017 9:52 am

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Post by Gabriela » Mon Jan 21, 2019 4:48 pm

Hi,

The PageHasText() method returns true if any text at all is on the page. Special characters are considered text; hence the method is working correctly. Your workaround is nice, and it clearly works well for you. It always depends on the requirements you have for your application. Methods intended for general use need to do the proper job for all users. You can open a ticket on our support platform if you need a "custom" method, so we can investigate it further and offer you a solution.
Kind regards,

Gabriela
GdPicture Team

reisrf
Posts: 23
Joined: Thu May 28, 2015 4:30 pm

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Post by reisrf » Mon Jan 21, 2019 6:13 pm

No worries. My custom code is in place and it is working as expected. Case can be closed. Many thanks

Gabriela
Posts: 244
Joined: Wed Nov 22, 2017 9:52 am

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Post by Gabriela » Mon Jan 21, 2019 9:02 pm

Hi,

Thank you for your reply. Please do not hesitate to contact us if you need any custom solution or further technical assistance.
Kind regards,

Gabriela
GdPicture Team
