How to know whether a PDF page has text (invisible, searchable), whether from OCR or created by Word, Excel, or any other tool

Example requests & Code samples for GdPicture Toolkits.
reisrf
Posts: 20
Joined: Thu May 28, 2015 4:30 pm

Post by reisrf » Thu Jun 07, 2018 11:38 pm

I will receive PDFs in which some pages contain invisible, searchable text (for example, PDFs produced by OCR tools, Office applications, or other software), while other pages are plain scanned images with no OCR content, so they cannot be searched. For the pages without OCR content, I need to run OCR and add the hidden text at the corresponding locations (this part I already know how to do). My question is: how can I detect whether a page already has invisible text?

Thanks in advance

Robson Reis




Re: How to know whether a PDF page has text (invisible, searchable)

Post by reisrf » Tue Jun 19, 2018 8:52 pm

The PageHasText method returns True even when the page contains only special characters such as \r, \n, \l, ... I have created my own PageHasText, using GetPageText:

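// Requires: using System.Text.RegularExpressions;
// Strip everything except ASCII letters and digits, then check whether anything is left.
string pageText = Regex.Replace(_gdPDF.GetPageText(), "[^0-9a-zA-Z]+", string.Empty);
return pageText.Length > 0;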

The snippet above returns true if the page text contains at least one digit or ASCII letter (lowercase or uppercase), and false if it contains only whitespace or special characters.
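
A possible variation, sketched here with some assumptions: the regular expression only recognizes ASCII letters, so a page whose text consists solely of accented or non-Latin characters would be reported as having no text. The helper below uses char.IsLetterOrDigit instead, which is Unicode-aware. The names PageTextCheck, HasSearchableText, PagesNeedingOcr, and the getPageText callback are hypothetical and not part of GdPicture; the callback would wrap whatever the toolkit uses to fetch a page's text, such as the GetPageText call in the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageTextCheck
{
    // True if the extracted text contains at least one letter or digit.
    // char.IsLetterOrDigit is Unicode-aware, so "é" or "ç" count as text,
    // while control characters, whitespace, and punctuation do not.
    public static bool HasSearchableText(string pageText)
    {
        return !string.IsNullOrEmpty(pageText) && pageText.Any(char.IsLetterOrDigit);
    }

    // Returns the 1-based numbers of the pages that still need OCR.
    // getPageText is a hypothetical callback (an assumption of this sketch)
    // that returns the extracted text for the given page number.
    public static List<int> PagesNeedingOcr(int pageCount, Func<int, string> getPageText)
    {
        var pages = new List<int>();
        for (int page = 1; page <= pageCount; page++)
        {
            if (!HasSearchableText(getPageText(page)))
            {
                pages.Add(page);
            }
        }
        return pages;
    }
}
```

With this sketch, HasSearchableText("\r\n \f") is false and HasSearchableText("Olá") is true. Keeping the check separate from the page loop means it can be tried on plain strings before wiring it to a real document.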
