Can you please keep me informed about this? I am also trying to use that example code to turn a non-searchable PDF into a searchable, OCR'd PDF, but it does not produce one: the output is a PDF/A with no embedded text. I know it is at least running the OCR with the dictionary files, because each page takes a couple of seconds to process. I have not attached a sample PDF because this fails with every PDF I have tested.
I am using C# and the .NET version of GdPicture Pro and Tesseract. Here's my code:
http://dpaste.org/uLWu/
String dictionaries = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location) + @"\dictionaries";

// open the new pdf in the viewer
viewer.DisplayFromFile(out_file);

for (int x = 1; x <= viewer.PageCount; x++)
{
    Console.WriteLine("Performing image twain on page {0}", x);
    viewer.DisplayFrame(x);
    Int32 rasterized_page = viewer.GetNativeImage();

    // start the OCR'd output pdf once, before the first page is added
    if (x == 1)
        imaging.TwainPdfOCRStartEx(String.Format("{0}.ocr.pdf", out_file), "", "", "", "", "", PdfEncryption.PdfEncryptionNone, PdfRight.PdfRightCanModify);

    imaging.TwainAddGdPictureImageToPdfOCR(rasterized_page, TesseractDictionary.TesseractDictionaryEnglish, dictionaries);
}

// close the twaining
imaging.TwainPdfOCRStop();
viewer.CloseImage();