We have a requirement to extract text from a scanned PDF document. We are doing English language OCR. We have been able to use GDPicture to do that, but a lot of the extracted text is not correct.
We thought we might get better results if we converted the PDF to TIF first and then ran OCR on it. The results were a little better than before, but the text still had a lot of inaccuracies.
Then we tried converting the PDF to TIF using a separate product called 2Tiff. When we ran GDPicture OCR on that TIF, the results were much better and more accurate.
I have attached the original TIF files and their results.
Could you please tell us what 2Tiff does during conversion that GDPicture does not, given that the 2Tiff output gives better OCR results with the same GDPicture OCR engine? Is there a way to improve GDPicture's TIF conversion from PDF?
https://drive.google.com/file/d/1mNfOCZ ... sp=sharing
May I ask you to provide the exact code snippet you are using for OCR so we can replicate your issue? We do not know what 2Tiff is doing. To provide support for the GdPicture.NET toolkit, we need to reproduce the issue using the current release. Then we can investigate it further.
Thank you for your understanding. We are waiting for the code and the exact steps to replicate the issue.
Code:
Private Function ConvertTifToOCR(TifFilename As String, textFilename As String) As Boolean
    Dim inputTifObj As GdPictureImaging = New GdPictureImaging()
    Dim imageID As Integer = inputTifObj.CreateGdPictureImageFromFile(TifFilename)
    If inputTifObj.GetStat() <> GdPictureStatus.OK Then
        MessageBox.Show("The Tif file can't be opened. Error: " + inputTifObj.GetStat().ToString())
        inputTifObj.Dispose()
        Return False
    End If

    ' A single-page TIFF still has one page to process, so default to 1.
    Dim isMultiPage As Boolean = inputTifObj.TiffIsMultiPage(imageID)
    Dim pageCount As Integer = 1
    If isMultiPage Then
        pageCount = inputTifObj.TiffGetPageCount(imageID)
    End If

    ' Configure the OCR engine once, before the page loop.
    Dim ocrObj As GdPictureOCR = New GdPictureOCR()
    ocrObj.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    ocrObj.CharacterSet = ""
    ocrObj.AddLanguage(OCRLanguage.English)
    ocrObj.OCRMode = OCRMode.FavorAccuracy

    Dim resID As String = "page"
    Dim content As String = Nothing
    ' Using closes the output file even if an exception is thrown.
    Using stream As System.IO.StreamWriter = New System.IO.StreamWriter(textFilename)
        For i As Integer = 1 To pageCount
            If isMultiPage Then
                inputTifObj.TiffSelectPage(imageID, i)
            End If
            If ocrObj.SetImage(imageID) = GdPictureStatus.OK Then
                ocrObj.RunOCR(resID)
                If ocrObj.GetStat() = GdPictureStatus.OK Then
                    content = ocrObj.GetOCRResultText(resID)
                    If ocrObj.GetStat() = GdPictureStatus.OK Then
                        ' Separate pages with a form feed.
                        stream.WriteLine(content & vbFormFeed & vbCrLf)
                    End If
                Else
                    MessageBox.Show("The Ocr didn't process. Error: " + ocrObj.GetStat().ToString())
                End If
            Else
                MessageBox.Show("The image can't be set. Error: " + ocrObj.GetStat().ToString())
            End If
            ocrObj.ReleaseOCRResult(resID)
        Next
    End Using

    ' Release the image and both toolkit objects on the success path too.
    inputTifObj.ReleaseGdPictureImage(imageID)
    ocrObj.Dispose()
    inputTifObj.Dispose()
    MessageBox.Show("Tif file processed through OCR")
    Return True
End Function
I would like to explain some more details about OCR here. From what I see, you saved the scanned pages in a PDF document. The GdPictureOCR class works on the scanned image, so I would recommend scanning directly to TIFF. Next, you need to scan at an appropriate DPI so that the scanned page is readable. You can also improve the accuracy of the OCR text by using a different set of language data files; for further details, read here:
https://github.com/tesseract-ocr/tesser ... Data-Files
There are different language files for fast OCR and for accurate OCR. Finally, the OCR text will be more accurate when you run OCR on regions rather than on whole pages. I hope this helps.
Running GDPicture OCR on PDFs produced worst results in terms of text accuracy.
Running GDPicture OCR on TIF converted from PDF using GDPicture produced better results in terms of text accuracy.
Running GDPicture OCR on TIF converted from PDF using 2Tiff produced best results in terms of text accuracy.
We are definitely using the accurate OCR trained files.
Here is an interesting source that can be useful:
https://github.com/tesseract-ocr/tesser ... oveQuality
Thank you also for creating a support ticket.
Finally, we figured out that the source PDF has internal page rotation. After fixing this with the NormalizePage() method, the OCR results are excellent and there is no need to convert to TIFF.
Maybe this will help others as well.
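For anyone who wants to try the same fix, here is a minimal sketch of normalizing every page before running OCR. It is not the exact code from the support ticket. It assumes the GdPicture14 namespace and the GdPicturePDF methods LoadFromFile, GetPageCount, SelectPage, NormalizePage, SaveToFile, CloseDocument and Dispose, so please check the signatures against your installed version. The function name and both path parameters are hypothetical.

```vbnet
Imports GdPicture14 ' Assumption: GdPicture.NET 14, as suggested by the resource folder path above.

Module NormalizeBeforeOcr
    ' Resets the internal rotation of every page and saves a normalized copy.
    ' sourcePdf and normalizedPdf are hypothetical paths used for illustration.
    Function NormalizeAllPages(sourcePdf As String, normalizedPdf As String) As Boolean
        Dim pdf As New GdPicturePDF()
        Try
            ' Assumption: the second argument (LoadInMemory) is False to load from disk.
            If pdf.LoadFromFile(sourcePdf, False) <> GdPictureStatus.OK Then
                Return False
            End If

            Dim succeeded As Boolean = True
            For pageNo As Integer = 1 To pdf.GetPageCount()
                ' NormalizePage() works on the currently selected page.
                If pdf.SelectPage(pageNo) <> GdPictureStatus.OK OrElse
                   pdf.NormalizePage() <> GdPictureStatus.OK Then
                    succeeded = False
                    Exit For
                End If
            Next

            ' Save only if every page was normalized.
            If succeeded Then
                succeeded = (pdf.SaveToFile(normalizedPdf) = GdPictureStatus.OK)
            End If
            pdf.CloseDocument()
            Return succeeded
        Finally
            pdf.Dispose()
        End Try
    End Function
End Module
```

The normalized copy can then be passed to whatever OCR routine you already use. Saving to a separate file keeps the original scan untouched, so you can compare the OCR output before and after normalization.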