
Tesseract OCR accent problems, image enhancement not enough

I really need your help with Tesseract. I'm using Tesseract and pdf2image to extract information from a scanned PDF file. My problem is that Tesseract misreads the accented characters é, è and ê (I'm French) and confuses the lowercase "i" with the uppercase "I". I tried preprocessing the images first but can't get any good output.

This is the code I'm using:

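import os
from tkinter.filedialog import askopenfilename

import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'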

filePath = askopenfilename()
img = convert_from_path(filePath,poppler_path=r'C:\poppler-0.68.0_x86\poppler-0.68.0\bin')
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)


for page_number in range(len(img)):
    img[page_number].save(r'C:\Users\488096\Documents\page'+ str(page_number) +'.jpg', 'JPEG')

    
work_img = None
txt = ''  # accumulated OCR text across all pages
# Tesseract
custom_config = r'--oem 3 --psm 6'
kernel = np.ones((1, 1), np.uint8)

for page_number in range(len(img)):
    img1 = cv2.imread(r'C:\Users\488096\Documents\page'+ str(page_number) +'.jpg')
    # Preprocess the image to improve character recognition
    gray = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    # Remove shadows
    cool_img = cv2.dilate(gray, kernel, iterations=1)
    norm_img = cv2.erode(cool_img, kernel, iterations=1)
    # Threshold using Otsu's
    work_img = cv2.threshold(cv2.bilateralFilter(norm_img, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # Run OCR on the preprocessed page and append the recognized text
    txt = txt + pytesseract.image_to_string(work_img, config=custom_config)
    print("Page # {} - {}".format(str(page_number),txt))

What can I do to obtain good results? Thanks a lot!

Maybe you have to install the French language pack. More info here:

https://pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/
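To illustrate the language pack suggestion: the code in the question never passes a language to pytesseract, so Tesseract uses its default English model, which would explain the lost accents. Below is a minimal sketch, assuming the French traineddata file (`fra`) is installed alongside Tesseract. The helper names `require_language` and `ocr_page` are made up for this example and are not part of pytesseract.

```python
def require_language(installed, wanted="fra"):
    """Return `wanted` if Tesseract lists it as installed, otherwise raise."""
    if wanted not in installed:
        raise RuntimeError(
            f"Tesseract language '{wanted}' is not installed; found: {sorted(installed)}"
        )
    return wanted


def ocr_page(image, wanted="fra", config="--oem 3 --psm 6"):
    """OCR one page image with an explicit language model (hypothetical helper)."""
    # Imported here so require_language() can be used without Tesseract present.
    import pytesseract

    # get_languages() asks the Tesseract binary which traineddata files it can see.
    lang = require_language(pytesseract.get_languages(config=""), wanted)
    # Passing lang selects the French model instead of the default English one.
    return pytesseract.image_to_string(image, lang=lang, config=config)
```

In the loop from the question this would be called as `txt = txt + ocr_page(work_img)`. If `fra` is missing, the error message lists what Tesseract actually found, which makes an incomplete installation easy to spot.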

Furthermore, you can use ocrmypdf; for me, it is the easiest way to extract text from PDFs: https://pypi.org/project/ocrmypdf/
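As a rough sketch of the ocrmypdf route, the snippet below builds and runs the `ocrmypdf` command line with the French language selected and a sidecar text file requested. It assumes the `ocrmypdf` executable is on the PATH and that the French Tesseract data is installed. The file names and the two helper functions are hypothetical and only serve the example.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language="fra", sidecar=None):
    """Assemble the ocrmypdf command line as a list of arguments."""
    cmd = ["ocrmypdf", "-l", language]
    if sidecar is not None:
        # --sidecar writes the recognized text to a separate plain-text file.
        cmd += ["--sidecar", sidecar]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocrmypdf(input_pdf, output_pdf, **kwargs):
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, **kwargs), check=True)
```

For example, `run_ocrmypdf("scan.pdf", "scan_ocr.pdf", sidecar="scan.txt")` would produce a searchable PDF plus a text file containing the recognized text.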


 