How can I fix my Chinese PDF parsed in Apache Tika for Python to read the characters correctly?

I have a downloaded PDF in Chinese that I want to extract text from using Tika + Python (the original link to the full PDF can be found here, or an extracted sample page can be found here).

When I ran the following code

from tika import parser

analysed_file = 'D:\\5_Programming\\test.pdf'

# Parse the file via the Tika server running locally
file_data = parser.from_file(analysed_file, "http://localhost:9998/")

# Get the file's text content
text = file_data['content']
print(text)

it printed hollow boxes in the command line. When I copy those boxes and paste a sample here it looks like

£Î £á £÷ £á £ú £¬ £ó £è £õ £ê £á ÄÇ Íß ×È £¬ Êæ ¼Ó

£Ï £æ £æ £é £ã £å £ò £¬ £Ì £® £È £® °Â ·Æ ɪ £¬ £Ì £® £È £®

£Ð £á £õ £ì £ó £¬ £Â £® £Ä £é £á £î £å ±£ ¶û ˹ £¬ £Â £® ÷ì °² ÄÈ

I created a PDF using Latin characters, parsed it with the exact same script, and it printed completely fine in the command line.

I opened the file in Acrobat to troubleshoot and it gave me the error message "Cannot find or create the font [garbled characters]". It also displayed all characters as bullets, which is apparently what it does when it doesn't recognise the font (https://helpx.adobe.com/au/acrobat/using/pdf-fonts.html).

However, in the Google Chrome PDF viewer the entire text is being displayed correctly in Chinese.

What is Google Chrome doing differently that allows it to be read while it appears garbled in Adobe Acrobat and Tika + Python, and how might I fix this issue with the PDF to allow Tika to parse it correctly? Is it an encoding or font issue? I am not directly concerned with it printing correctly in Acrobat.
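One quick way to probe the "encoding or font" question is to treat the garbled sample as raw bytes and decode them again. The sketch below assumes the printed characters are GB2312 bytes that were shown as Latin-1; that is a guess based on the sample, not something confirmed for the whole PDF, and `redecode` is a hypothetical helper name.

```python
import unicodedata

# First five character pairs copied from the garbled output above.
garbled = "£Î £á £÷ £á £ú"


def redecode(mojibake: str, encoding: str = "gb2312") -> str:
    """Map each character back to one byte (Latin-1), then decode those bytes.

    Assumption: the text is GB2312 data that was displayed as Latin-1.
    """
    raw = mojibake.replace(" ", "").encode("latin-1")
    return raw.decode(encoding)


recovered = redecode(garbled)
# GB2312 stores these letters in full-width form; NFKC folds them to ASCII.
print(unicodedata.normalize("NFKC", recovered))  # prints "Nawaz"
```

If fragments decode to readable names like this, the text is present in the output and only the byte-to-character mapping is wrong, which points toward an encoding problem rather than missing content. Note that some characters in the sample (for example ɪ) are outside Latin-1, so this helper would raise `UnicodeEncodeError` on those parts; it is a diagnostic, not a repair for the full output.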

Hey, welcome to the Stack Overflow community. It is possible that the Chinese fonts aren't installed in Adobe Reader. You can install them from this link (scroll to the section called Add-Ons). There are two font packs available. Try installing these and let me know how it goes.
Regards,
Truly Amazing Vidoes by Ravi Arora

You can use Apache Tika together with the Tesseract OCR parser, started as a Docker image (see the blog post).

Then you have to add the proper language pack for Tesseract, for instance tesseract-ocr-chi-sim (Simplified Chinese); see the list of available languages.

docker exec -it tika-server-ocr /bin/bash
apt-get update
apt-get install tesseract-ocr-chi-sim

Then you need to enable OCR (for PDF parsing) and set Chinese as the language:

curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_ONLY" -H "X-Tika-OCRLanguage: chi_sim" -T test.pdf localhost:9998/tika
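Since the question uses Python, the same request can be sketched with the standard library alone. This mirrors the headers from the curl command, with the language written as chi_sim (the name Tesseract uses for its Simplified Chinese data). The function names, the `Accept: text/plain` header, and the UTF-8 decoding of the response are assumptions added for illustration, and the server is assumed to be reachable at localhost:9998.

```python
import urllib.request

# Headers mirroring the curl command; "Accept" is an added assumption
# to ask the server for plain text.
OCR_HEADERS = {
    "X-Tika-PDFextractInlineImages": "true",
    "X-Tika-PDFocrStrategy": "OCR_ONLY",
    "X-Tika-OCRLanguage": "chi_sim",
    "Accept": "text/plain",
}


def build_ocr_request(pdf_bytes, server="http://localhost:9998"):
    """Build (but do not send) a PUT request to the /tika endpoint, like curl -T."""
    url = server.rstrip("/") + "/tika"
    return urllib.request.Request(
        url, data=pdf_bytes, headers=OCR_HEADERS, method="PUT"
    )


def extract_text(pdf_path, server="http://localhost:9998"):
    """Send the PDF to the Tika server and return the OCR text it responds with."""
    with open(pdf_path, "rb") as fh:
        request = build_ocr_request(fh.read(), server)
    with urllib.request.urlopen(request) as response:
        # Assumption: the server answers in UTF-8.
        return response.read().decode("utf-8")
```

With the container running, `print(extract_text("test.pdf"))` would be the counterpart of the curl call. OCR reads the rendered page images instead of the embedded text layer, so it sidesteps the font mapping problem at the cost of speed and some recognition errors.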
