
Convert PDF to text file in Python

My code works perfectly for some PDFs, but for others it fails with this error:

Traceback (most recent call last):
  File "con.py", line 24, in <module>
    print getPDFContent("abc.pdf")
  File "con.py", line 17, in getPDFContent
    f.write(a)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02dd' in position 64: ordinal not in range(128)

My code is

import pyPdf

def getPDFContent(path):

    content = ""

    pdf = pyPdf.PdfFileReader(file(path, "rb"))

    for i in range(0, pdf.getNumPages()):
        f=open("xxx.txt",'a')
        content= pdf.getPage(i).extractText() + "\n"
        import string
        c=content.split()
        for a in c:
            f.write(" ")
            f.write(a)
        f.write('\n')
        f.close()

    return content

print getPDFContent("abc.pdf")

Try

import sys
print getPDFContent("abc.pdf").encode(sys.getfilesystemencoding())

Your problem is that when you call f.write() with a unicode string, Python 2 tries to encode it using the ascii codec. Your PDF contains characters that cannot be represented in ASCII. Try explicitly encoding the string before writing it, e.g.:

a = a.encode('utf-8')
f.write(a)
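
An alternative that avoids encoding each piece by hand is to open the output file with an explicit encoding, so that every write is encoded for you. The following is only a minimal sketch: write_page_text and the sample string are hypothetical names used for illustration, and the sample stands in for whatever pdf.getPage(i).extractText() returns. io.open is in the standard library and behaves the same way on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_page_text(text, out_path):
    """Append one page of extracted text to out_path as UTF-8 (hypothetical helper)."""
    # Collapse runs of whitespace, similar to the split()/write loop in the question.
    line = u" ".join(text.split())
    # io.open encodes on write, so no implicit ASCII conversion takes place.
    with io.open(out_path, "a", encoding="utf-8") as f:
        f.write(line + u"\n")


# Stand-in for extracted page text; it includes the U+02DD character
# reported in the traceback.
sample = u"caf\u00e9   \u02dd  text"
out_path = os.path.join(tempfile.mkdtemp(), "xxx.txt")
write_page_text(sample, out_path)

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, encoding="utf-8") as f:
    result = f.read()
```

With this approach the file object only accepts unicode text, so the encoding decision is made once when the file is opened rather than at every f.write() call.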
