My code works perfectly for some pdf, but some show error:
Traceback (most recent call last):
File "con.py", line 24, in <module>
print getPDFContent("abc.pdf")
File "con.py", line 17, in getPDFContent
f.write(a)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02dd' in position 64: ordinal not in range(128)
My code is
import pyPdf
def getPDFContent(path):
content = ""
pdf = pyPdf.PdfFileReader(file(path, "rb"))
for i in range(0, pdf.getNumPages()):
f=open("xxx.txt",'a')
content= pdf.getPage(i).extractText() + "\n"
import string
c=content.split()
for a in c:
f.write(" ")
f.write(a)
f.write('\n')
f.close()
return content
print getPDFContent("abc.pdf")
Try
import sys
print getPDFContent("abc.pdf").encode(sys.getfilesystemencoding())
Your problem is that when you call f.write()
with a string, it is trying to encode it using the ascii
codec. Your pdf contains characters that can not be represented by the ascii
codec. Try explicitly encoding your str
, eg
a = a.encode('utf-8')
f.write(a)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.