Convert PDF to text file in Python

Question

My code works perfectly for some PDFs, but for others it raises this error:

Traceback (most recent call last):
  File "con.py", line 24, in <module>
    print getPDFContent("abc.pdf")
  File "con.py", line 17, in getPDFContent
    f.write(a)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02dd' in position 64: ordinal not in range(128)

My code is:

import pyPdf

def getPDFContent(path):

    content = ""

    pdf = pyPdf.PdfFileReader(file(path, "rb"))

    for i in range(0, pdf.getNumPages()):
        f=open("xxx.txt",'a')
        content= pdf.getPage(i).extractText() + "\n"
        import string
        c=content.split()
        for a in c:
            f.write(" ")
            f.write(a)
        f.write('\n')
        f.close()

    return content

print getPDFContent("abc.pdf")

Answer 1

Try

import sys
print getPDFContent("abc.pdf").encode(sys.getfilesystemencoding())

Answer 2

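Your problem is that extractText() returns a unicode string, and when you call f.write() with a unicode string on a file opened with plain open(), Python 2 tries to encode it implicitly using the ascii codec. Your PDF contains characters that cannot be represented in ASCII (here u'\u02dd'), so the implicit encode fails. Try explicitly encoding the string before writing it, e.g.: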

a = a.encode('utf-8')
f.write(a)
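
An alternative to calling .encode() on every word is to open the output file with an explicit encoding, so the file object does the encoding on write. The following is only a minimal sketch of that idea: write_pages and its parameters are hypothetical names, not part of pyPdf, and it assumes the page texts are already unicode strings such as those returned by extractText(). It uses io.open, which is available in both Python 2.6+ and Python 3. Unlike the loop in the question, it does not write a leading space before the first word of each page.

```python
import io


def write_pages(pages, out_path):
    """Append each page's text to out_path as UTF-8, one page per line.

    `pages` is any iterable of unicode strings (hypothetical helper,
    not part of pyPdf).
    """
    # io.open encodes to UTF-8 on write, so no manual .encode() is needed
    with io.open(out_path, "a", encoding="utf-8") as f:
        for text in pages:
            # split() + join collapses runs of whitespace into single spaces
            f.write(u" ".join(text.split()) + u"\n")
```

With the reader from the question, the pages could be supplied as a generator, for example write_pages((pdf.getPage(i).extractText() for i in range(pdf.getNumPages())), "xxx.txt").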
