I have been trying to write the output to a new text file, but I get this error:
TypeError: expected a character buffer object
What I'm trying to do is convert a PDF to text and write the resulting text to a new file.
import pyPdf

def getPDFContent():
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file("D:\output.pdf", "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        #content += pdf.getPage(i).extractText() + "\n"
        print pdf.getPage(i).extractText().encode("ascii", "ignore")
    # Collapse whitespace
    #content = " ".join(content.replace(u"\xa0", " ").strip().split())
    #return content

#getPDFContent().encode("ascii", "ignore")
getPDFContent()
s = getPDFContent()
with open('D:\pdftxt.txt', 'w') as pdftxt:
    pdftxt.write(s)
I did try to initialize s as str, but then I get the error "can't assign to function call".
You are not returning anything from getPDFContent(), so you are effectively writing None.
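To see why the write fails, here is a minimal sketch in Python 3 syntax. The function name print_only is hypothetical, and an in-memory io.StringIO stands in for the file on disk. It shows that a function with no return statement evaluates to None, and that passing None to write raises TypeError:

```python
import io

def print_only():
    # Prints its text but has no return statement
    print("page text")

value = print_only()  # value is None because nothing was returned

buffer = io.StringIO()  # in-memory stand-in for the output file
try:
    buffer.write(value)  # write() expects a string, not None
    write_failed = False
except TypeError:
    write_failed = True
```

Returning the text from the function, instead of only printing it, avoids this.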
    result = []
    for i in range(0, pdf.getNumPages()):
        result.append(pdf.getPage(i).extractText().encode("ascii", "ignore"))  # store all in a list
    return result

s = getPDFContent()
with open('D:\pdftxt.txt', 'w') as pdftxt:
    pdftxt.writelines(s)  # use writelines to write list content
How your code should look:
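import pyPdf

def getPDFContent():
    # Load PDF into pyPDF (raw string so the backslash is taken literally)
    pdf = pyPdf.PdfFileReader(file(r"D:\output.pdf", "rb"))
    # Iterate pages and collect the text of each one
    result = []
    for i in range(0, pdf.getNumPages()):
        result.append(pdf.getPage(i).extractText().encode("ascii", "ignore"))
    # Return the list so the caller has something to write
    return result

s = getPDFContent()
with open(r'D:\pdftxt.txt', 'w') as pdftxt:
    # writelines accepts a list of strings
    pdftxt.writelines(s)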
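Note that writelines writes the strings back to back and does not add line breaks between them. If you want the text of each page separated, one option is to join the list with newlines before writing. A minimal sketch in Python 3 syntax, where write_pages is a hypothetical helper and the sample strings stand in for the extracted page text (no PDF library is involved):

```python
import os
import tempfile

def write_pages(pages, path):
    # Join the page texts with a newline between them and write once
    with open(path, "w") as out:
        out.write("\n".join(pages))

# Stand-in data; in the real script this would be the list returned
# by getPDFContent()
pages = ["first page text", "second page text"]
path = os.path.join(tempfile.mkdtemp(), "pdftxt.txt")
write_pages(pages, path)

# Read the file back to check what was written
with open(path) as f:
    written = f.read()
```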