
Extracting and merging PDFs via PyPDF2

I am a little stuck. I am trying to merge all the PDF files in a working directory and extract their text. Then I would like to store the data in CSV form to run additional analysis on it. However, I keep getting a PyPDF2.utils.PdfReadError: EOF marker not found error. I have checked the available resources, but I am still struggling.

import PyPDF2
import os
from PyPDF2 import PdfFileMerger, PdfFileReader

merger = PdfFileMerger()
for filename in os.listdir():
    with open(filename,"rb") as source:
        tmp = PdfFileReader(source)
        merger.append(tmp)

tmp.write('tmp.csv', 'wb')
tmp.close()

There is a small mistake in your code: you create the tmp variable inside the loop but use it outside the loop to write the output. Also, as far as I know, you don't need to open each file with open and then create a PdfFileReader object in order to merge; PdfFileMerger.append accepts a file name directly. Try this simpler approach for merging multiple PDF files:

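import os
from PyPDF2 import PdfFileMerger

output = 'merged.pdf'
merger = PdfFileMerger()

# Append only PDF files: os.listdir() also returns scripts and other
# files, and passing those to the merger raises PdfReadError.
for pdffile in sorted(os.listdir()):
    if pdffile.lower().endswith('.pdf') and pdffile != output:
        merger.append(pdffile)

# The merged result is itself a PDF, so give it a .pdf name, not .csv.
merger.write(output)
merger.close()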
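
The EOF marker not found error is most likely raised because os.listdir() returns every file in the directory (the script itself, for instance), not only PDFs. Merging also does not produce CSV data, so the text has to be extracted separately. Below is a minimal sketch of both steps, not a definitive implementation. The helper names find_pdf_files and extract_text_to_csv and the CSV columns are made up for illustration, and the five-byte "%PDF-" check is only a heuristic. The PyPDF2 calls use the same older names as the code above (PdfFileReader, getNumPages, getPage, extractText); newer releases rename them (PdfReader, pages, extract_text).

```python
import csv
import os


def find_pdf_files(directory="."):
    """Return sorted paths of files in `directory` that look like PDFs."""
    pdf_paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or not name.lower().endswith(".pdf"):
            continue  # skip folders, scripts, CSVs, and other non-PDF files
        with open(path, "rb") as handle:
            # Heuristic: a PDF file normally starts with the "%PDF-" signature.
            if handle.read(5) == b"%PDF-":
                pdf_paths.append(path)
    return pdf_paths


def extract_text_to_csv(pdf_paths, csv_path):
    """Write one CSV row (filename, page number, text) per PDF page."""
    # Imported here so find_pdf_files can be used without PyPDF2 installed.
    # Assumes an older PyPDF2 release that still provides PdfFileReader.
    from PyPDF2 import PdfFileReader

    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "page", "text"])
        for path in pdf_paths:
            with open(path, "rb") as source:
                reader = PdfFileReader(source)
                for index in range(reader.getNumPages()):
                    text = reader.getPage(index).extractText()
                    writer.writerow([os.path.basename(path), index + 1, text])
```

To process the current working directory, call extract_text_to_csv(find_pdf_files(), "pdf_text.csv"). A file that has a .pdf name but is truncated or corrupt can still raise PdfReadError, so wrapping the per-file work in a try/except may be worthwhile.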
