
EOF marker not found when using PyPDF2 to merge PDF files in Python

When I use the following code:

from PyPDF2 import PdfFileMerger

merge = PdfFileMerger()

for newFile in nlst:
    merge.append(newFile)
merge.write("newFile.pdf")

I get the following error:

raise utils.PdfReadError("EOF marker not found")

PyPDF2.utils.PdfReadError: EOF marker not found

Can anybody tell me what happened?

PDF is a file format where a parser normally starts by reading some global information located at the end of the file. At the very end of the document there needs to be a line with the content:

%%EOF

This marker tells the PDF parser that the document ends here, and that the global information it needs (a startxref section) should come right before it.

I guess the error message you see means that one of the input documents was truncated and is missing this %%EOF marker.
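To find out which input file is affected, one option is to look at the last bytes of each file before merging. This is only a sketch: the helper names has_eof_marker and find_suspect_files and the 1024-byte window are assumptions made for illustration, not part of PyPDF2.

```python
import os

def has_eof_marker(path, tail_size=1024):
    """Return True if b'%%EOF' appears in the last tail_size bytes of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Jump close to the end; files smaller than the window are read whole.
        f.seek(max(0, size - tail_size))
        return b"%%EOF" in f.read()

def find_suspect_files(paths):
    """Return the paths whose tail does not contain the marker."""
    return [p for p in paths if not has_eof_marker(p)]
```

Calling find_suspect_files(nlst) before the merge loop would list the files worth inspecting. A file with a lot of trailing junk after the marker may also be reported, since only the tail is examined.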

After encountering this problem using camelot and PyPDF2, I did some digging and solved it.

The end-of-file marker '%%EOF' is meant to be the very last line, but some PDF files have a huge chunk of JavaScript after this line, and the reader cannot find the EOF.

Illustration of what the EOF plus JavaScript looks like if you open the file:

 b'>>\r\n',
 b'startxref\r\n',
 b'275824\r\n',
 b'%%EOF\r\n',
 b'\n',
 b'\n',
 b'<script type="text/javascript">\n',
 b'\twindow.parent.focus();\n',
 b'</script><!DOCTYPE html>\n',
 b'\n',
 b'\n',
 b'\n',

So you just need to truncate the file before the JavaScript begins.

Solution:

import PyPDF2

def reset_eof_of_pdf_return_stream(pdf_stream_in: list):
    # fallback: keep every line if no EOF marker is found
    actual_line = len(pdf_stream_in)

    # find the line position of the last EOF marker, searching from the end
    for i, x in enumerate(pdf_stream_in[::-1]):
        if b'%%EOF' in x:
            actual_line = len(pdf_stream_in) - i
            print(f'EOF found at line position {-i} = actual {actual_line}, with value {x}')
            break

    # return the list up to and including the EOF line
    return pdf_stream_in[:actual_line]

# open the file for reading in binary mode
with open('data/XXX.pdf', 'rb') as p:
    txt = p.readlines()

# get the new list terminating correctly
txtx = reset_eof_of_pdf_return_stream(txt)

# write to a new pdf
with open('data/XXX_fixed.pdf', 'wb') as f:
    f.writelines(txtx)

fixed_pdf = PyPDF2.PdfFileReader('data/XXX_fixed.pdf')

One simple solution for this problem (EOF marker not found): open your .pdf file in another application (I used LibreOffice Draw on Ubuntu 18.04), then export the file as .pdf. With the exported .pdf file the problem does not persist.

I've also had that problem and found a solution.

First, Python opens PDF files in binary mode: 'rb' for reading and 'wb' for writing.

END OF FILE

Occurs when there was an open parenthesis somewhere on a line, but no matching closing parenthesis. Python reached the end of the file while looking for the closing parenthesis.

Here is one solution:

  1. Close the file that you opened earlier using this command:

    newfile.close()

  2. Check whether that PDF is also open under another variable, and close that one too:

    Same_file_with_another_variable.close()

Now open it only once and use it, and you are good to go.
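One plausible reading of this advice: if the PDF was written earlier in the same script and its handle was never closed, the last bytes (including the %%EOF line) may not have been flushed to disk yet. A minimal sketch using only the standard library, with with blocks doing the closing; the file name and the fake PDF bytes are made up for illustration.

```python
import os
import tempfile

# Hypothetical content standing in for a real PDF.
fake_pdf = b"%PDF-1.4\n...\nstartxref\n0\n%%EOF\n"

path = os.path.join(tempfile.mkdtemp(), "example.pdf")

# The with block closes (and flushes) the file when it ends,
# so the data is completely on disk before anything reads it.
with open(path, "wb") as out:
    out.write(fake_pdf)

# Reopen the file in a separate with block for reading.
with open(path, "rb") as src:
    data = src.read()

# Confirm the marker made it to disk.
print(data.endswith(b"%%EOF\n"))
```

With this pattern there is no handle left to close by hand, and each file is opened exactly once per use.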

PdfReadError
in ----> 1 read_pdf=PyPDF2.PdfFileReader(pdf_file)

PdfReadError: EOF marker not found

I wanted to add my hacky solution to this issue.

I had the same error with Python requests (application/pdf). In my case the provider (a shipping labeling service) did return a 200 and a b'string representing the PDF, but in some random cases the EOF marker was missing.

Because it was random, I came up with the following solution:

for obj in label_objects:
    get_label = api.get_label(label_id=obj.label_id)
    while not 'EOF' in str(get_label.content):
        get_label = api.get_label(label_id=obj.label_id)

After a few tries it returns the b'string with the EOF and we're good to proceed.
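The loop above never stops if the provider keeps returning broken responses. A bounded variant could look like the sketch below; fetch_until_eof, max_attempts and delay are hypothetical names introduced here, and fetch stands for any zero-argument callable that returns the response bytes.

```python
import time

def fetch_until_eof(fetch, max_attempts=5, delay=0.0):
    """Call fetch() until the returned bytes contain b'%%EOF'.

    fetch is a hypothetical zero-argument callable returning bytes.
    Raises RuntimeError when every attempt comes back without the marker.
    """
    for attempt in range(1, max_attempts + 1):
        content = fetch()
        if b"%%EOF" in content:
            return content
        if attempt < max_attempts:
            # Optional pause before asking again.
            time.sleep(delay)
    raise RuntimeError(f"no %%EOF marker after {max_attempts} attempts")
```

Inside the loop from the answer, this would be called as fetch_until_eof(lambda: api.get_label(label_id=obj.label_id).content), assuming content holds bytes as it does for a requests response.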

PyPDF2 cannot find the EOF marker in a PDF that is encrypted.

I came across the same error while working through the (excellent) Automate The Boring Stuff: Chapter 15, 2nd edition, page 355, project Combining Select Pages from Many PDFs.

I chose to combine all the PDFs I had made during this chapter into one document. One of them was an encrypted PDF, and the project failed when it got to the end of the encrypted document with the error message:

PyPDF2.utils.PdfReadError: EOF marker not found

I moved the encrypted file to a different folder (so it would not be merged with the other PDFs) and the project worked fine.

So, it seems PyPDF2 cannot find the EOF marker in a PDF that is encrypted.

I had the same problem. For me the solution was to close the previously opened file before working with it again.
