Best way to disarm user-uploaded PDFs

Question

I am accepting PDFs as user input. I know that the uploaded PDFs should not, and do not need to, contain any content types that could be used maliciously, such as JS or AA. For example, this is what a clean PDF should look like (inspected using Didier Stevens' PDFiD):

    <Keyword Count="59" HexcodeCount="0" Name="obj"/>
    <Keyword Count="59" HexcodeCount="0" Name="endobj"/>
    <Keyword Count="19" HexcodeCount="0" Name="stream"/>
    <Keyword Count="19" HexcodeCount="0" Name="endstream"/>
    <Keyword Count="2" HexcodeCount="0" Name="xref"/>
    <Keyword Count="2" HexcodeCount="0" Name="trailer"/>
    <Keyword Count="2" HexcodeCount="0" Name="startxref"/>
    <Keyword Count="12" HexcodeCount="0" Name="/Page"/>
    <Keyword Count="0" HexcodeCount="0" Name="/Encrypt"/>
    <Keyword Count="0" HexcodeCount="0" Name="/ObjStm"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JS"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JavaScript"/>
    <Keyword Count="0" HexcodeCount="0" Name="/AA"/>
    <Keyword Count="0" HexcodeCount="0" Name="/OpenAction"/>
    <Keyword Count="0" HexcodeCount="0" Name="/AcroForm"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JBIG2Decode"/>
    <Keyword Count="0" HexcodeCount="0" Name="/RichMedia"/>
    <Keyword Count="0" HexcodeCount="0" Name="/Launch"/>
    <Keyword Count="0" HexcodeCount="0" Name="/EmbeddedFile"/>
    <Keyword Count="0" HexcodeCount="0" Name="/XFA"/>
    <Keyword Count="0" HexcodeCount="0" Name="/Colors &gt; 2^24"/>

My current user input validation looks at all the content types and rejects the PDF if any count from "Encrypt" down is != 0.
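The check described above could be sketched roughly as follows. This is only an illustration: it assumes the PDFiD report is already available as an XML string containing `Keyword` elements like the ones shown, and the names `SUSPICIOUS` and `suspicious_keywords` are hypothetical, not part of PDFiD.

```python
import xml.etree.ElementTree as ET

# Keyword names from "/Encrypt" down in the report shown above.
SUSPICIOUS = {
    "/Encrypt", "/ObjStm", "/JS", "/JavaScript", "/AA", "/OpenAction",
    "/AcroForm", "/JBIG2Decode", "/RichMedia", "/Launch", "/EmbeddedFile",
    "/XFA", "/Colors > 2^24",
}

def suspicious_keywords(pdfid_xml: str) -> dict:
    """Return {keyword name: count} for every suspicious keyword with a non-zero count."""
    root = ET.fromstring(pdfid_xml)
    hits = {}
    # iter() finds Keyword elements at any depth, so the wrapper element does not matter.
    for keyword in root.iter("Keyword"):
        name = keyword.get("Name")
        count = int(keyword.get("Count", "0"))
        if name in SUSPICIOUS and count != 0:
            hits[name] = count
    return hits
```

An empty result means the PDF passes this check; anything else lists the keywords that caused the rejection.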

I believe that when people hit "print to PDF", some converter software adds some of these content types. So I am currently rejecting PDFs even when the suspicious content type is actually innocent. Of course there is no way for me to determine whether the JS is innocent, but I'd like to disarm the JS and continue with the file.

Is there a way to take a PDF in memory and automatically disarm / defuse it, overwriting the previous file? I would like to do something like this:

SuspectPDF = request.FILES['docfile'][0]
CleanPDF = disarmPDF(SuspectPDF)

I know that PDFiD has a disarm function, but I'm not sure it can accomplish what I want in memory. I am interested to know whether there is another, more commonly used solution for validating user-supplied PDFs, and whether there is anything else to be aware of here.
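To make the idea concrete, an in-memory disarm step might look something like the sketch below. It is not PDFiD's implementation: it is a naive byte-level pass that swaps the case of a few action-related names so a reader no longer recognizes them, while keeping the file length (and therefore the xref offsets) unchanged. The function name `disarm_pdf_bytes` and the list of names are assumptions for illustration. A pass like this cannot see names written with hex escapes or hidden inside compressed object streams, so it should not be treated as a complete defense.

```python
import re

# Names that trigger scripts or automatic actions (illustrative, not exhaustive).
_ACTIVE_NAMES = [b"/JavaScript", b"/JS", b"/AA", b"/OpenAction", b"/Launch"]

# Match a name only when it is followed by whitespace, a PDF delimiter, or the
# end of the data, so that "/JS" does not match the start of a longer name.
_PATTERN = re.compile(
    b"(" + b"|".join(re.escape(name) for name in _ACTIVE_NAMES) + b")"
    + rb"(?=[\s/<>\[\]()]|$)"
)

def disarm_pdf_bytes(data: bytes) -> bytes:
    """Swap the case of active-content names; the output has the same length as the input."""
    return _PATTERN.sub(lambda match: match.group(1).swapcase(), data)
```

In a Django view the uploaded file's bytes would typically come from something like `request.FILES['docfile'].read()`, and the returned bytes could be written back in place of the original.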

Answer 1

The best way is to extract all the content, markup and instructions you need (text, images, form data, annotations, fonts etc.) and throw the PDF away.

A keyword-based solution will not work, because every PDF (even a potentially armed one) will definitely contain some keywords (such as xref, obj/endobj etc.) and may not contain others. See the PDF spec on file and document structure, the different instructions etc.

If you use Python for content extraction, have a look at these packages:

pdfreader
pdfminer
pyPdf
xpdf
pdfbox
mupdf
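As a rough illustration of the "extract what you need and throw the PDF away" approach, the sketch below pulls only the text out of an uploaded file held in memory. It assumes the third-party `pdfminer.six` distribution is installed (it provides `pdfminer.high_level.extract_text`); the function name `extract_text_and_discard` and the `%PDF-` header check are assumptions added for the example, not something the packages above require.

```python
import io

def extract_text_and_discard(pdf_bytes: bytes) -> str:
    """Return the text content of a PDF held in memory; the PDF itself is not kept."""
    # Cheap sanity check (an assumption of this sketch): PDF files begin with "%PDF-".
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("upload does not look like a PDF")
    # Imported here so the sanity check works even without the dependency installed.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    return extract_text(io.BytesIO(pdf_bytes))
```

Only the returned string would be stored or processed further; images, form data or annotations would need their own extraction steps with whichever package is chosen.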

Solution 1
0 2019-12-19 19:09:29
