Best way to disarm user-uploaded PDFs

I am accepting PDFs as user input. I know that the uploaded PDFs should not, and do not need to, contain any content types that could be used maliciously, such as JS or AA. For example, this is what a clean PDF should look like (inspected using Didier Stevens's PDFiD):

    <Keyword Count="59" HexcodeCount="0" Name="obj"/>
    <Keyword Count="59" HexcodeCount="0" Name="endobj"/>
    <Keyword Count="19" HexcodeCount="0" Name="stream"/>
    <Keyword Count="19" HexcodeCount="0" Name="endstream"/>
    <Keyword Count="2" HexcodeCount="0" Name="xref"/>
    <Keyword Count="2" HexcodeCount="0" Name="trailer"/>
    <Keyword Count="2" HexcodeCount="0" Name="startxref"/>
    <Keyword Count="12" HexcodeCount="0" Name="/Page"/>
    <Keyword Count="0" HexcodeCount="0" Name="/Encrypt"/>
    <Keyword Count="0" HexcodeCount="0" Name="/ObjStm"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JS"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JavaScript"/>
    <Keyword Count="0" HexcodeCount="0" Name="/AA"/>
    <Keyword Count="0" HexcodeCount="0" Name="/OpenAction"/>
    <Keyword Count="0" HexcodeCount="0" Name="/AcroForm"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JBIG2Decode"/>
    <Keyword Count="0" HexcodeCount="0" Name="/RichMedia"/>
    <Keyword Count="0" HexcodeCount="0" Name="/Launch"/>
    <Keyword Count="0" HexcodeCount="0" Name="/EmbeddedFile"/>
    <Keyword Count="0" HexcodeCount="0" Name="/XFA"/>
    <Keyword Count="0" HexcodeCount="0" Name="/Colors &gt; 2^24"/>

My current input validation looks at all the content types and rejects the PDF if any count from "/Encrypt" down is nonzero.
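The check described above could be sketched roughly as follows. This is only an illustration: it assumes the `Keyword` elements from PDFiD's XML output are available as a string (wrapped here in a made-up `<Keywords>` root), and the function name `suspicious_keywords` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Names treated as suspicious: everything from /Encrypt down in the listing above.
SUSPICIOUS = {
    "/Encrypt", "/ObjStm", "/JS", "/JavaScript", "/AA", "/OpenAction",
    "/AcroForm", "/JBIG2Decode", "/RichMedia", "/Launch", "/EmbeddedFile",
    "/XFA", "/Colors > 2^24",
}


def suspicious_keywords(pdfid_xml: str) -> list:
    """Return the suspicious names whose Count is nonzero (hypothetical helper)."""
    root = ET.fromstring(pdfid_xml)
    found = []
    for keyword in root.iter("Keyword"):
        name = keyword.get("Name")
        count = int(keyword.get("Count", "0"))
        if name in SUSPICIOUS and count != 0:
            found.append(name)
    return found


# Assumed wrapper element around the Keyword entries, for illustration only.
sample = """<Keywords>
    <Keyword Count="59" HexcodeCount="0" Name="obj"/>
    <Keyword Count="12" HexcodeCount="0" Name="/Page"/>
    <Keyword Count="1" HexcodeCount="0" Name="/JS"/>
    <Keyword Count="0" HexcodeCount="0" Name="/OpenAction"/>
</Keywords>"""

offenders = suspicious_keywords(sample)
print("reject" if offenders else "accept", offenders)
```

An empty list means the file passes this check; anything else leads to rejection.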

I believe that when people hit "print to PDF", some of these content types sometimes get added, depending on the converter software used. So I am currently rejecting PDFs even when the suspicious content type is actually innocent. Of course, I have no way to determine whether the JS is innocent, but I'd like to disarm the JS and continue with the file.

Is there a way to take a PDF in memory, automatically disarm/defuse it, and overwrite the previous file? I would like to do something like this:

    # disarmPDF is the function I am looking for; it does not exist yet
    SuspectPDF = request.FILES['docfile'][0]
    CleanPDF = disarmPDF(SuspectPDF)

I know that PDFiD has a disarm function, but I'm not sure it can accomplish what I want in memory. I would like to know whether there is another, more commonly used solution for validating user-uploaded PDFs, and whether there is anything else to be aware of here.
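For illustration, an in-memory disarm can be sketched as a byte-level rewrite. As far as I understand, PDFiD's disarm option works along similar lines by changing the case of names such as /JS so that a reader no longer recognizes them, but the code below is not PDFiD's implementation. The name `disarm_pdf_bytes` and the list of names are assumptions made for this sketch.

```python
import re

# Names to neutralize; an assumed selection, not an exhaustive list.
RISKY_NAMES = (b"/JavaScript", b"/JS", b"/AA", b"/OpenAction", b"/Launch")

# Match a risky name only when it is not the prefix of a longer name.
_PATTERN = re.compile(
    b"(" + b"|".join(re.escape(name) for name in RISKY_NAMES) + b")(?![A-Za-z0-9])"
)


def disarm_pdf_bytes(data: bytes) -> bytes:
    """Swap the case of risky names in place (hypothetical helper).

    The replacement has the same length as the original name, so byte
    offsets in the xref table stay valid.
    """
    return _PATTERN.sub(lambda match: match.group(1).swapcase(), data)


raw = b"<< /Type /Catalog /OpenAction 5 0 R /JS (app.alert(1)) >>"
clean = disarm_pdf_bytes(raw)
print(clean)
```

In a Django view, the bytes could come from the uploaded file's `read()` method. Note that this naive approach misses names hidden inside compressed object streams or written with hex escapes, so it should be treated as a sketch rather than a complete defense.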

The best way is to extract all the content, markup, and instructions you need (text, images, form data, annotations, fonts, etc.) and throw the PDF away.

A keyword-based solution will not work, because every PDF (even a potentially armed one) will definitely contain some keywords (such as xref and obj/endobj) and may lack others. See the PDF spec on file and document structure, the different instructions, etc.
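One further reason keyword matching is fragile, offered here as an illustration rather than something stated above: the PDF format allows characters in a name to be written as `#xx` hex escapes, which is what the HexcodeCount column in PDFiD's output tracks. A plain substring search misses such names. The helper name `normalize_names` is hypothetical, and for brevity the sketch decodes escapes anywhere in the data rather than only inside names.

```python
import re

# A '#' followed by two hex digits is an escaped character inside a PDF name.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def normalize_names(data: bytes) -> bytes:
    """Decode #xx escapes so that /J#61vaScript becomes /JavaScript."""
    return _HEX_ESCAPE.sub(lambda match: bytes([int(match.group(1), 16)]), data)


obfuscated = b"<< /S /J#61vaScript /JS (app.alert(1)) >>"

naive_hit = b"/JavaScript" in obfuscated
normalized_hit = b"/JavaScript" in normalize_names(obfuscated)
print(naive_hit, normalized_hit)
```

Even with this normalization, keywords inside compressed streams remain invisible to a byte scan.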

If you use Python for content extraction, have a look at the PDF-parsing packages available for it.
