Best way to disarm user-uploaded PDFs

Question

I am accepting PDFs as user input. The uploaded PDFs do not need to contain any content types that could be used maliciously, such as JavaScript (/JS) or additional actions (/AA). For example, this is what a clean PDF should look like (inspected using Didier Stevens' PDFiD):

    <Keyword Count="59" HexcodeCount="0" Name="obj"/>
    <Keyword Count="59" HexcodeCount="0" Name="endobj"/>
    <Keyword Count="19" HexcodeCount="0" Name="stream"/>
    <Keyword Count="19" HexcodeCount="0" Name="endstream"/>
    <Keyword Count="2" HexcodeCount="0" Name="xref"/>
    <Keyword Count="2" HexcodeCount="0" Name="trailer"/>
    <Keyword Count="2" HexcodeCount="0" Name="startxref"/>
    <Keyword Count="12" HexcodeCount="0" Name="/Page"/>
    <Keyword Count="0" HexcodeCount="0" Name="/Encrypt"/>
    <Keyword Count="0" HexcodeCount="0" Name="/ObjStm"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JS"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JavaScript"/>
    <Keyword Count="0" HexcodeCount="0" Name="/AA"/>
    <Keyword Count="0" HexcodeCount="0" Name="/OpenAction"/>
    <Keyword Count="0" HexcodeCount="0" Name="/AcroForm"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JBIG2Decode"/>
    <Keyword Count="0" HexcodeCount="0" Name="/RichMedia"/>
    <Keyword Count="0" HexcodeCount="0" Name="/Launch"/>
    <Keyword Count="0" HexcodeCount="0" Name="/EmbeddedFile"/>
    <Keyword Count="0" HexcodeCount="0" Name="/XFA"/>
    <Keyword Count="0" HexcodeCount="0" Name="/Colors &gt; 2^24"/>

My current validation looks at all the keyword counts and rejects the PDF if any count from "/Encrypt" down is non-zero.
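For reference, the rule described above could be sketched roughly as follows. This is only an illustration: the `flagged_keywords` helper, the `STRUCTURAL` set and the `<Keywords>` wrapper element are assumptions made here so that a fragment like the one above parses on its own; they are not part of PDFiD.

```python
import xml.etree.ElementTree as ET

# Keywords that every PDF is expected to contain (the rows above /Encrypt).
STRUCTURAL = {"obj", "endobj", "stream", "endstream",
              "xref", "trailer", "startxref", "/Page"}


def flagged_keywords(keywords_xml: str) -> dict:
    """Return {name: count} for each non-structural keyword with a non-zero count."""
    # Hypothetical wrapper element so the bare <Keyword .../> fragment is valid XML.
    root = ET.fromstring("<Keywords>" + keywords_xml + "</Keywords>")
    flagged = {}
    for keyword in root.iter("Keyword"):
        name = keyword.get("Name")
        count = int(keyword.get("Count", "0"))
        if name not in STRUCTURAL and count != 0:
            flagged[name] = count
    return flagged
```

An empty result means the file passes the check; anything else lists the keywords that caused the rejection.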

I believe that "print to PDF" converters sometimes add some of these content types, depending on the software used. As a result, I am currently rejecting PDFs even when the suspicious content is actually innocent. Of course I have no way to determine whether the JS is innocent, so I'd like to disarm it and continue with the file.

Is there a way to take a PDF in memory and automatically disarm / defuse it, overwriting the previous file? I would like to do something like this:

    SuspectPDF = request.FILES['docfile'][0]
    CleanPDF = disarmPDF(SuspectPDF)

I know that PDFiD has a disarm function, but I'm not sure it can do what I want in memory. I would like to know whether there is a more commonly used solution for validating user-uploaded PDFs, and whether there is anything else to be aware of here.
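To make the idea concrete, a naive in-memory disarm might look like the sketch below. It is not PDFiD's implementation: `disarm_pdf` and `RISKY_NAMES` are hypothetical names chosen for illustration. PDF names are case-sensitive, so swapping the case of a risky name turns it into a key that readers do not recognise, and because the replacement has the same length, the byte offsets in the xref table stay valid. The sketch works on raw bytes only, so it would miss names hidden in compressed object streams or written with `#xx` escapes, and it could also alter matching bytes inside uncompressed stream data.

```python
import re

# Illustrative list of names to neutralise, taken from the report above.
RISKY_NAMES = [b"/JavaScript", b"/JS", b"/AA", b"/OpenAction", b"/Launch"]

# Match a risky name only when it is followed by whitespace, a PDF delimiter
# or the end of the data, so that /JS does not match inside a longer name.
_RISKY = re.compile(
    b"(" + b"|".join(re.escape(name) for name in RISKY_NAMES) + b")"
    + rb"(?=[\s/<>\[\](){}%]|$)"
)


def disarm_pdf(data: bytes) -> bytes:
    """Return a copy of data with each risky name case-swapped (same length)."""
    return _RISKY.sub(lambda match: match.group(1).swapcase(), data)
```

The uploaded file would first have to be read into a `bytes` object, and the result written back in place of the original.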

Answer 1

The best way is to extract all the content, markup and instructions you need (text, images, form data, annotations, fonts, etc.) and throw the PDF away.

A keyword-based solution will not work, because every PDF (even a potentially armed one) will definitely contain some keywords (such as xref, obj/endobj, etc.) and may lack others. See the PDF specification on file structure, document structure and the different instructions.
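A further reason plain keyword matching is fragile, beyond the point above: the PDF specification allows any character in a name to be written as `#` followed by two hex digits, so `/J#61vaScript` is the same name as `/JavaScript`. The sketch below shows a plain substring search missing such a name until the escapes are decoded. `normalize_name_escapes` is a hypothetical helper, and it naively decodes `#xx` everywhere in the input rather than only inside name tokens.

```python
import re


def normalize_name_escapes(raw: bytes) -> bytes:
    """Decode every #xx hex escape in raw into the byte it stands for."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda match: bytes([int(match.group(1), 16)]),
        raw,
    )


# A dictionary fragment whose risky names are written with hex escapes.
obfuscated = b"<< /S /J#61vaScript /J#53 (app.alert(1)) >>"

found_before = b"/JavaScript" in obfuscated                         # False
found_after = b"/JavaScript" in normalize_name_escapes(obfuscated)  # True
```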

If you use Python for content extraction, have a look at the packages:
