
Using Powershell to Strip Content from PDF While Keeping PDF Format

My Task: I have been attempting to perform what would be a simple task if the documents were not in PDF format. I have a bunch of PDFs that contain unwanted data before the usable data starts, i.e. everything that comes before '%PDF' in each document. I needed a script that pulls all the desired data and exports it to a new file. That part was super easy.

The Problem: The exported data appears to be formatted correctly, except the file no longer opens as a PDF. In Notepad++ it looks identical to a copy that was cleaned manually and works. Examining the raw content of the Powershell-altered PDF, however, the 'lines' appear much shorter than they should be.

$Path = 'C:\FileLocation'
$Output = '.\MyFile.pdf'
$LineArr = @()

# Read every PDF under $Path as text lines
$Target = Get-ChildItem -Path $Path -Filter *.pdf -Recurse -ErrorAction SilentlyContinue |
    Get-Content -Encoding default | Out-String -Stream

# Skip everything until the first line containing '%PDF'
$Target.Where({ $_ -like '*%PDF*' }, 'SkipUntil') | ForEach-Object {
    if ($_.Contains('%PDF')) {
        # Trim any junk that precedes '%PDF' on the header line itself
        $LineArr += '%' + $_.Split('%')[1]
    }
    else {
        $LineArr += $_
    }
}

$LineArr | Out-File -Encoding Default -FilePath $Output

I understand that the PDF format doesn't really use lines, so that may be where the problem is being created: the format is probably being broken either when the data is first read into the array or when it is written back out. Is there a way to retain the PDF's format while it is modified and then saved? I'm probably missing something simple.
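As a quick way to see the damage (a rough sketch; the path is a placeholder for a single test file), comparing the original byte count against what the text pipeline yields suggests bytes are being dropped or altered in the round-trip:

# Placeholder path to one test PDF
$file = 'C:\FileLocation\Sample.pdf'

# Byte count of the file on disk
$rawBytes = [System.IO.File]::ReadAllBytes($file).Length

# Character count after the same text-mode read used above
$textChars = (Get-Content -Path $file -Encoding default | Out-String).Length

"Raw bytes : $rawBytes"
"Text chars: $textChars"    # a mismatch hints the decode/re-encode is lossy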

So I was about to start looking at iTextSharp, but decided to first give an older language a try: Winbatch. (bleh!) I almost made a screen scraper to do the work, but the shame of taking that route got the better of me. So the function library was the next stop.

This is just a little blurb I spit out, with no error checking or logging at this point; all of that will be added in along with file searches later. All in all, it manages to clear all the unwanted extras from the PDF while keeping the exact format that PDFs require.

strPDFdoco = "C:\TestPDFs\Test.pdf"
strPDFString = "%%PDF"
strPDFendString = "%%%%END"
If FileExist(strPDFdoco)
        strPDFName = ItemExtract(-1, strPDFdoco, "\")
        strFixedPDFFullPath = ("C:\TestPDF\Fixed\": strPDFName)
        strCurrentPDFFileSize = FileSize(strPDFdoco) ; Get size of PDF file

        hndOldPDFFile = BinaryAlloc(strCurrentPDFFileSize) ; Allocate memory for reading PDF file
        BinaryRead(hndOldPDFFile, strPDFdoco) ; Read PDF file
        strStartIndex = BinaryIndexEx(hndOldPDFFile, 0, strPDFString, @FWDSCAN, @FALSE) ; Find start point for copy
        strEndIndex = BinaryEodGet(hndOldPDFFile) ; find eof
        strCount = strEndIndex - strStartIndex

        strWritePDF = BinaryWriteEx( hndOldPDFFile, strStartIndex, strFixedPDFFullPath, 0, strCount)
        BinaryFree(hndOldPDFFile)
    ENDIF

Now that I have an idea of how this works, making a tool to do this in PS sounds more doable. There's a PS function out there in the wild called Get-HexDump that might be a good base to educate myself on bits and hex in PS. Since this works in Winbatch, I assume there is some sort of equivalent in AutoIt, and it could be reproduced in most basic languages.
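For anyone who wants a head start, here's a minimal byte-level sketch of what a PS version might look like (untested; paths are placeholders): read the file as raw bytes, find the byte offset of '%PDF', and write everything from there to the end.

# Placeholder input and output paths
$source = 'C:\TestPDFs\Test.pdf'
$dest   = 'C:\TestPDF\Fixed\Test.pdf'

# Read the whole file as raw bytes so nothing gets decoded or re-encoded
$bytes = [System.IO.File]::ReadAllBytes($source)

# Latin-1 maps every byte to a character 1:1, so IndexOf on the decoded
# string gives the true byte offset of the header
$latin1 = [System.Text.Encoding]::GetEncoding(28591)
$start  = $latin1.GetString($bytes).IndexOf('%PDF')

if ($start -ge 0) {
    # Copy from the header to end-of-file and write it back byte-for-byte
    $fixed = New-Object byte[] ($bytes.Length - $start)
    [System.Array]::Copy($bytes, $start, $fixed, 0, $fixed.Length)
    [System.IO.File]::WriteAllBytes($dest, $fixed)
}

Because nothing is ever treated as text lines, the PDF's internal structure comes through untouched, which is the same trick the Winbatch version pulls off with its binary buffer.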

There appear to be a lot of people out there trying to clear crud from before the header and after the end of their PDF docos, so hopefully this helps. I've got a half mill to hit with whatever script I morph this into. I might update with a PS version if I decide to go that route again, and if I remember.
