Using PowerShell to Strip Content from PDF

Using PowerShell to Strip Content from PDF While Keeping PDF Format.

My Task: I have been attempting to perform what would be a simple task if the documents were not in PDF format. I have a bunch of PDFs that have unwanted data before the bulk of usable data starts; this is anything that comes before '%PDF' in the documents. I needed a script that pulls all the desired data and exports it to a new file. That part was super easy.

The Problem: The exported data appears to be formatted correctly, except it no longer opens as a PDF. I can open it in Notepad++ and it looks identical to one that was cleaned manually and works. Examining the raw content of the PowerShell-altered PDF, the 'lines' appear to be much shorter than they should be.

$Path = 'C:\FileLocation'
$Output = '.\MyFile.pdf'
$LineArr = @()

$Target = Get-ChildItem -Path $Path -Filter *.pdf -Recurse -ErrorAction SilentlyContinue | Get-Content -Encoding default | Out-String -stream


$Target.Where({ $_ -like '*%PDF*' }, 'SkipUntil') | ForEach-Object{
    If ($_.contains('%PDF')){
        $LineArr += "%" + $_.Split('%')[1]
    }
    else{
        $LineArr += $_
    }
}

$LineArr | Out-File -Encoding Default -FilePath $Output

I understand the PDF format doesn't really use lines, so that might be where the problem is being created. The PDF format is probably being broken either when the data is initially put into an array or when it's written out. Is there a way to retain the format of the PDF while it is modified and then saved? It's probably the case that I'm missing something simple.

So I was about to start looking at iTextSharp and decided to give an older language a try first: Winbatch. (bleh!) I almost made a screen scraper to do the work, but the shame of taking that route got the better of me. So, the function library was the next stop.

This is just a little blurb I spit out with no error checking or logging going on at this point. All of that will be added later, along with file searches. All in all, it manages to clear all the unwanted extras from the PDF while keeping the exact format that PDFs require.

strPDFdoco = "C:\TestPDFs\Test.pdf"
strPDFString = "%%PDF"
strPDFendString = "%%%%END"
If FileExist(strPDFdoco)
        strPDFName = ItemExtract(-1, strPDFdoco, "\")
        strFixedPDFFullPath = ("C:\TestPDF\Fixed\": strPDFName)
        strCurrentPDFFileSize = FileSize(strPDFdoco) ; Get size of PDF file

        hndOldPDFFile = BinaryAlloc(strCurrentPDFFileSize) ; Allocate memory for reading PDF file
        BinaryRead(hndOldPDFFile, strPDFdoco) ; Read PDF file
        strStartIndex = BinaryIndexEx(hndOldPDFFile, 0, strPDFString, @FWDSCAN, @FALSE) ; Find start point for copy
        strEndIndex = BinaryEodGet(hndOldPDFFile) ; find eof
        strCount = strEndIndex - strStartIndex

        strWritePDF = BinaryWriteEx( hndOldPDFFile, strStartIndex, strFixedPDFFullPath, 0, strCount)
        BinaryFree(hndOldPDFFile)
EndIf

Now that I have an idea how this works, making a tool to do this in PS sounds more doable. There's a PS function out there in the wild called Get-HexDump that might be a good base for educating myself on bits and hex in PS. Since this works in Winbatch, I assume there is some sort of equivalent in AutoIt, and it could be reproduced in most basic languages.
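For anyone who wants to stay in PS, here is a rough sketch of the same byte-level idea as the Winbatch snippet: read the file as raw bytes, find the first '%PDF', and write everything from there onward. The function name Remove-PdfLeadingJunk and its parameters are made up for illustration, and it assumes the file fits in memory and that '%PDF' does not also occur inside the junk. Treat it as a starting point, not a tested tool.

```powershell
# Hypothetical helper (name and parameters are illustrative, not from the post).
function Remove-PdfLeadingJunk {
    param(
        [Parameter(Mandatory)] [string] $Path,         # use full paths: .NET file calls
        [Parameter(Mandatory)] [string] $Destination   # ignore the PowerShell location
    )

    # Read raw bytes; nothing is decoded as text or split into lines.
    $bytes = [System.IO.File]::ReadAllBytes($Path)

    # ISO-8859-1 maps every byte to exactly one char, so a string index
    # found here is also the byte offset in the file.
    $latin1 = [System.Text.Encoding]::GetEncoding('ISO-8859-1')
    $start  = $latin1.GetString($bytes).IndexOf('%PDF', [System.StringComparison]::Ordinal)
    if ($start -lt 0) { throw "No %PDF header found in '$Path'." }

    # Copy everything from the header to the end of the file, byte for byte.
    $count = $bytes.Length - $start
    $clean = New-Object byte[] $count
    [System.Array]::Copy($bytes, $start, $clean, 0, $count)
    [System.IO.File]::WriteAllBytes($Destination, $clean)
}

# Example call, reusing the paths from the Winbatch snippet:
# Remove-PdfLeadingJunk -Path 'C:\TestPDFs\Test.pdf' -Destination 'C:\TestPDF\Fixed\Test.pdf'
```

The point of going through ReadAllBytes and WriteAllBytes is that the data never passes through a text pipeline, so no bytes are re-encoded and no line endings are rewritten.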

There appear to be a lot of people out there trying to clear crud from before the header and after the end of their PDF documents. Hopefully this helps; I've got half a million files to hit with whatever script I morph this into. I might update with a PS version if I decide to go that route again, and if I remember.
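For the other half of the problem, crud after the end of the PDF, a similar sketch can cut the file right after the last '%%EOF' marker. Again, the name Remove-PdfTrailingJunk is hypothetical. It assumes the trailing junk does not itself contain '%%EOF', and it also drops any end-of-line characters that followed the marker, which may or may not suit every reader application.

```powershell
# Hypothetical helper (name and parameters are illustrative, not from the post).
function Remove-PdfTrailingJunk {
    param(
        [Parameter(Mandatory)] [string] $Path,         # full paths recommended
        [Parameter(Mandatory)] [string] $Destination
    )

    $bytes  = [System.IO.File]::ReadAllBytes($Path)
    $latin1 = [System.Text.Encoding]::GetEncoding('ISO-8859-1')   # 1 byte = 1 char
    $marker = '%%EOF'

    # Use the LAST marker: a PDF saved with incremental updates
    # can contain several '%%EOF' markers.
    $last = $latin1.GetString($bytes).LastIndexOf($marker, [System.StringComparison]::Ordinal)
    if ($last -lt 0) { throw "No %%EOF marker found in '$Path'." }

    # Keep everything up to and including the marker, byte for byte.
    $count = $last + $marker.Length
    $clean = New-Object byte[] $count
    [System.Array]::Copy($bytes, 0, $clean, 0, $count)
    [System.IO.File]::WriteAllBytes($Destination, $clean)
}
```

Running this on the output of a leading-junk pass should leave only the span from '%PDF' through the final '%%EOF'.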
