
How to parse PDF content to a database with PowerShell

I have a PDF document that I would like to extract content from. The problem is this: I search for the IMEI keyword and the script finds it, but I need the actual IMEI value, which is the next item in the loop.

In the PDF the value looks like this: IMEI 90289393092

The script below returns this instead: -0.1 -8.8 9.8 -0.1 446.7 403.9 Tm (IMEI:) Tj

I only want the value: 90289393092

The script I am using:

Add-Type -Path .\itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "$pwd\PDF\DOC001.pdf"

for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
 $lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
 foreach ($line in $lines) {
  if ($line -match "IMEI") { 
   $line = $line -replace "\\([\S])", $matches[1]
   $line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""

  }
 }
}

This is how to use itextsharp.dll to read a PDF as plain text:

Add-Type -Path .\itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList c:\ps\a.pdf

for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
    # Create a new text extraction strategy for each page.
    $strategy = New-Object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
    $currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy)
    # Round-trip the page text through the system default encoding and append it to $Text.
    [string[]]$Text += [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::Convert([System.Text.Encoding]::Default, [System.Text.Encoding]::UTF8, [System.Text.Encoding]::Default.GetBytes($currentText)))
}
$reader.Close()

This may be the regex you need, but I haven't tested it:

[regex]::Matches(($Text -join "`n"), '(?<=IMEI:?\s*)\d+') | Select-Object -ExpandProperty Value
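
To try the lookup without a PDF at hand, here is a minimal sketch that runs a similar pattern against a hypothetical sample. The sample lines and the variable names $sampleText, $joined and $imei are assumptions for illustration, not output from a real document. The pattern allows an optional colon after the label, because the content stream in the question shows (IMEI:), and it does not require whitespace after the digits, so a value at the very end of the text still matches.

```powershell
# Hypothetical sample standing in for the text extracted from the PDF pages.
$sampleText = @(
    'Device details'
    'IMEI: 90289393092'
    'Model X100'
)

# Join the lines into one string so the regex sees a single input.
$joined = $sampleText -join "`n"

# The lookbehind anchors on the label (colon and spacing optional);
# \d+ then matches only the digits that follow it.
$imei = [regex]::Matches($joined, '(?<=IMEI:?\s*)\d+') |
    Select-Object -ExpandProperty Value

$imei
```

The digits in 'Model X100' are ignored because they are not preceded by the IMEI label. If the extracted text puts the label and the value on separate lines, \s* still bridges the line break, since \s matches newlines as well.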
