Parsing Complex PDF document with C#

See the attached K-1 document. I have tried numerous tweaks with the iTextSharp library but have not succeeded in loading the data correctly.

Ideally I would like to parse the document the way a human would read it: one text box at a time, reading each box's contents.

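        // Needs: using System; using System.Text;
        //        using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
        // Open the password-protected file (FILE and password are defined elsewhere).
        var reader = new PdfReader(FILE, Encoding.ASCII.GetBytes(password));
        // Extract the text of page 1, ordered by position on the page.
        var strategy = new LocationTextExtractionStrategy();
        string currentPageText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
        // Split the page text into individual lines.
        string[] lines = currentPageText.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);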

I also tried playing with annotation parsing but had no luck.

I'm a newbie and am probably looking in the wrong place. Can you help guide me in the right direction?

Thanks a lot.


The first question is whether this form is electronic or scanned. The latter would make the data extraction much harder, as it would also involve OCR.

If you have an electronic PDF and all your forms share the same layout, why not use the following strategy:

  • store the coordinates of each "box" in a config file
  • process the documents and extract the text from every "box" (i.e. region)
  • post-process the extracted text with regular expressions to separate the name from the address (or set the region so that the text is read line by line)

If you have a few variations of the form, you can check the very first box to extract the name of the form and load the appropriate settings file (the one that contains the set of regions for that variation).

This approach should work with any PDF library.
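As a rough illustration of the config-file and regular-expression steps, here is a minimal, library-independent sketch. Everything in it is an assumption made for the example: the `name=x,y,width,height` config format, the `BoxRegion` and `FormRegions` names, the sample coordinates, and the heuristic that the address starts at the first line beginning with a digit. The actual text extraction for each region is left to whichever PDF library you use.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical record for this sketch: a named rectangle in page coordinates.
public sealed class BoxRegion
{
    public string Name;
    public float X, Y, Width, Height;
}

public static class FormRegions
{
    // Parses lines such as "PartnerNameAddress=36,540,270,60".
    // The "name=x,y,width,height" format is an assumption, not a standard.
    public static List<BoxRegion> Parse(IEnumerable<string> configLines)
    {
        var regions = new List<BoxRegion>();
        foreach (var raw in configLines)
        {
            var line = raw.Trim();
            if (line.Length == 0 || line[0] == '#') continue; // skip blanks and comments
            var parts = line.Split('=');
            var nums = parts[1].Split(',');
            regions.Add(new BoxRegion
            {
                Name = parts[0].Trim(),
                X = float.Parse(nums[0], CultureInfo.InvariantCulture),
                Y = float.Parse(nums[1], CultureInfo.InvariantCulture),
                Width = float.Parse(nums[2], CultureInfo.InvariantCulture),
                Height = float.Parse(nums[3], CultureInfo.InvariantCulture)
            });
        }
        return regions;
    }

    // Splits the text of one box into (name, address), assuming the address
    // begins at the first line that starts with a digit (a street number).
    public static Tuple<string, string> SplitNameAndAddress(string boxText)
    {
        var m = Regex.Match(boxText, @"^[ \t]*\d", RegexOptions.Multiline);
        if (!m.Success) return Tuple.Create(boxText.Trim(), "");
        return Tuple.Create(boxText.Substring(0, m.Index).Trim(),
                            boxText.Substring(m.Index).Trim());
    }
}
```

Each parsed region would then be handed to the library's region-restricted text extraction, and one such config file could be kept per form variation.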

You would like to parse the document the way a human would read it, one text box at a time, reading its contents. That means you first have to recognize those text boxes automatically. Then you can extract the text by these areas.

To recognize those text boxes automatically in your document, you have to extract the border lines enclosing the boxes. For this you first have to find out how those border lines are created. They might be drawn using vector graphics as lines or rectangles, but they could also be part of a background bitmap image.

Unfortunately I don't have your IRS form at hand and so cannot analyze its internals. Let's assume for now that the borders are created using vector graphics. Thus, you have to extract vector graphics.

To extract vector graphics with iText(Sharp), you use classes from the iText(Sharp) parser namespace: they parse the document and feed the parsing events into a listener you create, which collects the vector graphics operations:

  • You implement IExtRenderListener, in particular its ModifyPath and RenderPath methods, which are called respectively when additional path elements (e.g. lines or rectangles) are added to the current path and when the current path is rendered (stroked? filled?). Your implementation collects this information.
  • You parse your document into an instance of your listener, e.g. using PdfReaderContentParser.
  • You analyse the lines and rectangles found and derive the coordinates of the boxes they form.
  • You parse the same page into a LocationTextExtractionStrategy instance.
  • You retrieve the text of each recognized text box by calling LocationTextExtractionStrategy.GetResultantText with a matching ITextChunkFilter argument for that box.

(Actually, as a small optimization, you can parse into your listener instance and the LocationTextExtractionStrategy instance in a single pass.)

All the iText(Sharp)-specific tasks are trivial, and the only other task, analysing the lines and rectangles found to derive the coordinates of the boxes, should be no big problem for a software developer proficient in C#.
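To make that analysis step a little more concrete, here is a minimal sketch that does not depend on iText(Sharp). It assumes your listener has already collected rectangle operands as `x, y, width, height` arrays and transformed them into page coordinates. The `Box` and `BoxFinder` names, the `minSide` and `tolerance` thresholds, and the rule that thin rectangles are separator lines rather than boxes are all assumptions for illustration; boxes assembled from individual line segments would need additional logic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical value type for this sketch; PDF user space, y grows upward.
public struct Box
{
    public float Left, Bottom, Right, Top;

    public Box(float left, float bottom, float right, float top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    public float Width { get { return Right - Left; } }
    public float Height { get { return Top - Bottom; } }

    // True if the point lies inside the box (borders included).
    public bool Contains(float x, float y)
    {
        return x >= Left && x <= Right && y >= Bottom && y <= Top;
    }
}

public static class BoxFinder
{
    // Turns raw rectangle operands (x, y, width, height; width and height
    // may be negative) into candidate text boxes in reading order.
    public static List<Box> FromRectangles(IEnumerable<float[]> rects,
                                           float minSide = 10f,
                                           float tolerance = 0.5f)
    {
        var boxes = new List<Box>();
        foreach (var r in rects)
        {
            float x1 = r[0], y1 = r[1], x2 = r[0] + r[2], y2 = r[1] + r[3];
            var box = new Box(Math.Min(x1, x2), Math.Min(y1, y2),
                              Math.Max(x1, x2), Math.Max(y1, y2));

            // Assumption: very thin rectangles are rules, not boxes.
            if (box.Width < minSide || box.Height < minSide) continue;

            // Skip near-duplicates, e.g. a frame that is filled and then stroked.
            bool duplicate = boxes.Any(b =>
                Math.Abs(b.Left - box.Left) <= tolerance &&
                Math.Abs(b.Bottom - box.Bottom) <= tolerance &&
                Math.Abs(b.Right - box.Right) <= tolerance &&
                Math.Abs(b.Top - box.Top) <= tolerance);
            if (!duplicate) boxes.Add(box);
        }
        // Reading order: top to bottom, then left to right.
        return boxes.OrderByDescending(b => b.Top).ThenBy(b => b.Left).ToList();
    }
}
```

A text chunk filter for one box could then accept a chunk when `Contains` returns true for the chunk's start position.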

Take a look at the IvyPdf library and template editor. It uses C# and provides high-level functions to parse and extract data, so you don't have to deal with the internals of PDF documents. You can build fairly complex scenarios with it.

I don't think it can read annotations, though.
