
Unable to read text in a specific location in a pdf file using iTextSharp

I have been asked to read the text of a PDF and do some processing after extracting it. I'm using iTextSharp to read the PDF. The problem is that PdfTextExtractor.GetTextFromPage doesn't give me all the contents of the page. For example:

[image: screenshot of the PDF page, not preserved]

In the above PDF I'm unable to read the texts that are highlighted in blue. The rest of the characters I am able to read. Below is the code that does the extraction:

    string filePath = "myFile path";
    PdfReader pdfReader = new PdfReader(filePath);
    for (int page = 1; page <= 1; page++)
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    }

Any suggestions?

I have gone through lots of questions and solutions on SO, but found none specific to this problem.

The reason text extraction does not return those texts is pretty simple: those texts are not part of the static page content but form fields. "Text extraction" in iText (and in other PDF libraries I know, too) is taken to mean "extraction of the text of the static page content". Thus, the texts you are missing simply are not subject to text extraction.
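One way to confirm that the missing texts are form field values is to list the form fields directly. The following is only a minimal sketch, assuming iTextSharp 5.x and an AcroForm-based form; the class name `ListFormFields` and the file path are placeholders, not part of the original code.

```csharp
using System;
using iTextSharp.text.pdf;

class ListFormFields
{
    static void Main()
    {
        // Placeholder path, as in the question; replace with the actual file.
        string filePath = "myFile path";

        PdfReader pdfReader = new PdfReader(filePath);
        try
        {
            // AcroFields exposes the interactive form of the document, if any.
            AcroFields form = pdfReader.AcroFields;

            // Fields is keyed by the fully qualified field name.
            foreach (string fieldName in form.Fields.Keys)
            {
                Console.WriteLine("{0} = {1}", fieldName, form.GetField(fieldName));
            }
        }
        finally
        {
            pdfReader.Close();
        }
    }
}
```

If the blue-highlighted values show up in this listing, they are form field values rather than static page content.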

If you want to make form field values subject to your text extraction code, too, you first have to flatten the form field visualizations. "Flattening" here means making them part of the static page content and dropping all their form field dynamics.

You can do that by adding, after the line that reads the PDF,

PdfReader pdfReader = new PdfReader(filePath);

code that flattens this PDF and loads the flattened PDF into pdfReader, e.g. like this:

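// Flatten the form into an in-memory copy of the PDF.
MemoryStream memoryStream = new MemoryStream();
PdfStamper pdfStamper = new PdfStamper(pdfReader, memoryStream);
pdfStamper.FormFlattening = true;
// Keep the MemoryStream open when the stamper is closed so it can be read again.
pdfStamper.Writer.CloseStream = false;
pdfStamper.Close();

// Rewind the stream and re-read the flattened PDF.
memoryStream.Position = 0;
pdfReader = new PdfReader(memoryStream);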

Extracting the text from this re-initialized pdfReader will give you the text from the form fields, too.

Unfortunately, the flattened form text is added at the end of the content stream. As your chosen text extraction strategy, SimpleTextExtractionStrategy, simply returns the text in the order it is drawn, the contents of the former form fields are all extracted at the end.

You can change this by using a different text extraction strategy, i.e. by replacing this line:

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

  • Using the LocationTextExtractionStrategy (which is part of the iText distribution) already returns a better result; unfortunately, the form field values are not exactly on the same baseline as the static content we perceive to be on the same line, so there are some unexpected line breaks.

     ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
  • Using the HorizontalTextExtractionStrategy (from this answer, which contains both a Java and a C# version of it) the result is even better. Beware, though: this strategy is not universally better, so read the warnings in the answer text.

     ITextExtractionStrategy strategy = new HorizontalTextExtractionStrategy();
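Putting the pieces together, the flattening step and the strategy change might be combined roughly as follows. This is a sketch only, assuming iTextSharp 5.x; the class name `FlattenAndExtract` and the helper `Flatten` are made up for illustration, and LocationTextExtractionStrategy is used because it ships with iText, whereas HorizontalTextExtractionStrategy would have to be copied from the referenced answer.

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class FlattenAndExtract
{
    // Hypothetical helper: returns a reader over a flattened in-memory copy.
    static PdfReader Flatten(PdfReader source)
    {
        MemoryStream memoryStream = new MemoryStream();
        PdfStamper pdfStamper = new PdfStamper(source, memoryStream);
        pdfStamper.FormFlattening = true;
        // Keep the stream open so it can be read after the stamper is closed.
        pdfStamper.Writer.CloseStream = false;
        pdfStamper.Close();

        memoryStream.Position = 0;
        return new PdfReader(memoryStream);
    }

    static void Main()
    {
        // Placeholder path, as in the question; replace with the actual file.
        string filePath = "myFile path";

        PdfReader pdfReader = Flatten(new PdfReader(filePath));
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // A fresh strategy per page, since a strategy accumulates text.
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            Console.WriteLine(currentPageText);
        }
        pdfReader.Close();
    }
}
```

With this strategy the former form field values should appear near the static text they belong to, subject to the baseline caveat mentioned above.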
