
How to get a particular column value from a PDF using iTextSharp

I have a PDF in which data is displayed in a table. The table has multiple columns, but I only want the values of one particular column, as a list. Is this possible?

This is my code:

// Collects the extracted text of every page.
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    // Use a fresh strategy for each page so chunks from earlier pages are not carried over.
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    // No re-encoding step: the extracted string is already Unicode, and a round trip
    // through Encoding.Default can corrupt characters outside the default code page.
    text.Append(currentText);
}
pdfReader.Close();
return text.ToString();

With this code I get all of the text in the PDF, but I only want the data from one column. The column name is "Date".

This is much more complicated than you would think. A PDF document does not (always) contain structure information. It only has the instructions that a viewer needs to render the document.

Imagine something like:

go to 50, 50
use font Helvetica Bold
draw the glyph for character 'H'
go to 56, 50
draw the glyph for character 'e'

These instructions do not even need to appear in logical reading order. As a result, working out what makes up a logical table from the instructions alone is very hard.
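To make the reading-order problem concrete, here is a minimal sketch in plain C# with no PDF library involved. It assumes the glyph positions have already been collected somehow; the Glyph class, the ReadingOrder helper and the yTolerance parameter are hypothetical names introduced only for this illustration. Glyphs are grouped into lines by their y coordinate and then sorted left to right, which is roughly the idea behind a location-based extraction strategy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration: one drawn glyph and the position it was drawn at.
public sealed class Glyph
{
    public double X, Y;
    public string Text;
    public Glyph(double x, double y, string text) { X = x; Y = y; Text = text; }
}

public static class ReadingOrder
{
    // Rebuilds text lines from glyphs that may arrive in any order.
    // Glyphs whose y coordinates round to the same bucket are treated as one line.
    // PDF user space has y growing upwards, so larger y means higher on the page.
    public static List<string> ToLines(IEnumerable<Glyph> glyphs, double yTolerance = 2.0)
    {
        return glyphs
            .GroupBy(g => Math.Round(g.Y / yTolerance))   // bucket by vertical position
            .OrderByDescending(line => line.Key)           // top of the page first
            .Select(line => string.Concat(line.OrderBy(g => g.X).Select(g => g.Text)))
            .ToList();
    }
}
```

Bucketing by rounded y is a deliberate simplification: two glyphs on the same visual line can still land in different buckets when they sit close to a bucket boundary, so real code would compare baselines more carefully.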

Possible approach (if your table contains enough lines):

  • use IEventListener to be notified of PathRenderInfo and TextRenderInfo
  • gather PathRenderInfo into lines
  • gather lines into clusters if (and only if) they cross at 90° angles
  • determine number of rows and columns from such line clusters
  • assume something is a table if (and only if) it consists of enough rows and columns and has some text in it
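The geometric part of the steps above can be sketched without any PDF library. This is only an illustration under several assumptions: the line segments and text positions have already been collected (for example from PathRenderInfo and TextRenderInfo events), the page holds a single table drawn with axis-aligned ruling lines, and the header sits in the first row. Segment, TextChunk, TableColumns and the tol parameter are hypothetical names, not part of iText or iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration: a ruling line and a positioned piece of text.
public sealed class Segment
{
    public double X1, Y1, X2, Y2;
    public Segment(double x1, double y1, double x2, double y2) { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }
}

public sealed class TextChunk
{
    public double X, Y;
    public string Text;
    public TextChunk(double x, double y, string text) { X = x; Y = y; Text = text; }
}

public static class TableColumns
{
    // Returns the values below the given header, or an empty list if no such column is found.
    public static List<string> ExtractColumn(IEnumerable<Segment> segments,
        IEnumerable<TextChunk> chunks, string header, double tol = 1.0)
    {
        var segs = segments.ToList();
        // Vertical lines give the column boundaries, horizontal lines the row boundaries.
        var xs = Collapse(segs.Where(s => Math.Abs(s.X1 - s.X2) <= tol).Select(s => s.X1), tol);
        var ys = Collapse(segs.Where(s => Math.Abs(s.Y1 - s.Y2) <= tol).Select(s => s.Y1), tol);
        var result = new List<string>();
        // Require at least one column and two rows (header plus data) before calling it a table.
        if (xs.Count < 2 || ys.Count < 3) { return result; }

        int rows = ys.Count - 1;
        var cells = new Dictionary<(int Row, int Col), string>();
        foreach (var c in chunks.OrderBy(c => c.X))
        {
            int col = Band(xs, c.X);
            int band = Band(ys, c.Y);
            if (col < 0 || band < 0) { continue; }   // text outside the grid
            var key = (rows - 1 - band, col);        // row 0 is the topmost row
            cells[key] = cells.TryGetValue(key, out var old) ? old + " " + c.Text : c.Text;
        }

        // Locate the header in the first row, then read the cells underneath it.
        int target = -1;
        for (int col = 0; col < xs.Count - 1; col++)
        {
            if (cells.TryGetValue((0, col), out var h) && h == header) { target = col; break; }
        }
        if (target < 0) { return result; }
        for (int row = 1; row < rows; row++)
        {
            result.Add(cells.TryGetValue((row, target), out var v) ? v : "");
        }
        return result;
    }

    // Sorts the coordinates and merges values that lie within tol of the previous kept value.
    private static List<double> Collapse(IEnumerable<double> values, double tol)
    {
        var kept = new List<double>();
        foreach (var v in values.OrderBy(v => v))
        {
            if (kept.Count == 0 || v - kept[kept.Count - 1] > tol) { kept.Add(v); }
        }
        return kept;
    }

    // Index i such that bounds[i] <= value < bounds[i + 1], or -1 if the value is outside.
    private static int Band(List<double> bounds, double value)
    {
        for (int i = 0; i + 1 < bounds.Count; i++)
        {
            if (value >= bounds[i] && value < bounds[i + 1]) { return i; }
        }
        return -1;
    }
}
```

Calling TableColumns.ExtractColumn(segments, chunks, "Date") would then give the list the question asks for. The sketch skips the clustering step entirely, so a page with several tables, merged cells, or tables drawn without ruling lines needs more work than this.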

