Is it possible to get elements of an existing PDF document using iText?

I want to read an existing PDF document and get its elements using the iText API. For example, if a document contains a table, I want to get that table when reading the document.

Directly and easily: no.

If you're willing to put in some work: it depends.

If you're willing to put in a lot of work: yes.

Allow me to elaborate. PDF documents come in two flavors: tagged and untagged. When a PDF is tagged, the structure information is preserved. Every character belongs to a line, every line belongs to a paragraph, and tables, lists, and other structure elements know which lines and paragraphs they contain.

If you have an untagged PDF, it contains only the instructions needed to render the document. You can imagine this as:

go to position 50, 50
set the font to Arial Unicode
set the font size to 12
draw the character 'H'

This is where the amount of work comes in. If your PDF is tagged, you can use iText to extract the tagging information, which allows you to rebuild a structural concept of a PdfTable. (You can also use IEventListener to find the font that was used, the font size, and so on.)

If your PDF is untagged, you can attempt to find the structure in the rendering instructions.

This is a hard problem, and even a topic of research. Two main approaches seem to exist in current research:

  • Rule based: characters are considered part of the same line if the distance between them is smaller than a given epsilon and their y-positions are roughly the same within a given margin, and so on.
  • Neural network: render the PDF and treat the resulting image as the input for an image classification network.
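
As a rough illustration of the rule-based approach, here is a minimal Java sketch. It assumes the glyph positions have already been extracted from the rendering instructions. The Glyph class, the groupIntoLines method, and the yMargin and maxGap thresholds are hypothetical names chosen for this example and are not part of iText.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of rule-based line grouping. Nothing here is iText API.
class LineGrouper {

    /** One drawn character: left edge x, baseline y, and advance width. */
    static final class Glyph {
        final char ch;
        final double x, y, width;

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Groups glyphs into text chunks. Glyphs share a row when their baselines
     * agree within yMargin; within a row, a horizontal gap larger than maxGap
     * starts a new chunk.
     */
    static List<String> groupIntoLines(List<Glyph> glyphs, double yMargin, double maxGap) {
        // PDF y coordinates grow upward, so sort descending to read top to bottom.
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());

        // Rule 1: bucket glyphs into rows by comparing with the row's first glyph.
        List<List<Glyph>> rows = new ArrayList<>();
        for (Glyph g : byY) {
            List<Glyph> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(last.get(0).y - g.y) <= yMargin) {
                last.add(g);
            } else {
                List<Glyph> row = new ArrayList<>();
                row.add(g);
                rows.add(row);
            }
        }

        // Rule 2: read each row left to right and split on large gaps.
        List<String> lines = new ArrayList<>();
        for (List<Glyph> row : rows) {
            row.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : row) {
                if (prev != null && g.x - (prev.x + prev.width) > maxGap) {
                    lines.add(sb.toString());
                    sb.setLength(0);
                }
                sb.append(g.ch);
                prev = g;
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph('H', 50, 700.0, 7),
                new Glyph('i', 57, 700.3, 3),   // slight baseline jitter, same row
                new Glyph('A', 200, 700.0, 7),  // far to the right: separate chunk
                new Glyph('B', 50, 680.0, 7));  // lower on the page: next row
        System.out.println(groupIntoLines(glyphs, 1.0, 5.0)); // prints [Hi, A, B]
    }
}
```

Chunks separated by a large horizontal gap come out as separate entries, which is a crude first step toward spotting table cells. Real documents would need more rules than this, for instance to handle rotated text, varying font sizes, and multi-column layouts.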
