Detecting Headers and Borders in PDF Tables using PDF Clown

I am using PDF Clown's TextInfoExtractionSample to extract a PDF table into Excel, and it works except for merged cells. In the code below, the object "content" shows me the scanned content as text, XObject, or ContainerObject, but nothing for borders. Does anyone know which object represents borders in a PDF table, or how to detect whether a text is a header of the table?

    private void Extract(ContentScanner level, PrimitiveComposer composer)
    {
        if (level == null)
            return;

        while (level.MoveNext())
        {
            ContentObject content = level.Current;
        }
    }

I am using PDF Clown's TextInfoExtractionSample ...

In the below code, for object, "content" I see the scanned content as text, XObject, ContainerObject but nothing for borders.

    while(level.MoveNext())
    {
        ContentObject content = level.Current;
    }

A) Visit all content

In your loop code you removed two very important blocks from the original example:

    if(content is XObject)
    {
        // Scan the external level!
        Extract(((XObject)content).GetScanner(level), composer);
    }

and

    if(content is ContainerObject)
    {
        // Scan the inner level!
        Extract(level.ChildLevel, composer);
    }

These blocks make the sample recurse into complex objects (the XObject and ContainerObject you mention), which in turn contain their own simple content.
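Put back into the loop, the two blocks give roughly the following method. This is only a sketch assembled from the snippets above: the class name is hypothetical, the using directives are my assumption about where PDF Clown for .NET keeps these types, and the comment marks the spot where your own handling of each object would go.

```csharp
using org.pdfclown.documents.contents;             // ContentScanner (assumed namespace)
using org.pdfclown.documents.contents.composition; // PrimitiveComposer (assumed namespace)
using org.pdfclown.documents.contents.objects;     // ContentObject, ContainerObject, XObject

public class TableContentWalker // hypothetical class name
{
    public void Extract(ContentScanner level, PrimitiveComposer composer)
    {
        if (level == null)
            return;

        while (level.MoveNext())
        {
            ContentObject content = level.Current;

            // ... inspect 'content' here (text, paths, and so on) ...

            if (content is XObject)
            {
                // Scan the external level.
                Extract(((XObject)content).GetScanner(level), composer);
            }
            else if (content is ContainerObject)
            {
                // Scan the inner level.
                Extract(level.ChildLevel, composer);
            }
        }
    }
}
```

The else-if keeps an object that matches both checks from being scanned twice.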

B) Inspect all content

Anyone know what object represents borders in PDF table

Unfortunately, there is nothing like a border attribute in PDF content. Instead, borders are independent objects, usually vector graphics, either lines or very thin rectangles.

Thus, while scanning the page content (recursively, as indicated in A), you will have to look for Path instances (namespace org.pdfclown.documents.contents.objects) containing

  • moveTo m, lineTo l, and stroke S operations, or
  • rectangle re and fill f operations.

(This answer may help.)
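As a rough sketch of that check, a helper like the one below could be called for each content object inside the scanning loop. The helper name is hypothetical, and I am assuming (from memory of PDF Clown 0.1.x for .NET, so please verify against your version) that a Path exposes its operations through Objects and that each Operation exposes its keyword through Operator. It only looks at the operator keywords, not at coordinates, and it ignores fill variants such as F or f*.

```csharp
using org.pdfclown.documents.contents.objects; // Path, Operation, ContentObject

public static class PathInspector // hypothetical helper
{
    // True if the object is a Path that contains either
    // moveTo (m) + lineTo (l) + stroke (S), or rectangle (re) + fill (f).
    public static bool LooksLikeLine(ContentObject content)
    {
        Path path = content as Path;
        if (path == null)
            return false;

        bool move = false, line = false, stroke = false;
        bool rectangle = false, fill = false;

        foreach (ContentObject child in path.Objects) // assumed property
        {
            Operation operation = child as Operation;
            if (operation == null)
                continue;

            switch (operation.Operator) // assumed property, e.g. "m", "l", "S"
            {
                case "m": move = true; break;
                case "l": line = true; break;
                case "S": stroke = true; break;
                case "re": rectangle = true; break;
                case "f": fill = true; break;
            }
        }

        return (move && line && stroke) || (rectangle && fill);
    }
}
```

A match only tells you that something line-like was drawn; whether it is a table border is a separate question.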

When you come across such lines, you will have to interpret them. These lines may be borders, but they may also be underlines, page decorations, ...
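One possible interpretation rule, as a minimal sketch that does not depend on PDF Clown: treat a horizontal line as a border candidate only if some vertical line touches or crosses it, since underlines usually stand alone. The Segment type, the BorderHeuristics class, and the tolerance of one unit are all hypothetical choices for illustration; you would fill the segments from the coordinates of the paths you found.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Segment // hypothetical type: one straight line on the page
{
    public readonly double X1, Y1, X2, Y2;

    public Segment(double x1, double y1, double x2, double y2)
    {
        X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }

    public bool IsHorizontal(double tol) =>
        Math.Abs(Y1 - Y2) <= tol && Math.Abs(X1 - X2) > tol;

    public bool IsVertical(double tol) =>
        Math.Abs(X1 - X2) <= tol && Math.Abs(Y1 - Y2) > tol;
}

public static class BorderHeuristics // hypothetical helper
{
    // Returns the horizontal segments that are touched or crossed
    // by at least one vertical segment.
    public static List<Segment> BorderCandidates(
        IEnumerable<Segment> segments, double tol = 1.0)
    {
        List<Segment> all = segments.ToList();
        List<Segment> vertical = all.Where(s => s.IsVertical(tol)).ToList();

        return all
            .Where(h => h.IsHorizontal(tol)
                        && vertical.Any(v => Touches(h, v, tol)))
            .ToList();
    }

    // A vertical segment touches a horizontal one if its x lies within the
    // horizontal span and the horizontal y lies within the vertical span.
    private static bool Touches(Segment h, Segment v, double tol)
    {
        double left = Math.Min(h.X1, h.X2), right = Math.Max(h.X1, h.X2);
        double low = Math.Min(v.Y1, v.Y2), high = Math.Max(v.Y1, v.Y2);

        return v.X1 >= left - tol && v.X1 <= right + tol
            && h.Y1 >= low - tol && h.Y1 <= high + tol;
    }
}
```

Real documents will need more rules than this (dashed borders, tables without vertical lines, decorative frames), so take it as a starting point only.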

If the PDF happens to be tagged, things may be a bit easier insofar as you have to interpret less. Instead, you can read the tagging information, which may tell you where a cell starts and ends, so you do not need to interpret graphical lines. Unfortunately, untagged PDFs are still more common than tagged ones.

OR how to detect if a text is a header of the table?

Just as above, unless you happen to be inspecting a tagged PDF, nothing immediately tells you that some text is a table header. You have to interpret again. Is that text outside the lines you determined to form a table? Is it inside, at the top? Or just anywhere inside? Is it drawn in a specific font? Or larger? In a different color? Etc.
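Those questions can be turned into a simple classifier, sketched below without any PDF Clown dependency. All names (TextChunk, TableBox, HeaderHeuristics, TextRole) are hypothetical, and I am assuming coordinates in which y grows upwards, as in default PDF user space; flip the comparisons if your extracted boxes use top-down coordinates. The rule used here (first row or larger font means header candidate) is just one plausible choice.

```csharp
public enum TextRole { OutsideTable, HeaderCandidate, BodyCell }

public sealed class TextChunk // hypothetical: one piece of extracted text
{
    public string Text;
    public double X, Y;      // position of the text, y growing upwards (assumed)
    public double FontSize;
}

public sealed class TableBox // hypothetical: table area derived from border lines
{
    public double Left, Bottom, Right, Top;
    public double FirstRowBottom; // y of the line closing the first row
}

public static class HeaderHeuristics // hypothetical helper
{
    public static TextRole Classify(
        TextChunk chunk, TableBox box, double bodyFontSize)
    {
        bool inside = chunk.X >= box.Left && chunk.X <= box.Right
                   && chunk.Y >= box.Bottom && chunk.Y <= box.Top;
        if (!inside)
            return TextRole.OutsideTable;

        // Inside the table: is it in the top row, or drawn larger than the body?
        bool inFirstRow = chunk.Y >= box.FirstRowBottom;
        bool larger = chunk.FontSize > bodyFontSize;

        return (inFirstRow || larger)
            ? TextRole.HeaderCandidate
            : TextRole.BodyCell;
    }
}
```

Font name or color could be added as further fields and checks in the same way.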
