
Detecting Headers and Borders in PDF Tables using PDF Clown

I am using PDF Clown's TextInfoExtractionSample to extract a PDF table into Excel, and it works except for merged cells. In the code below, the object "content" shows up as text, XObject, or ContainerObject during the scan, but never as anything representing borders. Does anyone know which object represents borders in a PDF table, or how to detect whether a piece of text is a table header?

    private void Extract(ContentScanner level, PrimitiveComposer composer)
    {
        if(level == null)
            return;
        while(level.MoveNext())
        {
            ContentObject content = level.Current;
        }
    }

I am using PDF Clown's TextInfoExtractionSample ...

In the below code, for object, "content" I see the scanned content as text, XObject, ContainerObject but nothing for borders.

    while(level.MoveNext())
    {
        ContentObject content = level.Current;
    }

A) Visit all content

In your loop you removed two very important blocks from the original sample:

    if(content is XObject)
    {
        // Scan the external level!
        Extract(((XObject)content).GetScanner(level), composer);
    }

and

    if(content is ContainerObject)
    {
        // Scan the inner level!
        Extract(level.ChildLevel, composer);
    }

These blocks make the sample recurse into complex objects (the XObject and ContainerObject instances you mention), which in turn contain their own simple content.

B) Inspect all content

Anyone know what object represents borders in PDF table

Unfortunately there is nothing like a border attribute in PDF content. Instead, borders are independent objects, usually vector graphics, either lines or very thin rectangles.

Thus, while scanning the page content (recursively, as indicated in A), you will have to look for Path instances (namespace org.pdfclown.documents.contents.objects) containing

  • moveTo m, lineTo l, and stroke S operations or
  • rectangle re and fill f operations.

(This answer may help.)

When you come across such lines, you will have to interpret them. These lines may be borders, but they may also be used as underlines, page decorations, ...
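To illustrate what such an interpretation step might look like, here is a minimal C# sketch. It deliberately does not call PDF Clown: the `Rule` type, the `BorderGrid` class, and all tolerance values are hypothetical names and numbers chosen for illustration. It assumes you have already collected the end points of stroked segments and the bounds of filled rectangles from the Path objects, and transformed them into page coordinates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical plain-data type (not a PDF Clown class): one candidate border line,
// given by its two end points in page coordinates.
public readonly struct Rule
{
    public readonly double X1, Y1, X2, Y2;

    public Rule(double x1, double y1, double x2, double y2)
    {
        X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }
}

public static class BorderGrid
{
    // A filled rectangle (re + f) counts as a line only if it is very thin.
    // maxThickness is an assumed threshold; tune it for your documents.
    public static Rule? FromRectangle(double x, double y, double w, double h,
                                      double maxThickness = 1.5)
    {
        if (Math.Abs(h) <= maxThickness && Math.Abs(w) > maxThickness)
            return new Rule(x, y + h / 2, x + w, y + h / 2);   // horizontal line
        if (Math.Abs(w) <= maxThickness && Math.Abs(h) > maxThickness)
            return new Rule(x + w / 2, y, x + w / 2, y + h);   // vertical line
        return null;                                           // a real box, not a line
    }

    public static bool IsHorizontal(Rule r, double tol = 0.5) =>
        Math.Abs(r.Y1 - r.Y2) <= tol && Math.Abs(r.X1 - r.X2) > tol;

    public static bool IsVertical(Rule r, double tol = 0.5) =>
        Math.Abs(r.X1 - r.X2) <= tol && Math.Abs(r.Y1 - r.Y2) > tol;

    // Collapse coordinates that lie within tol of the previous kept value,
    // so several short segments on the same line yield one grid position.
    public static List<double> Cluster(IEnumerable<double> values, double tol = 1.0)
    {
        var result = new List<double>();
        foreach (var v in values.OrderBy(v => v))
        {
            if (result.Count == 0 || v - result[result.Count - 1] > tol)
                result.Add(v);
        }
        return result;
    }

    // Candidate column boundaries (x) and row boundaries (y) of the table grid.
    public static (List<double> Columns, List<double> Rows) GridLines(IEnumerable<Rule> rules)
    {
        var list = rules.ToList();
        var columns = Cluster(list.Where(r => IsVertical(r)).Select(r => r.X1));
        var rows = Cluster(list.Where(r => IsHorizontal(r)).Select(r => r.Y1));
        return (columns, rows);
    }
}
```

Three column positions and three row positions would suggest a 2 x 2 grid. A merged cell might then show up as a boundary between two neighbouring grid cells that no rule covers, but checking that is left out of this sketch, and lines used as underlines or decoration still have to be filtered out by your own criteria.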

If the PDF happens to be tagged, things may be a bit easier insofar as you have to interpret less. Instead, you can read the tagging information, which may tell you where a cell starts and ends, so you do not need to interpret graphical lines. Unfortunately, tagged PDFs are still less common than untagged ones.

OR how to detect if a text is a header of the table?

Just as above, unless you happen to inspect a tagged PDF, there is nothing immediately telling you some text is a table header. You have to interpret again. Is that text outside of lines you determined to form a table? Is it inside at the top? Or just anywhere inside? Is it drawn in a specific font? Or larger? Different color? Etc.
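As a rough illustration of such an interpretation, the following C# sketch checks only two of the questions above: is the top row drawn in bold while the body is not, or is it drawn noticeably larger? The `CellText` type, the `HeaderGuess` class, and the thresholds are hypothetical and not part of PDF Clown; the sketch assumes you have already assigned each extracted text chunk to a table row and recorded its font size and whether its font looks bold.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one extracted text chunk; the caller fills it
// from whatever the content scan delivers.
public sealed class CellText
{
    public string Text = "";
    public int Row;           // 0 = topmost row inside the detected table
    public double FontSize;
    public bool Bold;         // e.g. guessed from the font name
}

public static class HeaderGuess
{
    // Heuristic only: true when the top row is styled differently from the body.
    public static bool TopRowLooksLikeHeader(IReadOnlyCollection<CellText> cells)
    {
        var top = cells.Where(c => c.Row == 0).ToList();
        var body = cells.Where(c => c.Row > 0).ToList();
        if (top.Count == 0 || body.Count == 0)
            return false;     // nothing to compare against

        bool allTopBold = top.All(c => c.Bold);
        bool bodyMostlyPlain = body.Count(c => c.Bold) * 2 < body.Count;

        // 0.5 is an assumed margin (in font size units) for "noticeably larger".
        bool topLarger = top.Average(c => c.FontSize) > body.Average(c => c.FontSize) + 0.5;

        return (allTopBold && bodyMostlyPlain) || topLarger;
    }
}
```

A table whose header differs only by color or by position outside the ruled area would need further checks of the same kind.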
