
Extracting tables, lines, and coordinates from PDF using PDFSharp/C#

I have several reports saved as PDF that contain tables between text and images. I'm not sure whether these tables are real tables or just lines. When I opened the files in LibreOffice Writer they were only lines, but I can't tell whether that reflects how Writer handles PDF tables or whether the files really contain nothing but lines.

How can I check whether these are real tables, and how do I extract them? If they are only lines, how do I extract the lines and the text along with their coordinates?

I'm using PDFSharp. Thanks for any help.

PDFsharp was not designed to extract text from PDF files.

In the PDFsharp forum you can find some code for that purpose:
http://forum.pdfsharp.net/viewtopic.php?p=1603#p1603
http://forum.pdfsharp.net/viewtopic.php?p=4010#p4010

There are no tables in PDF. There are instructions that draw text, there are instructions that draw lines.
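
To make that concrete, here is a minimal sketch in Python (not PDFsharp code) that reads a tiny, made-up content-stream fragment and collects the straight segments drawn by the standard path operators `m` (move to), `l` (line to) and `re` (rectangle). The sample stream and the name `extract_segments` are assumptions for illustration only. Real streams are usually compressed, and the sketch ignores transformation matrices (`cm`), curves, and whether a path is actually stroked; some producers also draw table rules as thin filled rectangles rather than lines.

```python
import re

# A tiny, uncompressed content-stream fragment (made up for illustration):
# one stroked line, one stroked rectangle, and one piece of text.
SAMPLE = """
50 700 m 300 700 l S
50 650 100 50 re S
BT /F1 12 Tf 60 670 Td (Total) Tj ET
"""

# A token is either a (string literal) without nested parentheses,
# or a run of characters that are neither whitespace nor parentheses.
TOKEN = re.compile(r"\((?:\\.|[^\\)])*\)|/?[^\s()]+")

def extract_segments(stream):
    """Return segments ((x0, y0), (x1, y1)) described by m/l/re operators."""
    segments, operands, current = [], [], None
    for token in TOKEN.findall(stream):
        try:
            operands.append(float(token))  # numeric operand: keep it
            continue
        except ValueError:
            pass
        if token == "m" and len(operands) >= 2:
            current = (operands[-2], operands[-1])
        elif token == "l" and len(operands) >= 2 and current is not None:
            end = (operands[-2], operands[-1])
            segments.append((current, end))
            current = end
        elif token == "re" and len(operands) >= 4:
            x, y, w, h = operands[-4:]
            corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
            # A rectangle contributes its four sides as separate segments.
            segments.extend(zip(corners, corners[1:] + corners[:1]))
        operands = []  # any non-numeric token clears the pending numbers
    return segments

for segment in extract_segments(SAMPLE):
    print(segment)
```

The sample yields five segments: the single line plus the four sides of the rectangle. Nothing in the stream says "table"; the text `(Total)` is only positioned so that it happens to fall inside the rectangle.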

I know this is an old question, but someone may still need this.

A "quite obvious" introduction:
A PDF page is a stream of graphics objects (for example, lines) and text. When the PDF is rendered, the human eye perceives tables because of the lines and the text between them.

My solution
Starting from a PDF reader (iTextSharp), you need to:
1. read the lines (hopefully only vertical and horizontal ones);
2. join the lines (one line of a table may be drawn as several segments, for example one per cell);
3. work out where the tables are (sometimes making assumptions based on your needs);
4. optionally find the text outside the tables (it is better to keep all the text) and put it into paragraphs;
5. insert the text into the cells of the table.
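
Steps 2, 3 and 5 can be sketched independently of any PDF library. The following is a rough Python illustration, not the code from my project: it assumes the lines and the text positions have already been read, that the table is a plain rectangular grid without merged cells, and that there is only one table. The names `merge_collinear`, `locate_cell`, `fill_table` and the tolerance `TOL` are made up for this example.

```python
from bisect import bisect_right

TOL = 1.0  # assumed tolerance in PDF points; tune it for your documents

def merge_collinear(segments, tol=TOL):
    """Join axis-aligned segments that share a line and touch or overlap.

    A segment is (fixed, a, b): fixed is the y of a horizontal rule (or the
    x of a vertical one) and a..b is its extent along the other axis.
    """
    # Snap the fixed coordinate to a tol-sized grid so near-equal values group.
    snapped = sorted((round(f / tol) * tol, min(a, b), max(a, b))
                     for f, a, b in segments)
    merged = []
    for fixed, start, end in snapped:
        if merged and merged[-1][0] == fixed and start <= merged[-1][2] + tol:
            merged[-1][2] = max(merged[-1][2], end)  # extend the current rule
        else:
            merged.append([fixed, start, end])       # start a new rule
    return [tuple(m) for m in merged]

def locate_cell(xs, ys, x, y):
    """Return (row, col) of the grid cell containing (x, y), or None.

    xs and ys are the sorted coordinates of the vertical and horizontal
    rules. Rows are counted from the top because PDF y grows upwards.
    """
    col = bisect_right(xs, x) - 1
    band = bisect_right(ys, y) - 1
    if not (0 <= col < len(xs) - 1 and 0 <= band < len(ys) - 1):
        return None
    return (len(ys) - 2 - band, col)

def fill_table(h_lines, v_lines, texts):
    """Build a row-major table of strings from rules and positioned text."""
    ys = sorted({y for y, _, _ in merge_collinear(h_lines)})
    xs = sorted({x for x, _, _ in merge_collinear(v_lines)})
    table = [["" for _ in xs[1:]] for _ in ys[1:]]
    for x, y, text in texts:
        cell = locate_cell(xs, ys, x, y)
        if cell is not None:  # text outside the grid is skipped in this sketch
            row, col = cell
            table[row][col] = (table[row][col] + " " + text).strip()
    return table
```

For example, with horizontal rules at y = 600, 650 and 700 (the upper two drawn as one segment per cell), vertical rules at x = 50, 150 and 300, and the texts "Item", "Price", "Pen" and "1.20" positioned inside the four cells, `fill_table` returns `[["Item", "Price"], ["Pen", "1.20"]]`. A title placed above the grid is ignored here; in a real extractor you would keep it as paragraph text, as in step 4.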

If you want existing code to start from (it works for my PDFs), see https://github.com/bubibubi/ExtractTablesFromPdf
It uses the GPL version of iTextSharp.
The project contains code to extract lines, boxes, tables, and table contents.
