
Extracting tables, lines, and coordinates from PDF using PDFsharp/C#

Asked 2015-07-23 06:42:47 · Tags: c#, pdfsharp

I have several reports saved as PDFs, each containing several tables between text and images. I'm not sure whether these tables are real tables or just lines. I opened the files in LibreOffice Writer and saw only lines, but I still can't tell whether that is how Writer handles PDF tables or whether they really are just lines.

How can I make sure that these tables are really tables, and how can I extract them? If they are only lines, how can I extract the lines and the text together with their coordinates?

I'm using PDFsharp. Thanks for any help.

2 answers

PDFsharp was not designed to extract text from PDF files.

In the PDFsharp forum you can find some code for that purpose:
http://forum.pdfsharp.net/viewtopic.php?p=1603#p1603
http://forum.pdfsharp.net/viewtopic.php?p=4010#p4010

There are no tables in PDF. There are instructions that draw text, and there are instructions that draw lines.
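To make "instructions that draw text" and "instructions that draw lines" concrete, here is a minimal sketch that walks a page content stream and lists what it finds. It assumes the stream has already been decoded to plain text by whatever library you use. The class `ContentStreamSketch` and its method are hypothetical names, not part of PDFsharp. Only the standard PDF operators `m`, `l`, `re`, `BT`, `Td` and `Tj` are handled; transforms (`cm`, `Tm`), curves, `TJ` arrays and escaped or nested strings are ignored, so the coordinates are raw stream values rather than final page positions.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not a PDFsharp API): lists line and text instructions.
public static class ContentStreamSketch
{
    public static List<string> Describe(string stream)
    {
        var found = new List<string>();
        var operands = new List<string>();
        double curX = 0, curY = 0;   // current path point
        double textX = 0, textY = 0; // start of the current text line

        foreach (string token in Tokenize(stream))
        {
            // Operands come first; the operator that consumes them follows.
            if (IsOperand(token)) { operands.Add(token); continue; }

            if (token == "m" && operands.Count >= 2)          // moveto
            {
                curX = Num(operands[0]); curY = Num(operands[1]);
            }
            else if (token == "l" && operands.Count >= 2)     // lineto
            {
                double x = Num(operands[0]), y = Num(operands[1]);
                found.Add(FormattableString.Invariant($"line ({curX},{curY})-({x},{y})"));
                curX = x; curY = y;
            }
            else if (token == "re" && operands.Count >= 4)    // rectangle
            {
                found.Add(FormattableString.Invariant(
                    $"rect at ({Num(operands[0])},{Num(operands[1])}) size {Num(operands[2])}x{Num(operands[3])}"));
            }
            else if (token == "BT") { textX = 0; textY = 0; } // begin text object
            else if (token == "Td" && operands.Count >= 2)    // move text position
            {
                textX += Num(operands[0]); textY += Num(operands[1]);
            }
            else if (token == "Tj" && operands.Count >= 1)    // show text
            {
                string text = operands[operands.Count - 1].Trim('(', ')');
                found.Add(FormattableString.Invariant($"text \"{text}\" at ({textX},{textY})"));
            }
            operands.Clear(); // every operator consumes its operands
        }
        return found;
    }

    private static bool IsOperand(string token)
    {
        return token.StartsWith("(") || token.StartsWith("/") ||
            double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, out _);
    }

    private static double Num(string token)
    {
        return double.Parse(token, CultureInfo.InvariantCulture);
    }

    // Splits on whitespace, keeping "(literal strings)" together (no nesting or escapes).
    private static IEnumerable<string> Tokenize(string s)
    {
        int i = 0;
        while (i < s.Length)
        {
            if (char.IsWhiteSpace(s[i])) { i++; continue; }
            int start = i;
            if (s[i] == '(')
            {
                i = s.IndexOf(')', i);
                i = i < 0 ? s.Length : i + 1;
            }
            else
            {
                while (i < s.Length && !char.IsWhiteSpace(s[i]) && s[i] != '(') i++;
            }
            yield return s.Substring(start, i - start);
        }
    }
}
```

For example, `ContentStreamSketch.Describe("10 20 m 110 20 l S BT /F1 12 Tf 15 25 Td (Cell A) Tj ET")` reports one line and one text run. Nothing in the stream says the two belong to a table; that relationship has to be inferred from the coordinates.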

I know that this is an old question, but someone might still need an answer.

"Quite obvious" introduction:
PDF files are streams of graphics objects (for example lines) and text. When the PDF is rendered, the human eye perceives tables because of the lines and the text between them.

The (my) solution
Starting from a PDF reader (iTextSharp) you need to:
1. read the lines (hopefully only vertical and horizontal lines);
2. join the lines (one line of a table may be drawn as several segments, for example one per cell);
3. work out where the tables are (sometimes making assumptions based on your needs);
4. optionally find the text outside the tables (it is better to keep all the text) and insert it in paragraphs;
5. insert the text inside the cells of the table.
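Step 2 can be sketched without any PDF library. The code below is only an illustration under stated assumptions, not the method of any particular project: segments are already axis-aligned and of one orientation, each is described by a shared coordinate `Pos` (y for horizontal lines, x for vertical ones) and a span `From`..`To`, and segments whose `Pos` differs by less than a tolerance count as the same line. The names `LineJoinSketch`, `Segment` and `Join`, and the default tolerance of 0.5, are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineJoinSketch
{
    // An axis-aligned segment: Pos is the shared coordinate,
    // From/To span the other axis (always stored with From <= To).
    public struct Segment
    {
        public Segment(double pos, double from, double to)
        {
            Pos = pos; From = Math.Min(from, to); To = Math.Max(from, to);
        }
        public double Pos { get; }
        public double From { get; }
        public double To { get; }
    }

    // Merges segments that lie on the same line and touch or overlap.
    public static List<Segment> Join(IEnumerable<Segment> segments, double tolerance = 0.5)
    {
        var result = new List<Segment>();
        // Snap Pos to a grid of size `tolerance` so nearly collinear segments share a key.
        var lines = segments.GroupBy(a => Math.Round(a.Pos / tolerance)).OrderBy(g => g.Key);
        foreach (var line in lines)
        {
            int first = result.Count; // index where this line's output starts
            foreach (var seg in line.OrderBy(a => a.From))
            {
                int lastIndex = result.Count - 1;
                if (lastIndex >= first && seg.From <= result[lastIndex].To + tolerance)
                {
                    // Extend the previous segment instead of adding a new one.
                    var last = result[lastIndex];
                    result[lastIndex] = new Segment(last.Pos, last.From, Math.Max(last.To, seg.To));
                }
                else
                {
                    result.Add(seg);
                }
            }
        }
        return result;
    }
}
```

Run it once on the horizontal segments and once on the vertical ones. After joining, the distinct `Pos` values of the long lines are natural candidates for the row and column boundaries needed in step 3. Snapping to a grid can split two segments that fall on opposite sides of a grid boundary, so the tolerance may need tuning for your files.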

If you need something already written to start from (it works for my PDFs), you can find it here: https://github.com/bubibubi/ExtractTablesFromPdf
It uses the GPL version of iTextSharp.
The project contains code to extract lines, boxes, tables, and table contents.
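As a rough illustration of the last step in the list above (putting text into cells), and not the code from the linked project, the sketch below assumes you already have sorted column and row boundaries taken from the joined lines, plus text runs with one anchor coordinate each. All names (`CellFillSketch`, `FindBand`, `Fill`) are hypothetical. Because PDF y coordinates normally grow upward, row 0 here is the bottom row when the boundaries are sorted in ascending order.

```csharp
using System;
using System.Collections.Generic;

public static class CellFillSketch
{
    // Returns i such that bounds[i] <= value < bounds[i + 1], or -1 if outside.
    public static int FindBand(double[] bounds, double value)
    {
        for (int i = 0; i + 1 < bounds.Length; i++)
        {
            if (value >= bounds[i] && value < bounds[i + 1]) return i;
        }
        return -1;
    }

    // colBounds and rowBounds must be sorted ascending and hold at least two values each.
    public static string[,] Fill(double[] colBounds, double[] rowBounds,
        IEnumerable<(double X, double Y, string Text)> texts)
    {
        var cells = new string[rowBounds.Length - 1, colBounds.Length - 1];
        foreach (var t in texts)
        {
            int col = FindBand(colBounds, t.X);
            int row = FindBand(rowBounds, t.Y);
            if (row < 0 || col < 0) continue; // text outside the table: keep it elsewhere
            // Several runs may land in one cell; join them with a space.
            cells[row, col] = cells[row, col] == null ? t.Text : cells[row, col] + " " + t.Text;
        }
        return cells;
    }
}
```

Text that falls outside every band is skipped here; in a fuller version it would go into the surrounding paragraphs, as the answer suggests.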