
Parse tables from a PDF document

The PDF at this link ( http://www.lenovo.com/psref/pdf/psref450.pdf ) contains a number of tables like this:

[Image: example table from the PDF]

I'd like to programmatically extract the data and the structure from these tables.

Things I've tried: converting the PDF to HTML using

  1. Tika: Unfortunately, the tables are converted to space-delimited paragraphs, and some of the strings contain spaces, so it's not possible to split them.
  2. Python's PDFMiner: returned an assertion error due to missing fonts. I suspect the HTML would have been similar to the output from Tika, though I'll need to resolve the issue with the missing fonts to confirm this.
  3. Online tools: I tried http://www.zamzar.com/ and a couple of others. The file was either too big to process (for the online services) or it generated errors.

I was planning to convert the PDF to HTML and then parse it with BeautifulSoup.

The output could be JSON (e.g. one object per table), XML, or pretty much any format that maintains the structure.
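To make the target format concrete, here is a minimal sketch of what "one object per table" could look like in JSON. The function name `table_to_object`, the table name, and the column names are assumptions made up for illustration; only the sample cell values come from the PDF.

```python
import json

def table_to_object(name, header, rows):
    """Turn one extracted table into a JSON-serializable dict."""
    # Pair each cell with its column name so the structure survives.
    records = [dict(zip(header, row)) for row in rows]
    return {"table": name, "columns": list(header), "rows": records}

# Hypothetical column names; the cell values are sample rows from the PDF.
header = ["Model", "Processor", "Memory"]
rows = [
    ["2469-2TU", "i5-3320M", "4GBx1"],
    ["2469-2SU", "i5-3210M", "4GBx1"],
]

tables = [table_to_object("example table", header, rows)]
print(json.dumps(tables, indent=2))
```

Any format that keeps the column-to-value pairing would do equally well; JSON is just the easiest to inspect.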

You could try PDFBox. The documentation for that is here:

https://pdfbox.apache.org/1.8/cookbook/textextraction.html

Extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions. You can set up text regions to determine which numbers/letters/characters are drawn in which region. Since you know the layout of the regions is tabular, you'll be able to define tables and tell which column and row the extracted text belongs to using simple algorithms.
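The "simple algorithms" part can be sketched independently of PDFBox. The following is an illustration in Python, not PDFBox code: it assumes you have already collected the x positions of the vertical lines, the y positions of the horizontal lines, and a list of `(x, y, text)` fragments, with y already converted so that it grows downward. The names `locate` and `build_table` are hypothetical.

```python
from bisect import bisect_right

def locate(value, edges):
    """Index of the interval between consecutive edges that holds value, or None."""
    i = bisect_right(edges, value) - 1
    return i if 0 <= i < len(edges) - 1 else None

def build_table(col_edges, row_edges, fragments):
    """Place (x, y, text) fragments into the grid bounded by the ruling lines."""
    col_edges, row_edges = sorted(col_edges), sorted(row_edges)
    n_rows, n_cols = len(row_edges) - 1, len(col_edges) - 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    # Visit fragments top to bottom, left to right, so each cell reads in order.
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        r, c = locate(y, row_edges), locate(x, col_edges)
        if r is not None and c is not None:  # ignore text outside the grid
            grid[r][c].append(text)
    return [[" ".join(cell) for cell in row] for row in grid]

# Made-up coordinates: three vertical and three horizontal lines form a 2x2 grid.
table = build_table(
    col_edges=[0, 100, 200],
    row_edges=[0, 20, 40],
    fragments=[(110, 5, "i5-3320M"), (10, 5, "2469-2TU"),
               (10, 25, "2469-2SU"), (110, 25, "i5-3210M")],
)
print(table)
```

Merged cells and tables without ruling lines would need extra handling that this sketch leaves out.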

@alex-woolford: In general, perfect extraction of data (with or without the same formatting that you see in the PDF) is not always possible, though it is possible to some extent short of 100%. I'm saying this based on having worked on a project similar to yours earlier. I came across issues similar to yours, and some research on the Net showed that PDF in general is not a perfectly reversible format, i.e. it is not always possible to recover the text and format from a PDF with 100% accuracy. Sometimes characters even get lost, or transposed, and so on, during the extraction process (using some library). This seems to be due to the very nature of the PDF format and specification. It is not a text-based format. It is a derivative of PostScript and has some weird rules about layout of data. And this is according to official PDF documents, or according to the sites of product companies who have been working with PDF for a long time, and whose products are well known.

If less than perfect accuracy is tolerable, there are some products available (though I don't know of any for Python, as of now). One is xpdf and another is PDFTextStream. I've used the former, not the latter. xpdf is a C library and also has command-line tools. PDFTextStream is a Java tool/library. It was a paid product earlier, but last I checked, it is now free for single-threaded applications, IIRC.

Even though xpdf is for C and PDFTextStream is for Java, you could call them from Python via XML-RPC or some other distributed computing / cross-language communication approach such as sockets. Some work would be involved for that, of course.

HTH.
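For xpdf's command-line tools specifically, a plain subprocess call is a lighter alternative to XML-RPC or sockets. The sketch below swaps in that simpler technique; it assumes the `pdftotext` executable from xpdf is installed and on the PATH, and the helper names `pdftotext_command` and `extract_text` are hypothetical.

```python
import subprocess

def pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build the argument list for pdftotext, sending the text to stdout."""
    # -layout keeps the physical layout, which helps columns stay aligned.
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # A text-file argument of "-" makes pdftotext write to stdout.
    return cmd + [pdf_path, "-"]

def extract_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

With the tool installed, something like `extract_text("psref450.pdf", first_page=1, last_page=2)` would return the laid-out text of the first two pages, which still has to be split into columns afterwards.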

Only FYI, as mine is not a publicly available tool: it sure is possible. Here is this one table in plain text form -- the spaces in between are tabs, not spaces:

2469-2TU    i5-3320M    4GBx1   14.0" HD    720p    500G 7200   Intel 620528    WWAN upg    Express 54  Finger  BT  6   Win7 Pro64  10/12
✂ 2469-2SU  i5-3210M    4GBx1   14.0" HD    720p    500G 7200   Intel 2200  WWAN upg    Express 54  None    None    6   Win7 Pro64  10/12
✂ 2469-2RU  i3-3110M    4GBx1   14.0" HD    720p    320G 7200   Intel 2200  WWAN upg    Express 54  None    None    6   Win7 Pro64  10/12
2469-32U    i5-3230M    4GBx1   14.0" HD    720p    320G 7200   Intel 6205  WWAN upg    None    Finger  BT  6   Win7 Pro64  02/13
2469-2ZU    i5-3230M    4GBx1   14.0" HD    720p    320G 7200   Intel 2200  WWAN upg    None    None    None    6   Win7 Pro64  02/13
2469-2YU    i5-3320M    4GBx1   14.0" HD    720p    320G 7200   Intel 6205  WWAN upg    None    Finger  BT  6   Win7 Pro64  02/13
2469-2XU    i5-3320M    4GBx1   14.0" HD    720p    320G 7200   Intel 6205  WWAN upg    None    None    None    6   Win7 Pro64  02/13
2469-2WU    i5-3320M    4GBx1   14.0" HD    720p    320G 7200   WLAN upg    WWAN upg    None    Finger  BT  6   Win7 Pro64  02/13

I second PDFBox, as it works similarly to my own hand-written utility: interrogate (x,y) positions, sort, then paste together "likely" strings and insert a tab when the horizontal space is larger than one would reasonably expect.

I even got the little Scissors in Zapf Dingbats :)
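The sort-and-paste idea can be sketched roughly as follows. This is not the utility described above, only an illustration of the same approach: it assumes each fragment arrives as `(x, y, width, text)` with y growing downward, and the thresholds `line_tol`, `word_gap` and `col_gap` are made-up values that would need tuning per document.

```python
def fragments_to_lines(fragments, line_tol=2.0, word_gap=1.0, col_gap=10.0):
    """Join (x, y, width, text) fragments into tab-separated lines."""
    # Group fragments whose y positions are close enough to share a line.
    lines = []  # each entry: [y of the line's first fragment, fragments]
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(frag[1] - lines[-1][0]) <= line_tol:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[1], [frag]])

    out = []
    for _, frags in lines:
        frags.sort(key=lambda f: f[0])  # left to right within the line
        text, prev_end = "", None
        for x, _, width, s in frags:
            if prev_end is not None:
                space = x - prev_end
                if space > col_gap:
                    text += "\t"   # wider than expected: new column
                elif space > word_gap:
                    text += " "    # ordinary word spacing
            text += s
            prev_end = x + width
        out.append(text)
    return out

# Made-up coordinates for two rows of a table.
sample = [
    (70, 10, 48, "i5-3320M"), (0, 10, 48, "2469-2TU"),
    (140, 10, 24, "500G"), (167, 10.5, 24, "7200"),
    (0, 22, 48, "2469-2SU"), (70, 22, 48, "i5-3210M"),
]
for line in fragments_to_lines(sample):
    print(repr(line))
```

Here "500G" and "7200" are only 3 units apart, so they stay in one cell, while the 22-unit gaps become tabs.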

Parse tables from a PDF document using pdfplumber

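import pdfplumber
import pandas as pd

filepath = r"actualFile_path"
outfile = r"destination_path"

# Infer rows and columns from text alignment instead of ruling lines.
table_settings = {"vertical_strategy": "text", "horizontal_strategy": "text"}

with pdfplumber.open(filepath) as pdf:
    for page in pdf.pages:
        table = page.extract_table(table_settings=table_settings)
        if not table:  # no table detected on this page
            continue
        # Assumption: the first extracted row holds the column headers.
        df = pd.DataFrame(table[1:], columns=table[0])
        # Append each page's table to the same CSV file.
        df.to_csv(outfile, mode="a", index=False)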
