
Extracting tables from a PDF

I'm trying to get the data from the tables in this PDF. I've tried pdfminer and pypdf with a little luck, but I can't really get the data from the tables.

This is what one of the tables looks like: [image]

As you can see, some columns are marked with an 'x'. I'm trying to turn this table into a list of objects.

This is the code so far; I'm using pdfminer now.

# pdfminer test
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def pdfToText(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos = set()

    records = []
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            # process page
            interpreter.process_page(page)

            # take this page's text, then empty the buffer so the next
            # page does not see the same lines again
            lines = retstr.getvalue().splitlines()
            retstr.seek(0)
            retstr.truncate(0)

            # only select lines from the line containing 'Tool' to the line containing "1 The 'All'"
            idx = containsSubString(lines, 'Tool')
            if idx != -1:
                lines = lines[idx + 1:]
            idx = containsSubString(lines, "1 The 'All'")
            if idx != -1:
                lines = lines[:idx]

            records.extend(lines)

    device.close()
    retstr.close()

    return records


def containsSubString(items, substring):
    # return the index of the first item containing substring, or -1
    for i, s in enumerate(items):
        if substring in s:
            return i
    return -1


# process pdf
fn = '../test1.pdf'
ft = 'test.txt'

text = pdfToText(fn)
with open(ft, 'w', encoding='utf-8') as outFile:
    for line in text:
        # splitlines() dropped the line endings, so add them back
        outFile.write(line + '\n')

That produces a text file and it gets all of the text, but the spacing between the x's isn't preserved. The output looks like this: [image]

The x's are just single-spaced in the text document.
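One way to recover which column each x belongs to is to read pdfminer's layout objects instead of the flattened text, since every layout object carries a bounding box. The sketch below assumes the pdfminer.six fork (its `extract_pages` helper is not used in the code above). The helper names, column names, and boundary values are invented for illustration; real boundaries would have to be read off the header row of the actual PDF.

```python
from bisect import bisect_right


def assign_columns(words, boundaries, names):
    """Map (x0, text) pairs from one table row to column names.

    boundaries holds the left edge of every column except the first,
    in ascending order, so len(names) == len(boundaries) + 1.
    """
    row = {name: '' for name in names}
    for x0, text in sorted(words):
        name = names[bisect_right(boundaries, x0)]
        row[name] = (row[name] + ' ' + text).strip()
    return row


def page_words(path):
    """Yield (y0, x0, text) for every text line in the PDF (all pages)."""
    # Imported lazily so assign_columns stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_layout in extract_pages(path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        yield line.y0, line.x0, line.get_text().strip()


# Hypothetical column layout, in PDF points.
names = ['Tool', 'All', 'Windows', 'Linux']
boundaries = [200, 300, 400]
row = assign_columns([(310.5, 'x'), (72.0, 'Hammer'), (205.0, 'x')],
                     boundaries, names)
print(row)
```

The remaining step, grouping the lines yielded by `page_words` into rows by their `y0` value, depends on the row height of the real document and is left out here.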

Right now, I'm just producing text output, but my goal is to produce an HTML document with the data from the tables. I've been searching for OCR examples, and most of them seem confusing or incomplete. I'm open to using C# or any other language that might produce the results I'm looking for.
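For the HTML step, once the rows exist as a list of dictionaries, the standard library is enough. This is a minimal sketch; the function name `rows_to_html` and the sample headers are made up for illustration, and no styling or surrounding page markup is attempted.

```python
from html import escape


def rows_to_html(headers, rows):
    """Render a list of dicts as an HTML table, one <tr> per dict."""
    parts = ['<table>', '  <tr>']
    parts += ['    <th>{}</th>'.format(escape(h)) for h in headers]
    parts.append('  </tr>')
    for row in rows:
        parts.append('  <tr>')
        for h in headers:
            # Missing keys become empty cells; values are escaped for HTML.
            parts.append('    <td>{}</td>'.format(escape(str(row.get(h, '')))))
        parts.append('  </tr>')
    parts.append('</table>')
    return '\n'.join(parts)


# Hypothetical data in the shape described in the question.
table_html = rows_to_html(['Tool', 'All'], [{'Tool': 'A & B', 'All': 'x'}])
print(table_html)
```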

EDIT: There will be multiple PDFs like this that I need to get the table data from. The headers will be the same for all PDFs (as far as I know).

I figured it out; I was going in the wrong direction. What I did was create PNGs of each table in the PDF, and now I'm processing the images using OpenCV and Python.
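The answer does not say how the images are processed, so the following is only one possible sketch of the idea, not the author's code. It assumes the table has ruled grid lines, finds them by counting ink pixels per row and column, and then checks each cell for leftover ink (an x). The grid logic is plain Python so it can be followed without OpenCV; only the hypothetical `load_binary` helper touches `cv2`, and the threshold of 128 is a guess.

```python
def load_binary(path):
    """Load a PNG as rows of 0/1 values (1 = dark ink). Needs opencv-python."""
    import cv2  # imported lazily; the functions below do not need it
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise IOError('could not read ' + path)
    _, binary = cv2.threshold(gray, 128, 1, cv2.THRESH_BINARY_INV)
    return binary.tolist()


def rule_positions(binary, axis, min_fraction=0.8):
    """Return the centers of ruling lines along one axis.

    axis=0 looks for horizontal rules (rows that are mostly ink),
    axis=1 for vertical rules (columns that are mostly ink).
    """
    lines = binary if axis == 0 else list(zip(*binary))
    hits = [i for i, line in enumerate(lines)
            if sum(line) >= min_fraction * len(line)]
    # Merge runs of adjacent indices (thick rules) into a single center.
    centers, run = [], []
    for i in hits:
        if run and i != run[-1] + 1:
            centers.append(sum(run) // len(run))
            run = []
        run.append(i)
    if run:
        centers.append(sum(run) // len(run))
    return centers


def cell_has_ink(binary, top, bottom, left, right, margin=1, min_pixels=1):
    """True if the cell interior contains ink; margin skips the rules."""
    ink = sum(binary[y][x]
              for y in range(top + margin, bottom - margin + 1)
              for x in range(left + margin, right - margin + 1))
    return ink >= min_pixels


def read_marks(binary, margin=1):
    """Return a grid of booleans, one per table cell, True where marked."""
    ys = rule_positions(binary, axis=0)
    xs = rule_positions(binary, axis=1)
    return [[cell_has_ink(binary, t, b, l, r, margin)
             for l, r in zip(xs, xs[1:])]
            for t, b in zip(ys, ys[1:])]


# Tiny synthetic 2x2 table: rules on rows/columns 0, 3 and 6,
# with a single ink pixel in the bottom-right cell.
demo = [[1 if (y in (0, 3, 6) or x in (0, 3, 6)) else 0 for x in range(7)]
        for y in range(7)]
demo[4][5] = 1
marks = read_marks(demo)
print(marks)
```

On a real scan the rules are several pixels thick and text cells also contain ink, so `margin` and `min_pixels` would need tuning, and text columns would still need OCR or the PDF's own text layer.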

Try Tabula, and if it works, use the tabula-extractor library (written in Ruby) to extract the data programmatically.
