
What is the best way to extract data from a PDF?

I have thousands of PDF files that I need to extract data from. This is an example PDF. I want to extract this information from the example PDF.

(screenshot of the desired data omitted)

I am open to Node.js, Python, or any other effective method. I have little knowledge of Python and Node.js. I attempted using Python with this code:

import PyPDF2

try:
    pdfFileObj = open('test.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageNumber = pdfReader.numPages
    page = pdfReader.getPage(0)
    print(pageNumber)
    pagecontent = page.extractText()
    print(pagecontent)
except Exception as e:
    print(e)

but I got stuck on how to find the procurement history. What is the best way to extract the procurement history from the PDF?

I did something similar to scrape my grades a long time ago. The easiest (not pretty) solution I found was to convert the PDF to HTML, then parse the HTML.

To do so I used pdf2text/pdf2html ( https://pypi.org/project/pdf-tools/ ) and lxml's html module.
I also used codecs, but I don't remember exactly why.

A quick and dirty summary:

from lxml import html
import codecs
import os

# First convert the pdf to text/html
# You can skip this step if you already did it
os.system("pdf2txt -o file.html file.pdf")
# Open the file and read it
file = codecs.open("file.html", "r", "utf-8")
data = file.read()
# We know we're dealing with html, let's load it
html_file = html.fromstring(data)
# As it's an html object, we can use xpath to get the data we need
# In the following I get the text from <div><span>MY TEXT</span><div>
extracted_data = html_file.xpath('//div//span/text()')
# It returns an array of elements, let's process it
for elm in extracted_data:
    pass  # Do things with each element here
file.close()

Just check the result of pdf2text or pdf2html; then, using XPath, you should be able to extract your information easily.

I hope it helps!

EDIT: commented the code

EDIT2: The following code prints your data

# Assuming you're only giving the page 4 of your document
# os.system("pdf2html test-page4.pdf > test-page4.html")

from lxml import html
import codecs
import os

file = codecs.open("test-page4.html", "r", "utf-8")
data = file.read()
html_file = html.fromstring(data)
# I updated xpath to your need
extracted_data = html_file.xpath('//div//p//span/text()')
for elm in extracted_data:
    line_elements = elm.split()
    # Just observed that what you need starts with a number
    if len(line_elements) > 0 and line_elements[0].isdigit():
        print(line_elements)
file.close()

pdfplumber is the best option. [ Reference ]

Installation

pip install pdfplumber

Extract all the text

import pdfplumber
path = 'path_to_pdf.pdf'
with pdfplumber.open(path) as pdf:
    for page in pdf.pages:
        print(page.extract_text())
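Since the procurement history is a table, pdfplumber's `page.extract_table()` (which returns each row as a list of cell strings) may fit better than raw text extraction. The sketch below only illustrates the filtering step; the sample rows are invented, since the original PDF isn't available here:

```python
# rows would normally come from pdfplumber:
#   rows = page.extract_table()
# The sample rows below are invented purely to illustrate the filtering.
rows = [
    ["CAGE", "Contract Number", "Quantity", "Date"],  # header row
    ["1ABC2", "SPE7L119V5035", "10", "2019-02-01"],
    [None, None, None, None],                         # blank row
    ["3XYZ4", "SPM7L114V0194", "2", "2014-05-09"],
]

# Keep only data rows: first cell present and starting with a digit,
# the same "starts with a number" heuristic used in an earlier answer.
data_rows = [row for row in rows if row and row[0] and row[0][0].isdigit()]

for row in data_rows:
    print(row)
```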

OK. I help with the development of this commercial product from opait.com. I took your input PDF and zoned a few areas in it like this:

(screenshot omitted)

And also the table you have:

(screenshot omitted)

And in about 2 minutes I can extract this from this document and from 1,000 documents like it. Note this image is the log view, which exports that data as CSV. All the blue "links" are the actual data extracted; they link back into the PDF so you can see where each value came from. The output could also be XML, JSON, or other formats. What you see in that screen capture is the log view; all of it is in CSV format (one file for the main properties and one for each table, linked by a record ID, in case one PDF contained 1,000 of these documents).

Again, I help with the development of this product, but what you ask for can be done. I extracted your entire table, as well as all the other fields that would be important.

(screenshot omitted)

PDFTron, the company I work for, has a fully automated PDF-to-HTML output solution.

You can try it out online here: https://www.pdftron.com/pdf-tools/pdf-table-extraction

Here is a screenshot of the HTML output for the file you provided. The output contains HTML tables as well as re-flowable text content in between.

(screenshot omitted)

The output is standard XHTML, so you can easily parse/manipulate the HTML tables.
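For example, exported tables can be walked with Python's standard library. This sketch parses an invented, well-formed table snippet standing in for one table from such an export (a real file could be loaded with `ET.parse(...)` instead, assuming the export is valid XHTML):

```python
import xml.etree.ElementTree as ET

# An invented, well-formed stand-in for one exported table; a real export
# would be loaded from file, e.g. ET.parse("output.html"), if it is valid XHTML.
snippet = """
<table>
  <tr><th>Contract</th><th>Qty</th></tr>
  <tr><td>SPE7L119V5035</td><td>10</td></tr>
  <tr><td>SPM7L114V0194</td><td>2</td></tr>
</table>
"""

table = ET.fromstring(snippet)
header = [th.text for th in table[0]]  # first row holds the column headers
records = [dict(zip(header, (td.text for td in tr))) for tr in table[1:]]

for record in records:
    print(record)
```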

I work for the company that makes PDFTables. The PDFTables API would help you solve this problem and convert all your PDFs at once. It's a simple web-based API, so it can be called from any programming language. You'll need to create an account at PDFTables.com, then use a script in one of the example languages here: https://pdftables.com/pdf-to-excel-api . Here's an example using Python:

import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')

file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"

for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.xlsx(os.path.join(file_path, file), file + '.xlsx')

The script looks for all files within a folder that have the extension '.pdf', then converts each file to XLSX format. You can change the format to '.csv', '.html' or '.xml'. The first 75 pages are free.
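Once the files are converted, the CSV output can be post-processed with the standard `csv` module. In this sketch, an in-memory string stands in for one converted file, and the column names are invented for illustration:

```python
import csv
import io

# Invented sample standing in for one converted CSV file; a real run
# would use open("file.pdf.csv", newline="") instead.
sample_csv = io.StringIO(
    "Contract Number,Quantity,Unit Price\n"
    "SPE7L119V5035,10,1.25\n"
    "SPM7L114V0194,2,3.40\n"
)

reader = csv.DictReader(sample_csv)
rows = list(reader)

# Example aggregation over the extracted rows
total_quantity = sum(int(row["Quantity"]) for row in rows)
print(total_quantity)
```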

That's four lines of script in IntelliGet:

{ start = IsSubstring("CAGE   Contract Number",Line(-2));  
  end = IsEqual(0, Length(Line(1)));
  { start = 1;
    output = Line(0);
  }
}
