What is the best way to extract data from a PDF?

Question

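I have thousands of PDF files that I need to extract data from. Here is a sample PDF. I want to extract this information from the sample PDF.

I am open to Node.js, Python, or any other method that works. I know very little about Python and Node.js. I tried Python with this code: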

import PyPDF2

try:
    pdfFileObj = open('test.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageNumber = pdfReader.numPages
    page = pdfReader.getPage(0)
    print(pageNumber)
    pagecontent = page.extractText()
    print(pagecontent)
except Exception as e:
    print(e)

But I am stuck on how to find the procurement history. What is the best way to extract the procurement history from the PDF?

Answer 1

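I did something similar a long time ago to scrape my grades. The easiest (not pretty) solution I found was to convert the PDF to HTML and then parse the HTML.

To do so I used pdf2text/pdf2html ( https://pypi.org/project/pdf-tools/ ) and the html module from lxml.
I also used codecs, but I don't remember exactly why.

A quick and dirty summary: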

from lxml import html
import codecs
import os

# First convert the pdf to text/html
# You can skip this step if you already did it
os.system("pdf2txt -o file.html file.pdf")
# Open the file and read it
file = codecs.open("file.html", "r", "utf-8")
data = file.read()
# We know we're dealing with html, let's load it
html_file = html.fromstring(data)
# As it's an html object, we can use xpath to get the data we need
# In the following I get the text from <div><span>MY TEXT</span></div>
extracted_data = html_file.xpath('//div//span/text()')
# It returns an array of elements, let's process it
for elm in extracted_data:
    # Do things with each text fragment
    pass
file.close()

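Just inspect the output of pdf2text or pdf2html; with XPath you should then be able to extract your information easily.

I hope it helps!

EDIT: commented the code

EDIT2: the following code prints your data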

# Assuming you're only giving the page 4 of your document
# os.system("pdf2html test-page4.pdf > test-page4.html")

from lxml import html
import codecs
import os

file = codecs.open("test-page4.html", "r", "utf-8")
data = file.read()
html_file = html.fromstring(data)
# I updated xpath to your need
extracted_data = html_file.xpath('//div//p//span/text()')
for elm in extracted_data:
    line_elements = elm.split()
    # Just observed that what you need starts with a number
    if len(line_elements) > 0 and line_elements[0].isdigit():
        print(line_elements)
file.close()
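
The filter in EDIT2 can be pulled into a small function so the matching rows are written out as CSV instead of printed. This is only a minimal sketch: the names `rows_starting_with_number` and `rows_to_csv` and the sample strings are made up for illustration, and splitting on whitespace will break any column whose value itself contains spaces.

```python
import csv
import io


def rows_starting_with_number(text_fragments):
    """Keep only fragments whose first whitespace-separated token is all digits."""
    rows = []
    for fragment in text_fragments:
        tokens = fragment.split()
        if tokens and tokens[0].isdigit():
            rows.append(tokens)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical fragments standing in for the xpath() result
    sample = ["CAGE Contract Number", "12345 ABC-001 10 EA", "   ", "67890 XYZ-002 5 EA"]
    print(rows_to_csv(rows_starting_with_number(sample)), end="")
```

In the EDIT2 script, `extracted_data` would be passed in place of `sample`.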

Answer 2

pdfplumber is the best option. [Reference]

Installation

pip install pdfplumber

Extract all text

import pdfplumber
path = 'path_to_pdf.pdf'
with pdfplumber.open(path) as pdf:
    for page in pdf.pages:
        print(page.extract_text())
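
Since the question is about a table (the procurement history), it may be worth trying pdfplumber's `page.extract_table()` as well, which returns the largest table it detects on a page as a list of rows, or `None` if it finds none. The sketch below has not been tried on the sample PDF; `clean_row` and `extract_table_rows` are names invented here, and the default detection settings may need tuning for a particular layout.

```python
def clean_row(row):
    """Replace missing cells (None) with empty strings and trim whitespace."""
    return [(cell or "").strip() for cell in row]


def extract_table_rows(path):
    """Yield cleaned rows from the largest detected table on each page."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            table = page.extract_table()  # None when no table is detected
            if table is None:
                continue
            for row in table:
                yield clean_row(row)
```

Usage would look like `for row in extract_table_rows('path_to_pdf.pdf'): print(row)`.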

Answer 3

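OK. I help develop this commercial product from opait.com. I took your input PDF and marked up several zones in it, like this:

And your table:

In about 2 minutes I could be extracting from this and 1,000 similar documents. Note that this image is the log view, and that the data is exported as CSV. All the blue "links" are the actual extracted data and link back to the PDF, so you can see where each value came from. The output can also be XML, JSON, or other formats. What you see in that screenshot is the log view, and all of it is in CSV format (one file for the main attributes and one for each table, linked by record ID, in case your PDF contains 1,000 of these documents in a single PDF).

Again, I help develop this product, but it can do what you are asking. I extracted your entire table as well as all the other important fields.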

Answer 4

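PDFTron, the company I work for, has a fully automatic PDF-to-HTML output solution.

You can try it online here: https://www.pdftron.com/pdf-tools/pdf-table-extraction

Here is a screenshot of the HTML output for the file you provided. The output contains both HTML tables and the reflowable text content in between.

The output is standard XML HTML, so you can easily parse and manipulate the HTML tables.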

Answer 5

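I work for the company that makes PDFTables. The PDFTables API will help you solve this and convert all your PDFs in one go. It is a simple web-based API, so it can be called from any programming language. You need to create an account at PDFTables.com and then use a script in one of the example languages here: https://pdftables.com/pdf-to-excel-api. Here is an example using Python: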

import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')

file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"

for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.xlsx(os.path.join(file_path,file), file+'.xlsx')

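The script looks for every file in the folder with the extension ".pdf" and converts each one to XLSX format. You can change the format to ".csv", ".html", or ".xml". The first 75 pages are free.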
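
The same folder loop also works with a local extractor in place of the API call. With thousands of PDFs, one unreadable file should not stop the whole run, so the sketch below records failures and carries on. It is an illustration only: `process_folder` and its `extract` callback are hypothetical names, and `extract` stands for whatever per-file function you settle on.

```python
from pathlib import Path


def process_folder(folder, extract, suffix=".pdf"):
    """Run extract(path) on every matching file; return (results, failures)."""
    results, failures = {}, {}
    for path in sorted(Path(folder).iterdir()):
        # Skip subfolders and anything that is not a PDF (case-insensitive)
        if not path.is_file() or path.suffix.lower() != suffix:
            continue
        try:
            results[path.name] = extract(path)
        except Exception as exc:
            # Keep going, but remember which file failed and why
            failures[path.name] = repr(exc)
    return results, failures
```

Both return values are dictionaries keyed by file name, so the failed files can be retried or inspected by hand afterwards.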

Answer 6

Here is a four-line script in IntelliGet:

{ start = IsSubstring("CAGE   Contract Number",Line(-2));  
  end = IsEqual(0, Length(Line(1)));
  { start = 1;
    output = Line(0);
  }
}
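
For readers without IntelliGet, the same idea can be approximated in plain Python on text that has already been extracted. This reflects my reading of the script, which may not match IntelliGet's exact semantics: start emitting once the line two positions back contains the header text, and stop after a line whose successor is empty. The function name `lines_after_header` is made up, and the header spacing in real extracted text may differ from the literal used here.

```python
def lines_after_header(lines, header="CAGE   Contract Number", offset=2):
    """Collect lines starting `offset` lines below the header, up to a blank line."""
    collected = []
    capturing = False
    for i, line in enumerate(lines):
        # Start once the line `offset` positions back contains the header
        if not capturing and i >= offset and header in lines[i - offset]:
            capturing = True
        if capturing:
            collected.append(line)
            # Stop when the next line is blank (end of input counts as blank)
            next_line = lines[i + 1] if i + 1 < len(lines) else ""
            if next_line.strip() == "":
                break
    return collected
```

Unlike the script's `Length(Line(1))` check, this version also treats a whitespace-only line as empty.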

6 solutions

Solution 1
2 2019-09-14 22:03:24

Solution 2
2 2021-03-24 16:49:09

Solution 3
0 2019-09-15 01:53:45

Solution 4
0 2019-09-16 22:07:03

Solution 5
0 2019-09-19 09:53:01

Solution 6
0 2021-06-12 14:08:11
