
How to extract a given PDF to text and tables using Python and store the data in a .csv file?

I need to extract the first table (account number, branch name, etc.) and the last table (date, description, and amount).

PDF file: https://drive.google.com/file/d/1b537hdTUMQwWSOJHRan6ckHBUDhRBbvX/view?usp=sharing

I am getting blank output using the PyPDF2 library, and Camelot raises OSError: Ghostscript is not installed.

import PyPDF2

# Attempt 1: PyPDF2 (prints blank output for this file)
file_path = open(r"E:\user\programs\28_oct_bank_statement\demo.pdf", "rb")
pdf = PyPDF2.PdfFileReader(file_path)
pageObj = pdf.getPage(0)
print(pageObj.extractText())

# Attempt 2: Camelot (raises OSError: Ghostscript is not installed)
import camelot

data = camelot.read_pdf(r"demo.pdf", pages='all')
print(data)

Camelot has dependencies that need to be installed in order for it to work, such as Ghostscript. You'll first need to check whether Ghostscript is installed correctly. For Mac/Ubuntu:

>>> from ctypes.util import find_library
>>> find_library("gs")
"libgs.so.9"

For Windows:

>>> import ctypes
>>> from ctypes.util import find_library
>>> find_library("".join(("gsdll", str(ctypes.sizeof(ctypes.c_voidp) * 8), ".dll")))
<name-of-ghostscript-library-on-windows>

Otherwise, download Ghostscript for Windows from https://ghostscript.com/. I highly suggest reading through the Camelot documentation again if you run into more issues.
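The two checks above can be folded into one script that works on either platform. This is only a sketch: the helper names `ghostscript_library_name` and `ghostscript_installed` are made up for illustration and are not part of Camelot, and a successful lookup only shows that `ctypes` can locate the library, not that Camelot will run without other problems.

```python
import ctypes
import sys
from ctypes.util import find_library


def ghostscript_library_name() -> str:
    """Return the library name to look up on the current platform (hypothetical helper)."""
    if sys.platform.startswith("win"):
        # gsdll32.dll or gsdll64.dll, depending on the interpreter's pointer size
        bits = ctypes.sizeof(ctypes.c_void_p) * 8
        return f"gsdll{bits}.dll"
    # On Mac/Ubuntu the library is looked up as "gs"
    return "gs"


def ghostscript_installed() -> bool:
    """True if ctypes can locate the Ghostscript shared library (hypothetical helper)."""
    return find_library(ghostscript_library_name()) is not None


if __name__ == "__main__":
    if ghostscript_installed():
        print("Ghostscript found:", find_library(ghostscript_library_name()))
    else:
        print("Ghostscript not found; install it before retrying Camelot.")
```

If the script reports that Ghostscript is missing right after installing it, restarting the terminal or IDE so that it picks up the updated library path is worth trying before anything else.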

I usually use Apache Tika to do this.

As shown here

You can simply install it and then use a Python script:



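from tika import parser

# Parse the PDF; the result is a dict with 'content' and 'metadata' keys
parsed_pdf = parser.from_file("sample.pdf")

text = parsed_pdf['content']        # extracted text
metadata = parsed_pdf['metadata']   # document metadata
print(text)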

Note that you do need Java installed on the machine for it to run. It will return the text, and once you have the text you can look for a pattern within it to extract the exact data required.

The nice part about this is that it will also return the metadata of the PDF.
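To illustrate the pattern-matching step, here is a rough sketch that pulls date, description, and amount rows out of extracted text and writes them as CSV. The row layout (a DD/MM/YYYY date, free text, then an amount with two decimals), the sample lines, and the names `ROW_PATTERN`, `extract_transactions`, and `write_csv` are all assumptions for illustration; the real statement will almost certainly need a different regular expression.

```python
import csv
import io
import re

# Assumed (hypothetical) row layout: "DD/MM/YYYY  description  1,234.56"
ROW_PATTERN = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<description>.+?)\s+(?P<amount>-?[\d,]+\.\d{2})$"
)


def extract_transactions(text):
    """Return one dict per line of text that matches ROW_PATTERN."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append({
                "date": match["date"],
                "description": match["description"],
                # Drop thousands separators so the amount is easier to parse later
                "amount": match["amount"].replace(",", ""),
            })
    return rows


def write_csv(rows, out_file):
    """Write the rows to an open text file (or file-like object) as CSV."""
    writer = csv.DictWriter(out_file, fieldnames=["date", "description", "amount"])
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample text standing in for the text returned by the PDF parser
sample_text = """Account Number: 0000000000
28/10/2021  ATM WITHDRAWAL  1,500.00
29/10/2021  SALARY CREDIT  25,000.00
"""

rows = extract_transactions(sample_text)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("output.csv", "w", newline="")` to `write_csv`.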
