How to extract text and tables from a given PDF using Python and store the data in a .csv file?

Question

I need to extract the first table (account number, branch name, etc.) and the last table (date, description, and amount).

PDF file: https://drive.google.com/file/d/1b537hdTUMQwWSOJHRan6ckHBUDhRBbvX/view?usp=sharing

I am getting blank output using the PyPDF2 library, and Camelot gives OSError: Ghostscript is not installed.

import PyPDF2
file_path =open(r"E:\user\programs\28_oct_bank_statement\demo.pdf", "rb")
pdf = PyPDF2.PdfFileReader(file_path)
pageObj = pdf.getPage(0)
print(pageObj.extractText())

import camelot

data = camelot.read_pdf(r"demo.pdf", pages='all')
print(data)

Answer 1

Camelot has dependencies that need to be installed in order for it to work, such as Ghostscript. You'll first need to check that it is installed correctly. For macOS/Ubuntu:

from ctypes.util import find_library
find_library("gs")
"libgs.so.9"

For Windows:

import ctypes
from ctypes.util import find_library
find_library("".join(("gsdll", str(ctypes.sizeof(ctypes.c_voidp) * 8), ".dll")))
<name-of-ghostscript-library-on-windows>

Otherwise, download Ghostscript for Windows from https://ghostscript.com/. I highly suggest reading through the Camelot documentation again if you run into more issues.
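
The two checks above can be folded into one helper that picks the library name for the current platform. This is only a sketch, and it assumes a Camelot version that still relies on Ghostscript. The function names (ghostscript_library_name, ghostscript_available, export_tables) and the CSV naming scheme are made up for illustration; read_pdf and Table.to_csv are Camelot's own calls.

```python
import ctypes
import sys
from ctypes.util import find_library


def ghostscript_library_name():
    """Return the library name to look up on the current platform."""
    if sys.platform.startswith("win"):
        # gsdll32.dll or gsdll64.dll, matching the interpreter's bitness
        bits = ctypes.sizeof(ctypes.c_void_p) * 8
        return f"gsdll{bits}.dll"
    return "gs"


def ghostscript_available():
    """True if ctypes can locate the Ghostscript shared library."""
    return find_library(ghostscript_library_name()) is not None


def export_tables(pdf_path, csv_prefix):
    """Write every table Camelot finds to <csv_prefix>_<n>.csv (hypothetical naming)."""
    if not ghostscript_available():
        raise OSError("Ghostscript library not found; install it first")
    import camelot  # imported lazily so the check above works without Camelot

    tables = camelot.read_pdf(pdf_path, pages="all")
    paths = []
    for index in range(len(tables)):
        path = f"{csv_prefix}_{index + 1}.csv"
        tables[index].to_csv(path)  # tables[index].df holds the same data as a DataFrame
        paths.append(path)
    return paths
```

Calling export_tables("demo.pdf", "statement") would then produce statement_1.csv, statement_2.csv and so on, one file per detected table. Whether the first and last files match the account table and the transaction table depends on how Camelot segments this particular PDF.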

Answer 2

I usually use Apache Tika to do this.

You can simply install it and then use it from a Python script:

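from tika import parser

# Tika runs on Java, so Java must be installed on the machine
parsed_pdf = parser.from_file("sample.pdf")

text = parsed_pdf['content']        # extracted text of the PDF
metadata = parsed_pdf['metadata']   # metadata of the PDF
print(text)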

Note that you need Java installed on the machine for it to run. It will return the text, and once you have the text you can look for a pattern in it to extract the exact data required.

The nice part about this is that it also returns the metadata of the PDF.
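
As a rough illustration of the "identify a pattern" step, here is a sketch that pulls header fields and transaction rows out of extracted text with regular expressions and writes the rows as CSV. The line layouts are assumptions, since the actual statement text is not shown here: header lines like "Account Number: ..." and rows like "28/10/2021  DESCRIPTION  1,500.00". SAMPLE_TEXT is invented data standing in for parsed_pdf['content'], and the function names are hypothetical.

```python
import csv
import io
import re

# Assumed row layout: date, free-text description, amount with two decimals.
ROW_PATTERN = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<description>.+?)\s+(?P<amount>-?[\d,]+\.\d{2})$"
)
# Assumed header layout: "Key: value" for the fields of interest.
FIELD_PATTERN = re.compile(r"^(?P<key>Account Number|Branch Name)\s*:\s*(?P<value>.+)$")


def parse_statement(text):
    """Return (header fields dict, list of (date, description, amount) rows)."""
    fields, rows = {}, []
    for line in text.splitlines():
        line = line.strip()
        field = FIELD_PATTERN.match(line)
        if field:
            fields[field["key"]] = field["value"].strip()
            continue
        row = ROW_PATTERN.match(line)
        if row:
            rows.append((row["date"], row["description"], row["amount"]))
    return fields, rows


def write_rows_csv(rows, out_file):
    """Write the transaction rows to an open text file as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["date", "description", "amount"])
    writer.writerows(rows)


# Invented sample standing in for the text returned by Tika.
SAMPLE_TEXT = """Account Number: 1234567890
Branch Name: Main Street

28/10/2021  ATM WITHDRAWAL  1,500.00
some unrelated line
29/10/2021 SALARY CREDIT 25,000.00
"""

fields, rows = parse_statement(SAMPLE_TEXT)
buffer = io.StringIO()  # stand-in for open("transactions.csv", "w", newline="")
write_rows_csv(rows, buffer)
print(fields)
print(buffer.getvalue())
```

For a real file, replace the StringIO buffer with open("transactions.csv", "w", newline="") and adjust both patterns to whatever the extracted text actually looks like.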

Answer 1
0 2021-10-28 09:31:30

Answer 2
0 2021-10-28 09:37:39