
How to extract text and tables from a given PDF using Python and store the data in a .csv file?

I need to extract the first table (account number, branch name, etc.) and the last table (date, description, and amount).

PDF file: https://drive.google.com/file/d/1b537hdTUMQwWSOJHRan6ckHBUDhRBbvX/view?usp=sharing

I am getting blank output using the PyPDF2 library, and Camelot raises OSError: Ghostscript is not installed.

import PyPDF2

# PyPDF2 attempt: prints blank output
file_path = open(r"E:\user\programs\28_oct_bank_statement\demo.pdf", "rb")
pdf = PyPDF2.PdfFileReader(file_path)
pageObj = pdf.getPage(0)
print(pageObj.extractText())

import camelot

# Camelot attempt: raises OSError: Ghostscript is not installed
data = camelot.read_pdf(r"demo.pdf", pages='all')
print(data)

Camelot has dependencies that need to be installed for it to work, such as Ghostscript. You'll first need to check whether Ghostscript is installed correctly. For Mac/Ubuntu:

from ctypes.util import find_library
find_library("gs")
"libgs.so.9"

For Windows:

import ctypes
from ctypes.util import find_library
find_library("".join(("gsdll", str(ctypes.sizeof(ctypes.c_voidp) * 8), ".dll")))
<name-of-ghostscript-library-on-windows>

Otherwise, on Windows, download Ghostscript from https://ghostscript.com/. I highly suggest reading through the Camelot documentation again if you run into more issues.
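The two checks above can be folded into one small helper. This is only a sketch: the function name find_ghostscript is made up for illustration, and it assumes the Ghostscript library uses the same names as in the snippets above (gs on Mac/Ubuntu, gsdll32.dll or gsdll64.dll on Windows).

```python
import ctypes
import sys
from ctypes.util import find_library


def find_ghostscript():
    """Return the Ghostscript library name/path if found, otherwise None.

    Hypothetical helper: it only repeats the two platform checks shown above.
    """
    if sys.platform.startswith("win"):
        # Pointer size in bits picks gsdll32.dll or gsdll64.dll
        bits = ctypes.sizeof(ctypes.c_void_p) * 8
        return find_library(f"gsdll{bits}.dll")
    # Mac/Ubuntu: look for the shared library, e.g. libgs.so.9
    return find_library("gs")


found = find_ghostscript()
if found:
    print("Ghostscript library found:", found)
else:
    print("Ghostscript library not found; install it before using Camelot")
```

If this reports that the library was not found, installing Ghostscript and re-running the check is probably the first thing to try before calling camelot.read_pdf again.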

I usually use Apache Tika to do this.

As shown here

You can simply install it (for example with pip install tika) and then use it from a Python script:



from tika import parser

# Parse the PDF; the result is a dict with 'content' and 'metadata' keys
parsed_pdf = parser.from_file("sample.pdf")

text = parsed_pdf['content']
metadata = parsed_pdf['metadata']
print(text)

Note that you do need Java installed on the machine for Tika to run. It will return the text, and once you have the text you can look for a pattern within it to extract the exact data required.

The nice part about Tika is that it will also return the metadata of the PDF.
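The pattern-matching step mentioned above could look roughly like the sketch below. The row layout is an assumption, since the real statement format is not shown here: each transaction is taken to be one line of the form "DD/MM/YYYY description amount". The names ROW_PATTERN, extract_transactions and write_csv, and the sample text, are made up for illustration; in practice text would be parsed_pdf['content'] and the regex would need adjusting to the actual statement.

```python
import csv
import io
import re

# Assumed (hypothetical) row layout: "DD/MM/YYYY  description  amount"
ROW_PATTERN = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})$"
)


def extract_transactions(text):
    """Return (date, description, amount) tuples for lines matching the pattern."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            # Drop thousands separators so the amount is easier to parse later
            amount = match["amount"].replace(",", "")
            rows.append((match["date"], match["description"], amount))
    return rows


def write_csv(rows, out_file):
    """Write a header row followed by the extracted rows to an open file object."""
    writer = csv.writer(out_file)
    writer.writerow(["date", "description", "amount"])
    writer.writerows(rows)


# Made-up sample standing in for the text returned by Tika
sample_text = """Account statement
28/10/2021  ATM WITHDRAWAL  1,500.00
29/10/2021  SALARY CREDIT  25,000.00
Closing balance
"""

rows = extract_transactions(sample_text)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with open("output.csv", "w", newline="") and pass that file object to write_csv.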

