Best way to extract data from pdf and add them to a dataframe
I have a lot of pdfs (same layout) and I want to extract the data from them and add them to a df with 3 columns. Also, I want the script to run until all the pdfs in the folder have been processed. The answers I've found so far aren't useful.

This is a pdf sample. I want the amounts in the red shape.
The data are department incomes and I want the table to look like this:

| Date | Department | Amount |
|---|---|---|
| 1/7/21 | Accomodation AI 13% | 3000 |
| 1/7/21 | Accomodation HB 13% | 1500 |
| 1/7/21 | Restaurant #2 24% | 2500 |
If this is for a recurring accounting process, I would strongly suggest getting good software (such as Adobe or Simpo PDF Converter) that first converts the pdf to a csv file; then you can use the pandas method read_csv. This is the cleanest approach in my 10+ years of accounting/finance/programming, as these pdf files tend to change and then break everything. However, to fully answer your question, it can be done with the help of Java and the python module tabula. To do it this way...
import os
import glob
import pandas as pd
from tabula import read_pdf  # tabula-py; requires Java to be installed

# load in all your files
path = '<path where pdf files are>'
pdf_files = glob.glob(os.path.join(path, "*.pdf"))

for file in pdf_files:
    # read_pdf returns a list of DataFrames; you may need to adjust the encoding and pages
    df = read_pdf(file, encoding='utf-8', pages='1')[0]
    # Let x be a string that we will use to name each dataframe. Here I am starting at the
    # 20th character up to the last 4 (which will be '.pdf'), but you will need to adjust
    # this depending on your filepath
    x = file[20:-4]
    # You may need to replace dashes or spaces with '_', or you may not need this at all
    x = x.replace('-', '_')
    # Use exec to set the dataframe name (really the reference) so each file you load will have its own df
    exec(x + '=df')
    # Not necessary, but might be helpful until the code is optimal
    print(x, df.shape)
And now, you will have a dataframe for each file in your directory. You can then map and transform them, output each one to a csv, or even use pandas concat to stack them all together and output one file.
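For instance, stacking the per-file results with `pd.concat` is a one-liner. A minimal sketch (the two frames and their values are made up for illustration; in practice you would collect each `df` from the loop above into a list instead of using `exec`-created names):

```python
import pandas as pd

# Hypothetical per-file DataFrames, each with the same 3 columns
frames = [
    pd.DataFrame({"Date": ["1/7/21"], "Department": ["Accomodation AI 13%"], "Amount": [3000]}),
    pd.DataFrame({"Date": ["1/7/21"], "Department": ["Restaurant #2 24%"], "Amount": [2500]}),
]

# Stack all frames into one DataFrame and write a single csv
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("all_departments.csv", index=False)
```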
You could also try camelot. It would look something like:
import camelot

# You might need to tweak some additional params, which I am happy to help you with
cam = camelot.read_pdf(file_path, flavor='lattice')
tables = cam.tables
# let's just assume camelot extracted 1 table
df = tables[0].df
# let's assume that the df looks as expected
column_you_want = list(df['column_title'])
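To get from camelot's raw output to the 3-column frame in the question, you will usually need a small cleanup step, since camelot returns every cell as a string and labels the columns 0, 1, 2, ... rather than with header names. A sketch with stand-in values:

```python
import pandas as pd

# Stand-in for tables[0].df: all cells are strings, columns are 0, 1, 2, ...
raw = pd.DataFrame([
    ["1/7/21", "Accomodation AI 13%", "3000"],
    ["1/7/21", "Accomodation HB 13%", "1500"],
])

# Rename the numeric columns and convert Amount to a real number
df = raw.rename(columns={0: "Date", 1: "Department", 2: "Amount"})
df["Amount"] = pd.to_numeric(df["Amount"])
```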