Best way to extract data from pdf and add them to a dataframe

Question

I have a lot of pdfs (same layout) and I want to extract the data from them and add them to a df with 3 columns. Also, I want the script to run until all the pdfs in the folder have been inserted. The answers I've found so far aren't useful.

This is a pdf sample. I want the amounts in the red shape.

The data are department incomes and I want the table to look like this:

Date	Department	Amount
1/7/21	Accomodation AI 13%	3000
1/7/21	Accomodation HB 13%	1500
1/7/21	Restaurant #2 24%	2500

Answer 1

If this is for a recurring accounting process, I would strongly suggest getting good software (such as Adobe or Simpo PDF Converter) that first converts the pdf to a csv file, which you can then load with the pandas method read_csv.
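Once the pdfs have been converted, loading the resulting csv files could look roughly like the sketch below. It assumes every converted file has a header row with Date, Department and Amount columns, which your converter may or may not produce; the helper name load_csv_folder is made up for illustration.

```python
import glob
import os

import pandas as pd


def load_csv_folder(folder):
    """Read every *.csv in `folder` and stack the rows into one DataFrame."""
    frames = []
    for csv_path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # Assumed layout: a header row with Date, Department, Amount
        frames.append(pd.read_csv(csv_path))
    if not frames:
        # Nothing to read: return an empty frame with the expected columns
        return pd.DataFrame(columns=["Date", "Department", "Amount"])
    return pd.concat(frames, ignore_index=True)
```

Calling load_csv_folder on the folder that holds the converted files returns a single dataframe with one row per department line.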

This is the cleanest approach in my 10+ years of accounting/finance/programming, as these pdf files tend to change and then break everything. However, to fully answer your question, it can be done with the help of Java and the python module tabula. To do it this way:

Install Java
Install the python module tabula: pip install tabula-py
Use the code below, which will likely require some tweaking

import glob
import os

import pandas as pd
import tabula  # installed as tabula-py; needs a Java runtime

# load in all your files
path = '<path where pdf files are>'
pdf_files = glob.glob(os.path.join(path, "*.pdf"))

# One dataframe per file, keyed by file name (safer than creating variables with exec)
dfs = {}

for file in pdf_files:
    # read_pdf returns a list of dataframes, one per detected table;
    # you may need to adjust the encoding and pages
    tables = tabula.read_pdf(file, encoding='utf-8', pages='1')
    if not tables:
        print(file, 'no table found')
        continue
    df = tables[0]

    # Let x be a string that names each dataframe: the file name without the folder and '.pdf'
    x = os.path.splitext(os.path.basename(file))[0]

    # You may need to replace dashes or spaces with '_' or you may not need this at all
    x = x.replace('-', '_')

    dfs[x] = df

    # Not necessary but might be helpful until code is optimal
    print(x, df.shape)


And now, you will have a dataframe for each file in your directory. You can then map and transform them, output each one to a csv, or even use pandas concat to stack them all together and output one file.
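The concat step could be sketched as follows. The frames here are small stand-ins for the per-file dataframes, with made-up names and values, and the helper stack_frames and the source_file column are illustrative rather than part of pandas.

```python
import pandas as pd

# Stand-ins for the per-file dataframes (names and values are made up)
frames = {
    "income_2021_07_01": pd.DataFrame(
        {"Department": ["Accomodation AI 13%"], "Amount": [3000]}
    ),
    "income_2021_07_02": pd.DataFrame(
        {"Department": ["Restaurant #2 24%"], "Amount": [2500]}
    ),
}


def stack_frames(frames_by_name):
    """Stack all frames and record which file each row came from."""
    tagged = [
        df.assign(source_file=name) for name, df in frames_by_name.items()
    ]
    return pd.concat(tagged, ignore_index=True)


combined = stack_frames(frames)
# To write one output file, uncomment:
# combined.to_csv("all_departments.csv", index=False)
```

Keeping the source file name on each row makes it easier to trace a wrong amount back to the pdf it came from.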

Answer 2

You could also try camelot.

It would look something like this:

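import camelot

# file_path is the path to one of your pdfs
# (you might need to tweak some additional params, which I am happy to help you with)
tables = camelot.read_pdf(file_path, flavor='lattice')
# let's just assume camelot extracted 1 table
df = tables[0].df
# camelot labels columns 0, 1, 2, ...; promote the first row to column titles
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
# let's assume that the df looks as expected
column_you_want = list(df['column_title'])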
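To get from an extracted table to the Date / Department / Amount layout asked for in the question, something like the sketch below might work. It assumes the department names and amounts sit in two known columns, that every cell arrives as text, and that amounts look like "3000" or "3,000"; the helper name tidy_table and its parameters are hypothetical, and the date has to be supplied by you (for example from the file name).

```python
import pandas as pd


def tidy_table(raw, date, department_col=0, amount_col=1):
    """Build a Date/Department/Amount frame from a raw extracted table."""
    departments = raw[department_col].astype(str).str.strip()
    # Drop thousands separators, then turn non-numeric cells into NaN
    amounts = pd.to_numeric(
        raw[amount_col].astype(str).str.replace(",", "", regex=False),
        errors="coerce",
    )
    tidy = pd.DataFrame(
        {"Date": date, "Department": departments, "Amount": amounts}
    )
    # Rows without a numeric amount (titles, blank lines) are discarded
    return tidy.dropna(subset=["Amount"]).reset_index(drop=True)
```

Running this once per pdf and stacking the results with pandas concat gives the single 3-column table.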

Answer 1
0 2022-07-23 19:46:51

Answer 2
0 2022-07-25 22:29:26
