Best way to extract data from pdf and add them to a dataframe

I have a lot of PDFs (same layout) and I want to extract the data from them and add it to a df with 3 columns. Also, I want the script to run until all the PDFs in the folder have been processed. The answers I've found so far haven't been useful.

This is a pdf sample. I want the amounts in the red shape. [Image: pdf sample]

The data are department incomes and I want the table to look like this:

Date     Department            Amount
1/7/21   Accomodation AI 13%   3000
1/7/21   Accomodation HB 13%   1500
1/7/21   Restaurant #2 24%     2500

If this is for a recurring accounting process, I would strongly suggest getting good software (such as Adobe or Simpo PDF Converter) that first converts the pdf to a csv file; you can then load that file with the pandas method read_csv.
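Once a converter has produced one CSV per PDF, loading the whole folder is short. The following is a minimal sketch, assuming the converted files sit in one folder and share the same columns. The function name `load_converted_csvs` and the extra `source_file` column are illustrative and not part of the original answer.

```python
from pathlib import Path

import pandas as pd


def load_converted_csvs(folder):
    """Read every CSV in `folder` and stack the rows into one DataFrame."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(csv_path)
        # Hypothetical helper column: remember which file each row came from.
        frame["source_file"] = csv_path.name
        frames.append(frame)
    if not frames:
        # Nothing to stack; return an empty frame instead of failing in concat.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Because the loop simply walks the folder, newly converted files are picked up on the next run without changing the script.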

Converting to CSV first is the cleanest approach I have found in my 10+ years of accounting/finance/programming, as these pdf files tend to change and then break everything. However, to fully answer your question, it can be done with the help of Java and the Python module tabula. To do it this way:

  1. Install Java
  2. Install the Python module tabula: pip install tabula-py
  3. Use the code below, which will likely require some tweaking

import glob
import os

import pandas as pd
from tabula import read_pdf

# Folder that holds all your pdf files
path = '<path where pdf files are>'
pdf_files = glob.glob(os.path.join(path, "*.pdf"))

# One dataframe per file, keyed by name (safer than creating variables with exec)
dfs = {}

for file in pdf_files:
    # read_pdf returns a list of dataframes, one per table it detects;
    # you may need to adjust the encoding and pages
    tables = read_pdf(file, encoding='utf-8', pages='1')
    if not tables:
        print(file, 'no tables found')
        continue

    # Name each dataframe after its file, without the folder or the '.pdf' extension
    name = os.path.splitext(os.path.basename(file))[0]

    # You may need to replace dashes or spaces with '_' or you may not need this at all
    name = name.replace('-', '_')

    # Keep the first table found; adjust this if your layout yields several
    dfs[name] = tables[0]

    # Not necessary but might be helpful until code is optimal
    print(name, dfs[name].shape)

And now, you will have a dataframe for each file in your directory. You can then map and transform them, output each one to a csv, or even use pandas concat to stack them all together and output one file.
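The stacking step can be sketched as follows. This assumes the per-file dataframes have been collected in a dict keyed by a name of your choosing and that they share the same columns. The function name `stack_frames` and the `source` column are illustrative and not from the original answer.

```python
import pandas as pd


def stack_frames(frames_by_name):
    """Stack a {name: DataFrame} mapping into one DataFrame.

    A `source` column (hypothetical name) records which entry each row came from.
    """
    if not frames_by_name:
        # pd.concat raises on an empty input, so return an empty frame instead.
        return pd.DataFrame()
    labelled = [
        frame.assign(source=name) for name, frame in frames_by_name.items()
    ]
    return pd.concat(labelled, ignore_index=True)
```

Writing the result with `to_csv(..., index=False)` then gives a single output file for all PDFs.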

You could also try camelot.

It would look something like:

import camelot

# file_path is the path to one pdf (you might need to tweak some additional params, which I am happy to help you with)
tables = camelot.read_pdf(file_path, flavor='lattice')
# let's just assume camelot extracted 1 table
df = tables[0].df
# let's assume that the df looks as expected; 'column_title' is a placeholder,
# and camelot often labels columns 0, 1, 2, ... with the header as the first row
column_you_want = list(df['column_title'])
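Extracted tables usually arrive as text, with the header sitting in the first data row, so some tidying is needed before the amounts can be summed. Here is a rough sketch, assuming the header is in row 0 and amounts use commas as thousands separators; both are assumptions about your PDFs, and `tidy_extracted_table` is a hypothetical helper name.

```python
import pandas as pd


def tidy_extracted_table(raw, amount_column):
    """Promote the first row to column names and make one column numeric."""
    table = raw.iloc[1:].reset_index(drop=True)
    table.columns = [str(name).strip() for name in raw.iloc[0]]
    # Assumption: commas are thousands separators; cells that cannot be parsed become NaN.
    cleaned = table[amount_column].astype(str).str.replace(",", "", regex=False)
    table[amount_column] = pd.to_numeric(cleaned, errors="coerce")
    return table
```

If your PDFs use a different number format (for example a comma as the decimal mark), the replacement step would need to change accordingly.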
