
PDF table to pandas data frame using camelot

I'm trying to find a simple way to get data from a PDF into a pandas data frame. Something like this:

import camelot

# read_pdf returns a TableList; each table's .df is already a pandas DataFrame
tables = camelot.read_pdf("file1.pdf")

print(tables[0].df)

I'm trying this with two different files, File 1 and File 2, but I can't get the data out of the second one. It has more columns, but I don't think that should be a problem.

Also, the only way I could get a table from File 2 was by using flavor="stream".

Result for File 1

Result for File 2

To extract the tables from the second file correctly, Camelot has to process the background lines. Pass process_background=True, a parameter of the lattice method (the default flavor), as in the following code:

import camelot

# process_background=True makes the lattice parser also detect lines drawn in the background
tables = camelot.read_pdf("file2.pdf", process_background=True)

for table in tables:
    print(table.df)
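If the same script has to handle both kinds of file, the two approaches mentioned here can be combined. The following is only a minimal sketch: the helper name extract_frames and its reader argument are hypothetical (they are not part of Camelot), and it assumes that an empty result from the lattice pass is a good enough signal to fall back to flavor="stream".

```python
def extract_frames(path, reader=None):
    """Return one pandas DataFrame per table found in the PDF at `path`.

    `reader` is a hypothetical hook for this sketch; it defaults to
    camelot.read_pdf and exists only so the logic can be tried with a stub.
    """
    if reader is None:
        import camelot  # imported lazily so a stub reader works without Camelot
        reader = camelot.read_pdf

    # First pass: lattice parser, also looking at lines drawn in the background.
    tables = reader(path, flavor="lattice", process_background=True)

    # Assumed fallback: if lattice finds nothing, try the stream parser.
    # process_background is a lattice-only option, so it is not passed here.
    if len(tables) == 0:
        tables = reader(path, flavor="stream")

    return [table.df for table in tables]
```

Calling extract_frames("file2.pdf") would then give a list of data frames that can be printed or concatenated as needed. Whether the stream fallback gives usable columns depends on the file, so its output should still be checked by eye.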
