How to extract multiple pandas dataframes from tables in PDF file and store them as CSVs in Python?

I have a codebook PDF file containing various tables that describe the variables used in one of the datasets I am working with. Since the actual data consists of coded values that I need to look up, I need to create multiple CSV output files from all the tables in this codebook.

For instance, page 15 of this PDF file has the table shown below, which I need to extract as a pandas dataframe so that I can save it as a CSV file for later use. I do not care about the "Totals" in these tables, since I only need the value and label fields. [image]

I tried to solve this problem using the camelot library in Python:

import camelot
# try extracting table from 1 of the pages
tables = camelot.read_pdf('/Users/Downloads/TEDS-A-2018-DS0001-info-codebook_v1.pdf', pages = '12')

# check data
>>> type(tables)
<class 'camelot.core.TableList'>
>>> len(tables)
0

I am not sure why I do not get any tables in the output. Any help is highly appreciated.

Update: I have also tried the tabula library; however, I only get the odd rows of a table and not the even rows. Here is my code for this attempt:


import tabula as tb

pdf_loc = 'csvs/TEDS-A-2018-DS0001-info-codebook_v1.pdf'
# read_pdf returns a list of dataframes, one per detected table
list_of_dataframs = tb.read_pdf(input_path=pdf_loc, pages='all')

number_of_dfs = len(list_of_dataframs)

print('first df in list')
list_of_dataframs[0]

Here is the output: [image]

The PDF codebook can be found here

You can use Tabula and try adjusting a few of its parameters.

In your case, the table structure is similar throughout the PDF, so we can use Tabula's columns parameter to define our own column structure. If we don't set this parameter, Tabula tries to guess the column structure on its own, and it sometimes fails to identify the right table structure.

import tabula
tables = tabula.read_pdf(filename, area=(0, 0, 800, 800), pages=15, columns=(95, 410, 490), pandas_options={'header': None})

With that parameter I get the output below for page 15 of the PDF: [image: output]
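To make it clearer what the column boundaries do, here is a small pure-Python sketch of the idea. It is only a conceptual model, not Tabula's actual implementation: each text fragment is placed in a column according to where its x-coordinate falls relative to the boundaries, so three boundaries give four columns. The helper names `column_index` and `split_row` and the sample fragments are made up for illustration.

```python
from bisect import bisect_right

# Boundaries copied from the read_pdf call (x-coordinates in PDF points).
BOUNDARIES = (95, 410, 490)


def column_index(x, boundaries=BOUNDARIES):
    """Return the 0-based column that a fragment starting at x falls into."""
    return bisect_right(boundaries, x)


def split_row(fragments, boundaries=BOUNDARIES):
    """Group (x, text) fragments of one row into len(boundaries) + 1 cells."""
    cells = [""] * (len(boundaries) + 1)
    for x, text in sorted(fragments):
        i = column_index(x, boundaries)
        cells[i] = (cells[i] + " " + text).strip()
    return cells


# Made-up fragments for one table row: (x position, text).
row = [(30, "1"), (100, "Male"), (420, "1,000"), (500, "50%")]
print(split_row(row))
```

If a value ends up in the wrong column, the corresponding boundary is probably a few points off and can be nudged left or right.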

We can apply this to all the pages, and of course we can also add preprocessing to remove unnecessary rows so that you end up with clean tabular data. I would be happy to help further if this works for you.
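A rough sketch of what that could look like is below. It assumes the same area and column boundaries fit every page, and that the rows you want to drop contain the word "Total" in some cell; both are assumptions to verify against the actual PDF. The function names `clean_table` and `pdf_tables_to_csvs` and the `table_001.csv` naming scheme are hypothetical, chosen only for this example.

```python
from pathlib import Path

import pandas as pd


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Drop fully empty rows/columns and rows that look like totals."""
    df = df.dropna(how="all").dropna(axis=1, how="all")
    # Assumption: a summary row has "Total" somewhere in one of its cells.
    # This would also drop a legitimate label containing "total".
    is_total = df.astype(str).apply(
        lambda col: col.str.contains("total", case=False, na=False)
    ).any(axis=1)
    return df[~is_total].reset_index(drop=True)


def pdf_tables_to_csvs(pdf_path, out_dir, pages="all"):
    """Extract tables with tabula and write one CSV per table."""
    import tabula  # imported lazily: tabula-py needs a Java runtime

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    tables = tabula.read_pdf(
        pdf_path,
        pages=pages,
        area=(0, 0, 800, 800),
        columns=(95, 410, 490),
        pandas_options={"header": None},
    )
    written = []
    for i, table in enumerate(tables, start=1):
        csv_path = out / f"table_{i:03d}.csv"
        clean_table(table).to_csv(csv_path, index=False, header=False)
        written.append(csv_path)
    return written
```

Calling `pdf_tables_to_csvs(pdf_loc, 'csvs/tables')` would then write one file per extracted table. Pages without a table (cover, contents) may still yield dataframes of plain text, so some output files will likely need to be discarded.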
