How to extract multiple pandas dataframes from tables in PDF file and store them as CSVs in Python?

Question

I have a codebook PDF file containing various tables that describe the variables used in one of the datasets I am working with. Since the actual data consists of coded values that I need to look up, I need to create multiple CSV output files from all the tables present in this codebook.

For instance, page 15 of this PDF file has the table below, which I need to extract as a pandas dataframe so that I can save it as a CSV file for later use. I do not care about the "Totals" in these tables since I only need the value and label fields.

I tried to solve this problem using the camelot library in Python:

import camelot
# try extracting table from 1 of the pages
tables = camelot.read_pdf('/Users/Downloads/TEDS-A-2018-DS0001-info-codebook_v1.pdf', pages = '12')

# check data
>>> type(tables)
<class 'camelot.core.TableList'>
>>> len(tables)
0

I am not sure why I do not get any tables in the output. Any help is highly appreciated.

Update: I have also tried the tabula library; however, I only get the odd rows of a table and not the even rows. Here is my code for this attempt:


import tabula as tb  # tabula-py

pdf_loc = 'csvs/TEDS-A-2018-DS0001-info-codebook_v1.pdf'
list_of_dataframs = tb.read_pdf(input_path=pdf_loc, pages='all')

number_of_dfs = len(list_of_dataframs)

print('first df in list')
list_of_dataframs[0]

Here is the output -

The PDF codebook can be found here

Answer 1

Tabula can handle this if you adjust a few of its parameters.

In your case, the structure of the tables is similar throughout the PDF, so we can use Tabula's columns parameter to define the column structure ourselves. If this parameter is not given, Tabula tries to guess the column structure on its own, and it sometimes fails to identify the right table structure.

import tabula
tables = tabula.read_pdf(filename, area=(0, 0, 800, 800), pages=15, columns=(95, 410, 490), pandas_options={'header': None})

With that parameter, I get the output below for page 15 of the PDF:

We can apply this to all the pages, and of course add some preprocessing to remove unnecessary rows, so that you end up with clean tabular data. I would be happy to help further if this works for you.
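As a rough sketch of that last step, the snippet below runs the same read_pdf call over all pages, keeps only the value and label columns, drops "Total" rows, and writes one CSV per table. Several things here are assumptions rather than facts about this PDF: that the first two extracted columns are the value and the label, that total rows contain a cell reading exactly "Total" or "Totals", and that the same column boundaries fit every page. The function names, the output file naming, and the csvs output folder are made up for illustration.

```python
from pathlib import Path

import pandas as pd


def clean_codebook_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep the value/label columns and drop empty and 'Total' rows."""
    # Assumption: the first two columns hold the value and its label.
    df = raw.iloc[:, :2].copy()
    df.columns = ["value", "label"]
    df = df.dropna(how="all")
    # Assumption: a totals row has a cell that reads "Total" or "Totals".
    text = df.astype(str).apply(lambda col: col.str.strip().str.lower())
    is_total = text.isin(["total", "totals"]).any(axis=1)
    return df[~is_total].reset_index(drop=True)


def save_tables_as_csv(tables, out_dir, prefix="table"):
    """Clean each dataframe and write it to <out_dir>/<prefix>_NNN.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, raw in enumerate(tables, start=1):
        cleaned = clean_codebook_table(raw)
        if cleaned.empty:
            continue  # nothing useful was extracted for this table
        path = out_dir / f"{prefix}_{i:03d}.csv"
        cleaned.to_csv(path, index=False)
        paths.append(path)
    return paths


def extract_codebook_csvs(pdf_path, out_dir):
    """Read every page with fixed column boundaries, then save the CSVs."""
    import tabula  # tabula-py; imported here because it needs Java to run

    tables = tabula.read_pdf(
        pdf_path,
        pages="all",
        area=(0, 0, 800, 800),
        columns=(95, 410, 490),  # boundaries taken from the page-15 call
        pandas_options={"header": None},
    )
    return save_tables_as_csv(tables, out_dir)
```

Calling extract_codebook_csvs(filename, 'csvs') would then write the files. Because the tables are read with header set to None, the table's own header row probably remains in each CSV as a data row, and pages without a table may yield junk rows, so it is worth inspecting a few outputs and tightening the filter before relying on them.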
