Pandas DataFrame combine rows by column value, where Date Rows are NULL

Question

Scenerio: Parse the PDF Bank statement and transform into clean and formatted csv file.

What I've tried: I manage to parse the pdf file(tabular format) using camelot library but failed to produce the desired result in sense of formatting.

Code:

import camelot
import pandas as pd

tables = camelot.read_pdf('test.pdf', pages = '3')

for i, table in enumerate(tables):
    print(f'table_id:{i}')
    print(f'page:{table.page}')
    print(f'coordinates:{table._bbox}')

tables = camelot.read_pdf('test.pdf', flavor='stream', pages = '3')

columns = df.iloc[0]

df.columns = columns
df = df.drop(0)
df.head()

for c in df.select_dtypes('object').columns:
    df[c] = df[c].str.replace('$', '')
    df[c] = df[c].str.replace('-', '')

def convert_to_float(num):
    try:
        return float(num.replace(',',''))
    except:
        return 0

for col in ['Deposits', 'Withdrawals', 'Balance']:
    df[col] = df[col].map(convert_to_float)

My_Result:

Desired_Output:

The logic I came up with is to move those rows up i guess n-1 if date column is NaN i don't know if this logic is right or not.Can anyone help me to sort out this properly?

I tried pandas groupby and aggregation functions but it only merging the whole data and removing NaN and duplicate dates which is not suitable because every entry is necessary.

Answer 1

Using Transform -

df.loc[~df.Date.isna(), 'group'] = 1
g = df.group.fillna(0).cumsum()
df['Description'] = df.groupby(g)['Description'].transform(' '.join)
new_df =  df.loc[~df['Date'].isna()]

Pandas DataFrame combine rows by column value, where Date Rows are NULL

Question

1 answers

solution1
1 2021-05-01 16:37:17

Pandas DataFrame combine rows by column value, where Date Rows are NULL

Question

1 answers

solution1 1 2021-05-01 16:37:17

solution1
1 2021-05-01 16:37:17