concatenate all sheets in the excel file, some of which have a different skiprows criteria

Question

I have an Excel workbook with 8 sheets in it. They all follow the same column header structure. The only difference is, the first sheet starts at row 1, but the rest of the sheets start at row 4.

I am trying to run a command like this, but this is giving me the wrong data - and I recognize that because I wrote sheet_name=None this will give me issues as the sheets start at different rows:

df = pd.concat(pd.read_excel(xlsfile, sheet_name=None, skiprows=4), sort=True)

My next attempt was to:

frames = []
df = pd.read_excel(xlsfile, sheet_name='Questionnaire')
for sheet in TREND_SHEETS:
    tmp = pd.read_excel(xlsfile, sheet_name=sheet, skiprows=4)
    # append tmp dynamically to frames, then use concat frames at the end.. ugly
    df.append(tmp, sort=False)

return df

Note, Questionnaire is the first sheet in the Excel workbook. I know the logic here is off, and I do not want to create dynamic variables holding the 'tmp', appending it to a list, and then concatenating the frames.

How can I go about solving this, so that I achieve a dataframe which incorporates all the sheet data?

Answer 1

What I would do is have a config file, like a python dictionary with the sheetnames as keys, and the values can be the number_of_rows to skip:

EDITED: thanks @parfait for the better solution, it is best to concatenate outside of the for loop as its more memory efficient. What you can do it append the dfs to a list within the for loop, then concatenate outside.

import pandas as pd
sheets = {
    'Sheet1': 1,
    'Sheet2': 4,
    'Sheet3': 4,
    'Sheet4': 4
}

list_df = list()
for k, v in sheets.items():
    tmp = pd.read_excel(xlsfile, sheetname=k, skiprows=v)
    list_df.append(tmp)


final_df = pd.concat(list_df, ignore_index=True)

hope this helps!

Answer 2

Consider a list comprehension to build a list of data frames for concatenating once outside the loop. To borrow @Carson's dictionary approach:

sheets = {'sheet1': 1, 'sheet2': 4, 'sheet3': 4, 'sheet4': 4}

df_list = [pd.read_excel(xlsfile, sheetname=k, skiprows=v) \
              for k,v in sheets.items()]

final_df = pd.concat(df_list, ignore_index=True)

concatenate all sheets in the excel file, some of which have a different skiprows criteria

Question

2 answers

solution1
3 2019-02-09 18:15:35

solution2
3 ACCPTED 2019-02-09 19:52:37

concatenate all sheets in the excel file, some of which have a different skiprows criteria

Question

2 answers

solution1 3 2019-02-09 18:15:35

solution2 3 ACCPTED 2019-02-09 19:52:37

solution1
3 2019-02-09 18:15:35

solution2
3 ACCPTED 2019-02-09 19:52:37