
Concatenate all sheets in an Excel file when some of them need a different skiprows value

I have an Excel workbook with 8 sheets in it. They all follow the same column header structure. The only difference is, the first sheet starts at row 1, but the rest of the sheets start at row 4.

I am trying to run a command like this, but it gives me the wrong data. I recognize that because I passed sheet_name=None, the same skiprows value is applied to every sheet, which is a problem when the sheets start at different rows:

df = pd.concat(pd.read_excel(xlsfile, sheet_name=None, skiprows=4), sort=True)

My next attempt was to:

frames = []
df = pd.read_excel(xlsfile, sheet_name='Questionnaire')
for sheet in TREND_SHEETS:
    tmp = pd.read_excel(xlsfile, sheet_name=sheet, skiprows=4)
    # append tmp dynamically to frames, then use concat frames at the end.. ugly
    df.append(tmp, sort=False)

return df

Note, Questionnaire is the first sheet in the Excel workbook. I know the logic here is off, and I do not want to create dynamic variables holding the 'tmp', appending it to a list, and then concatenating the frames.

How can I go about solving this, so that I achieve a dataframe which incorporates all the sheet data?

What I would do is keep a small config, such as a Python dictionary with the sheet names as keys and the number of rows to skip as values:

EDITED: thanks @parfait for the better solution. It is best to concatenate outside of the for loop, as it's more memory efficient. What you can do is append the dfs to a list within the for loop, then concatenate outside.

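import pandas as pd

# Map each sheet name to the number of rows to skip before its header row.
sheets = {
    'Sheet1': 1,
    'Sheet2': 4,
    'Sheet3': 4,
    'Sheet4': 4
}

list_df = []
for k, v in sheets.items():
    # sheet_name is the current keyword; the deprecated sheetname spelling
    # is not accepted by recent pandas versions.
    tmp = pd.read_excel(xlsfile, sheet_name=k, skiprows=v)
    list_df.append(tmp)

# Concatenate once, outside the loop.
final_df = pd.concat(list_df, ignore_index=True)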

hope this helps!

Consider a list comprehension to build a list of data frames for concatenating once outside the loop. To borrow @Carson's dictionary approach:

sheets = {'sheet1': 1, 'sheet2': 4, 'sheet3': 4, 'sheet4': 4}

df_list = [pd.read_excel(xlsfile, sheet_name=k, skiprows=v)
           for k, v in sheets.items()]

final_df = pd.concat(df_list, ignore_index=True)
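
If listing all eight sheets by hand feels tedious, the same dictionary idea can be combined with pd.ExcelFile, which opens the workbook once and exposes its sheet names. The following is only a rough sketch under some assumptions: the function name concat_sheets, the default_skiprows parameter, and the extra source_sheet column are made up for illustration and are not part of either answer above.

```python
import pandas as pd


def concat_sheets(xlsfile, skiprows_by_sheet, default_skiprows=0):
    """Read every sheet in a workbook and stack them into one DataFrame.

    skiprows_by_sheet maps sheet names to the rows to skip for that sheet;
    any sheet not listed falls back to default_skiprows (hypothetical names).
    """
    frames = []
    # ExcelFile opens the workbook once; parse() then reads one sheet at a time.
    with pd.ExcelFile(xlsfile) as book:
        for name in book.sheet_names:
            skip = skiprows_by_sheet.get(name, default_skiprows)
            frame = book.parse(sheet_name=name, skiprows=skip)
            # Illustrative extra column recording which sheet each row came from.
            frame["source_sheet"] = name
            frames.append(frame)
    # Concatenate once, after all sheets have been read.
    return pd.concat(frames, ignore_index=True, sort=False)
```

For the workbook in the question, a call along the lines of concat_sheets(xlsfile, {'Questionnaire': 0}, default_skiprows=4) would read the first sheet from the top and skip four rows on every other sheet. Whether 4 is the right value depends on where the header row actually sits in those sheets.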

