如何遍历包含多个工作表的多个 Excel 工作簿的文件夹创建单个数据框？

Question

The following function is what I have come up with to iterate through multiple excel files to store the data into a single data frame.以下函数是我想出的，用于遍历多个 excel 文件以将数据存储到单个数据框中。 However, only the data from the final file is being stored in the final data frame.但是，只有来自最终文件的数据被存储在最终数据框中。 What should I do to get the data from all the files to be stored in the same df and then exported to a csv file?我应该怎么做才能将所有文件中的数据存储在同一个 df 中，然后导出到 csv 文件？

def excel_to_df(folder, start_row, end_row, start_col, end_col):
    # loop through all excel files in the folder
    for file in os.listdir(folder):
        exact_file_path = folder +  "\\\\" + file 
        print(exact_file_path)
        # check if file is an excel file
        if file.endswith('xlsx'):
            # create workbook and its worksheets for each file
            wb = openpyxl.load_workbook(exact_file_path)
            ws = wb.worksheets

        # create a list to store the dataframes
        df_list = []

        # iterate over the worksheets
        for worksheet in ws:
            # get the name of the worksheet
            name = worksheet.title
            # create an empty list to store the values
            cell_values = []

            # iterate over the rows and columns in the range

            for row in worksheet.iter_rows(min_row = row_min, max_row = row_max,
                                            min_col = col_min, max_col = col_max):


                                            # append the cell values to the list
                                            cell_values.append([cell.value for cell in row])

                                            # create a dataframe from the cell values and the worksheet name
                                            df = pd.DataFrame(cell_values, columns=range(start_col, end_col+1), index=[name]*len(cell_values))

                                            # append the df to the list
                                            df_list.append(df)


                                            # concatenate the list of dataframes into a single dataframe
                                            df = pd.concat(df_list)
                                            # save the output to a csv file
                                            df.to_csv('test.csv', index=True)

                                            return df

Answer 1

Your immediate problem is that you're creating df_list inside the loop so that each time the loop starts over, it will overwrite whatever was already in it.您的直接问题是您在循环内创建df_list以便每次循环重新开始时，它都会覆盖其中已有的内容。 Additionally, your return is at the end of (and inside) the loop so it doesn't ever get to the second element.此外，您的return位于循环的末尾（和内部），因此它永远不会到达第二个元素。 When it gets to the return it gives you what it has and stops running.当它return时，它会为您提供它所拥有的并停止运行。 You just need to rearrange it, like this:你只需要重新排列它，像这样：

def excel_to_df(folder, start_row, end_row, start_col, end_col):
    
    # create a list to store the dataframes
    df_list = []
    # loop through all excel files in the folder
    for file in os.listdir(folder):
        exact_file_path = folder +  "\\\\" + file 
        print(exact_file_path)
        # check if file is an excel file
        if file.endswith('xlsx'):
            # create workbook and its worksheets for each file
            wb = openpyxl.load_workbook(exact_file_path)
            ws = wb.worksheets
        else:
            # if the file doesn't end with xlsx then don't try to open it as though it is
            next
        # iterate over the worksheets
        for worksheet in ws:
            # get the name of the worksheet
            name = worksheet.title
            # create an empty list to store the values
            cell_values = []

            # iterate over the rows and columns in the range

            for row in worksheet.iter_rows(min_row = row_min, max_row = row_max,
                                            min_col = col_min, max_col = col_max):


                                            # append the cell values to the list
                                            cell_values.append([cell.value for cell in row])

            # create a dataframe from the cell values and the worksheet name
            df = pd.DataFrame(cell_values, columns=range(start_col, end_col+1), index=[name]*len(cell_values))

            # append the df to the list
            df_list.append(df)


    # concatenate the list of dataframes into a single dataframe
    df = pd.concat(df_list)
    # save the output to a csv file
    df.to_csv('test.csv', index=True)

    return df

As an aside, is there a reason you're manually creating a DF instead of just using pd.read_excel ?顺便说一句，您是否有理由手动创建 DF 而不是仅使用pd.read_excel ？ If not, I'd recommend getting rid of your for row loop and just use pd.read_excel(exact_file_path, sheet_name=worksheet.title)如果没有，我建议摆脱你的for row循环，只使用pd.read_excel(exact_file_path, sheet_name=worksheet.title)

如何遍历包含多个工作表的多个 Excel 工作簿的文件夹创建单个数据框？

问题描述

1 个解决方案

解决方案1
1 已采纳 2022-12-22 11:27:15

如何遍历包含多个工作表的多个 Excel 工作簿的文件夹创建单个数据框？

问题描述

1 个解决方案

解决方案1 1 已采纳 2022-12-22 11:27:15

解决方案1
1 已采纳 2022-12-22 11:27:15