How to iterate through a folder with multiple Excel workbooks containing multiple worksheets to create a single data frame?
The following function is what I have come up with to iterate through multiple Excel files and store the data in a single data frame. However, only the data from the final file is being stored in the final data frame. What should I do to get the data from all the files stored in the same df and then exported to a CSV file?
def excel_to_df(folder, start_row, end_row, start_col, end_col):
    # loop through all excel files in the folder
    for file in os.listdir(folder):
        exact_file_path = folder + "\\\\" + file
        print(exact_file_path)
        # check if file is an excel file
        if file.endswith('xlsx'):
            # create workbook and its worksheets for each file
            wb = openpyxl.load_workbook(exact_file_path)
            ws = wb.worksheets
            # create a list to store the dataframes
            df_list = []
            # iterate over the worksheets
            for worksheet in ws:
                # get the name of the worksheet
                name = worksheet.title
                # create an empty list to store the values
                cell_values = []
                # iterate over the rows and columns in the range
                for row in worksheet.iter_rows(min_row = row_min, max_row = row_max,
                                               min_col = col_min, max_col = col_max):
                    # append the cell values to the list
                    cell_values.append([cell.value for cell in row])
                # create a dataframe from the cell values and the worksheet name
                df = pd.DataFrame(cell_values, columns=range(start_col, end_col+1), index=[name]*len(cell_values))
                # append the df to the list
                df_list.append(df)
            # concatenate the list of dataframes into a single dataframe
            df = pd.concat(df_list)
            # save the output to a csv file
            df.to_csv('test.csv', index=True)
            return df
Your immediate problem is that you're creating df_list inside the loop, so each time the loop starts over it replaces the list with an empty one and discards whatever was already in it. Additionally, your return is at the end of (and inside) the loop, so it never gets to the second element: when execution reaches the return, the function gives you what it has and stops running. You just need to rearrange it, like this:
import os

import openpyxl
import pandas as pd


def excel_to_df(folder, start_row, end_row, start_col, end_col):
    # create the list once, before the loop, so it collects every file
    df_list = []
    # loop through all files in the folder
    for file in os.listdir(folder):
        exact_file_path = os.path.join(folder, file)
        print(exact_file_path)
        # check if file is an excel file
        if file.endswith('xlsx'):
            # create workbook and its worksheets for each file
            wb = openpyxl.load_workbook(exact_file_path)
            ws = wb.worksheets
        else:
            # if the file doesn't end with xlsx then don't try to open it;
            # `continue` skips to the next file (a bare `next` does nothing in Python)
            continue
        # iterate over the worksheets
        for worksheet in ws:
            # get the name of the worksheet
            name = worksheet.title
            # create an empty list to store the values
            cell_values = []
            # iterate over the rows and columns in the range
            # (uses the function's own parameters; row_min etc. were never defined)
            for row in worksheet.iter_rows(min_row=start_row, max_row=end_row,
                                           min_col=start_col, max_col=end_col):
                # append the cell values to the list
                cell_values.append([cell.value for cell in row])
            # create a dataframe from the cell values and the worksheet name
            df = pd.DataFrame(cell_values, columns=range(start_col, end_col + 1),
                              index=[name] * len(cell_values))
            # append the df to the list
            df_list.append(df)
    # after the loop: concatenate the list of dataframes into a single dataframe
    df = pd.concat(df_list)
    # save the output to a csv file
    df.to_csv('test.csv', index=True)
    return df
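The two loop mistakes described above are easier to see with Excel taken out of the picture. The following is only a sketch with made-up function names and toy data (none of it comes from the original post): one version resets the accumulator on every pass, one returns from inside the loop, and one does neither.

```python
def collect_buggy(groups):
    for group in groups:
        collected = []            # reset on every pass: earlier groups are lost
        for item in group:
            collected.append(item)
    return collected              # only the last group survives


def collect_early_return(groups):
    collected = []
    for group in groups:
        collected.extend(group)
        return collected          # inside the loop: exits after the first group


def collect_fixed(groups):
    collected = []                # created once, before the loop
    for group in groups:
        collected.extend(group)
    return collected              # after the loop: every group is included


groups = [["a1", "a2"], ["b1"], ["c1", "c2"]]
print(collect_buggy(groups))         # ['c1', 'c2']
print(collect_early_return(groups))  # ['a1', 'a2']
print(collect_fixed(groups))         # ['a1', 'a2', 'b1', 'c1', 'c2']
```

In the Excel function, df_list plays the role of collected and each workbook is one group.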
As an aside, is there a reason you're manually creating a DF instead of just using pd.read_excel? If not, I'd recommend getting rid of your for row loop and just using pd.read_excel(exact_file_path, sheet_name=worksheet.title).
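If you go the pd.read_excel route, you may not need openpyxl's worksheet loop at all: passing sheet_name=None makes read_excel return a dict mapping each sheet name to its own DataFrame. Below is a rough sketch of that approach, not a drop-in replacement. The function name folder_to_df and the added "file" and "sheet" columns are my own choices. It assumes the same 1-based, inclusive row/column range as the original function, and it assumes every sheet actually has the requested columns (pandas may raise an error otherwise). pandas still needs openpyxl installed to read .xlsx files.

```python
from pathlib import Path

import pandas as pd


def folder_to_df(folder, start_row, end_row, start_col, end_col):
    """Stack one cell range from every sheet of every .xlsx file in `folder`.

    Rows and columns are 1-based and inclusive (hypothetical helper).
    """
    frames = []
    for path in sorted(Path(folder).glob("*.xlsx")):
        if path.name.startswith("~$"):
            continue  # skip Excel's temporary lock files
        # sheet_name=None returns a dict of {sheet name: DataFrame}
        sheets = pd.read_excel(
            path,
            sheet_name=None,
            header=None,                    # treat every row as data
            skiprows=start_row - 1,         # rows above the range
            nrows=end_row - start_row + 1,  # rows inside the range
            usecols=list(range(start_col - 1, end_col)),  # 0-based positions
        )
        for sheet_name, frame in sheets.items():
            # record where each row came from
            frame.insert(0, "sheet", sheet_name)
            frame.insert(0, "file", path.name)
            frames.append(frame)
    if not frames:
        return pd.DataFrame()  # pd.concat raises on an empty list
    return pd.concat(frames, ignore_index=True)
```

Calling folder_to_df(folder, 2, 3, 1, 2).to_csv('test.csv', index=False) would then write the stacked range from every sheet to a single CSV. Keeping the source in ordinary columns rather than in the index is a design choice here; it avoids the repeated sheet-name index of the original.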