Create new sheet with sums of specific column for each file in directory of multiple xlsx files
I have many Excel files in a directory, and every file has the same structure. For example, the data below could be test1.xlsx:
Date Type Name Task Subtask Hours
3/20/16 Type1 Name1 TaskXyz SubtaskXYZ 1.00
3/20/16 Type1 Name2 TaskXyz SubtaskXYZ 2.00
3/20/16 Type1 Name3 TaskXyz SubtaskXYZ 1.00
What I would like to do is create a new Excel file that lists each file name in the directory next to the sum of hours in that file, like this:
File Name Sum of hours
Test1.xlsx 4
test2.xlsx 10
... ...
I just started playing around with glob, and it has been helpful for building one large dataframe like this:
all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    all_data = all_data.append(df, ignore_index=True)
This gives me a single dataframe of all the data, regardless of which sheet each row came from, and I have been able to use groupby to analyze the data at a macro level. As far as I know, though, I cannot sum by the sheet that was read into the dataframe, only do things like:
task_output = all_data.groupby(["Task","Subtask"])["Hours"].agg([np.sum,np.mean])
That gives me the sum and mean over the whole dataframe, not for each individual sheet.
Any ideas on where to start with this?
While you are reading each file into memory, you should remember the name of the file you are currently processing:
all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    df['filename'] = f  # tag every row with the file it came from
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    all_data = pd.concat([all_data, df], ignore_index=True)

task_output = all_data.groupby(['filename', 'Task', 'Subtask'])['Hours'].agg(['sum', 'mean'])
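To get the two-column summary from the question, the same filename tag can be grouped on its own. The following is a minimal sketch rather than a drop-in solution: the small in-memory frames are made-up stand-ins for what pd.read_excel would return, with values chosen to match the totals in the question's example.

```python
import pandas as pd

# Stand-ins for the frames pd.read_excel would return (made-up data).
frames = {
    "test1.xlsx": pd.DataFrame({"Task": ["TaskXyz"] * 3, "Hours": [1.0, 2.0, 1.0]}),
    "test2.xlsx": pd.DataFrame({"Task": ["TaskXyz"] * 2, "Hours": [4.0, 6.0]}),
}

# Tag each frame with its source file before combining.
tagged = []
for filename, df in frames.items():
    df = df.copy()
    df["filename"] = filename
    tagged.append(df)

all_data = pd.concat(tagged, ignore_index=True)

# One row per file: the file name and the total of its Hours column.
per_file = (
    all_data.groupby("filename")["Hours"]
    .sum()
    .reset_index(name="Sum of hours")
    .rename(columns={"filename": "File Name"})
)
print(per_file)
```

Grouping by 'filename' alone collapses all tasks and subtasks of a file into a single total, which is what the requested report needs.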
I would collect all your data frames into one list and then concatenate them in one shot, which should be much faster:
import os
import glob
import pandas as pd

def merge_excel_to_df_add_filename(flist, **kwargs):
    # read each workbook, tag its rows with the source file, collect the frames
    dfs = []
    for f in flist:
        df = pd.read_excel(f, **kwargs)
        df['file'] = f
        dfs.append(df)
    # concatenate once at the end instead of growing a frame inside the loop
    return pd.concat(dfs, ignore_index=True)

fmask = os.path.join('/path/to/excel/files', '*.xlsx')
df = merge_excel_to_df_add_filename(glob.glob(fmask),
                                    skiprows=4,
                                    index_col=None,
                                    na_values=['NA'])

# per-file sum and mean of Hours; a plain list of function names is used because
# newer pandas versions reject the nested-dict form agg({'Hours': [...]})
g = df.groupby('file')['Hours'].agg(['sum', 'mean']).reset_index()
# rename columns
g.columns = ['File_Name', 'sum of hours', 'average hours']
# write result to Excel file
g.to_excel('result.xlsx', index=False)
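If only the per-file totals are needed, the concatenation step can be skipped altogether. The sketch below is one possible variation, not part of the answer above: sum_hours_per_file and its reader parameter are hypothetical names introduced here, and os.path.basename is used so the report shows test1.xlsx rather than the full path returned by glob.

```python
import os
import pandas as pd

def sum_hours_per_file(paths, reader=pd.read_excel, **kwargs):
    """Return one row per file: its base name and the total of its Hours column.

    `reader` is a hypothetical hook (defaults to pd.read_excel) so the
    function can be tried out without real workbooks on disk.
    """
    rows = []
    for path in paths:
        df = reader(path, **kwargs)
        rows.append({
            "File Name": os.path.basename(path),  # drop the directory part
            "Sum of hours": df["Hours"].sum(),    # NaN values are skipped by default
        })
    return pd.DataFrame(rows, columns=["File Name", "Sum of hours"])
```

With real files this could be called as sum_hours_per_file(glob.glob(fmask), skiprows=4, na_values=['NA']) and the result written out with to_excel('result.xlsx', index=False). Only one workbook is held in memory at a time, at the cost of losing the combined frame for other analysis.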