Create a new sheet with the sum of a specific column for each file in a directory of multiple xlsx files

Question

I have many Excel files in a directory, all with the same structure. For example, the data below could be test1.xlsx:

Date      Type     Name      Task       Subtask       Hours
3/20/16   Type1    Name1     TaskXyz    SubtaskXYZ    1.00  
3/20/16   Type1    Name2     TaskXyz    SubtaskXYZ    2.00  
3/20/16   Type1    Name3     TaskXyz    SubtaskXYZ    1.00

What I would like to do is create a new Excel file that lists each file name in the directory together with the sum of its hours, like this:

File Name     Sum of hours
test1.xlsx    4
test2.xlsx    10
...           ...

I just started playing around with glob, which has been helpful for building one large dataframe like this:

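import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    all_data = pd.concat([all_data, df], ignore_index=True)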

This gives me a dataframe of all the data regardless of the sheet it came from, and I have been able to use groupby to analyze the data at a macro level. As far as I know, though, I cannot sum by the sheet each row came from; I can only do things like:

task_output = all_data.groupby(["Task", "Subtask"])["Hours"].agg(["sum", "mean"])

That lets me compute a sum and a mean over the whole dataframe, but not for each individual sheet.

Any ideas on where to start with this?

Answer 1

While reading each file into memory, you should remember the name of the file you are currently processing:

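import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    df['filename'] = f  # tag every row with the file it came from
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    all_data = pd.concat([all_data, df], ignore_index=True)

task_output = all_data.groupby(['filename', 'Task', 'Subtask'])['Hours'].agg(['sum', 'mean'])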
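
If only the per-file total from the question is needed, grouping on the filename column alone should be enough. The following is a minimal sketch, assuming a combined frame shaped like the one the loop above builds; the sample rows and the names per_file and "Sum of hours" are made up for illustration and are not part of the original answer.

```python
import os

import pandas as pd

# Hypothetical stand-in for all_data: rows from two files, each tagged
# with the path it was read from (as df['filename'] = f does above).
all_data = pd.DataFrame({
    "filename": ["path/test1.xlsx"] * 3 + ["path/test2.xlsx"] * 2,
    "Task": ["TaskXyz"] * 5,
    "Hours": [1.0, 2.0, 1.0, 4.0, 6.0],
})

# Keep only the file name, not the directory part of the path.
all_data["filename"] = all_data["filename"].map(os.path.basename)

# One row per file with the total of its Hours column.
per_file = (
    all_data.groupby("filename", as_index=False)["Hours"]
    .sum()
    .rename(columns={"filename": "File Name", "Hours": "Sum of hours"})
)
print(per_file)
```

Calling per_file.to_excel('result.xlsx', index=False) would then write that table to a new workbook, provided an Excel writer engine such as openpyxl is installed.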

Answer 2

I would collect all the data frames into one list and then concatenate them in one go; it should be much faster:

import os
import glob
import pandas as pd

def merge_excel_to_df_add_filename(flist, **kwargs):
    dfs = []
    for f in flist:    
        df = pd.read_excel(f, **kwargs)
        df['file'] = f
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

fmask = os.path.join('/path/to/excel/files', '*.xlsx')
df = merge_excel_to_df_add_filename(glob.glob(fmask),
                                    skiprows=4,
                                    index_col=None,
                                    na_values=['NA'])
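# pass the list of functions directly; a dict of lists on a single column
# raises SpecificationError in current pandas
g = df.groupby('file')['Hours'].agg(['sum', 'mean']).reset_index()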
# rename columns
g.columns = ['File_Name', 'sum of hours', 'average hours']
# write result to Excel file
g.to_excel('result.xlsx', index=False)
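
When nothing but the per-file totals is needed, the combined frame can be skipped and each file summed as it is read. This is only a sketch under that assumption: the function name sum_hours_per_file, its read parameter, and the fake_books data are hypothetical and exist so the example can be tried without real workbooks.

```python
import os

import pandas as pd


def sum_hours_per_file(paths, read=pd.read_excel, **kwargs):
    """Return one row per path: the file name and the total of its Hours column.

    `read` is any callable that turns a path into a DataFrame. It defaults
    to pd.read_excel and is a parameter only so the sketch can be run
    without real Excel files.
    """
    rows = []
    for path in paths:
        df = read(path, **kwargs)
        rows.append({
            "File Name": os.path.basename(path),
            "Sum of hours": df["Hours"].sum(),
        })
    return pd.DataFrame(rows, columns=["File Name", "Sum of hours"])


# Stand-in reader with made-up data, used here instead of real files.
fake_books = {
    "path/test1.xlsx": pd.DataFrame({"Hours": [1.0, 2.0, 1.0]}),
    "path/test2.xlsx": pd.DataFrame({"Hours": [4.0, 6.0]}),
}
summary = sum_hours_per_file(sorted(fake_books), read=lambda p: fake_books[p])
print(summary)
```

With real files, the call would look like sum_hours_per_file(glob.glob(fmask), skiprows=4, na_values=['NA']), and the result can be written out with to_excel as above. The trade-off is that the row-level data is discarded, so further groupbys on Task or Subtask would need the concatenated frame instead.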

Answer 1
1 accepted 2016-03-20 22:58:34

Answer 2
1 2016-03-20 23:10:14
