Create a new sheet with sums of a specific column for each file in a directory of multiple xlsx files

I have many Excel files in a directory, all with the same structure. For example, the data below could be test1.xlsx:

Date      Type     Name      Task       Subtask       Hours
3/20/16   Type1    Name1     TaskXyz    SubtaskXYZ    1.00  
3/20/16   Type1    Name2     TaskXyz    SubtaskXYZ    2.00  
3/20/16   Type1    Name3     TaskXyz    SubtaskXYZ    1.00  

What I would like to do is create a new Excel file that lists each file name in the directory together with the sum of hours in that file, like this:

File Name     Sum of hours
test1.xlsx    4
test2.xlsx    10
...           ...

I just started playing around with glob, which has been helpful for creating one large dataframe like this:

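import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    all_data = pd.concat([all_data, df], ignore_index=True)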

This has been helpful for creating a dataframe of all the data regardless of the sheet it came from, and I have been able to use groupby to analyze the data at a macro level. As far as I know, though, I cannot sum by the sheet that was put into the dataframe; I can only do things like:

task_output = all_data.groupby(["Task", "Subtask"])["Hours"].agg(["sum", "mean"])

That lets me get a sum and a mean over the whole dataframe, but not for each individual sheet.

Any ideas on where to start with this?

While reading each file into memory, you should remember the filename you are currently processing:

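import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("path/*.xlsx"):
    df = pd.read_excel(f, skiprows=4, index_col=None, na_values=['NA'])
    df['filename'] = f  # tag every row with its source file
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    all_data = pd.concat([all_data, df], ignore_index=True)

# including 'filename' in the group keys gives a sum and mean per file
task_output = all_data.groupby(['filename', 'Task', 'Subtask'])['Hours'].agg(['sum', 'mean'])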
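
To get exactly the two-column table asked for in the question, grouping on the filename column alone should be enough. The following is a minimal sketch, not a drop-in solution: it uses small in-memory frames in place of real `read_excel` calls, the `frames_by_file` dictionary is a hypothetical stand-in, and the rows for test2.xlsx are made up so that they total 10 as in the question's expected output.

```python
import pandas as pd

# Hypothetical stand-ins for the frames pd.read_excel would return per file.
frames_by_file = {
    "test1.xlsx": pd.DataFrame({"Task": ["TaskXyz"] * 3, "Hours": [1.0, 2.0, 1.0]}),
    "test2.xlsx": pd.DataFrame({"Task": ["TaskXyz"] * 2, "Hours": [4.0, 6.0]}),
}

tagged = []
for filename, frame in frames_by_file.items():
    frame = frame.copy()
    frame["filename"] = filename  # remember which file each row came from
    tagged.append(frame)

all_data = pd.concat(tagged, ignore_index=True)

# One row per source file with the total of its Hours column.
per_file = all_data.groupby("filename")["Hours"].sum().reset_index()
per_file.columns = ["File Name", "Sum of hours"]
print(per_file)
```

With real files, the loop body would read each workbook with `pd.read_excel` as in the code above, and `per_file.to_excel(...)` would write the summary out.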

I would collect all your data frames into a list and then concatenate them in one shot, which should be much faster:

import os
import glob
import pandas as pd

def merge_excel_to_df_add_filename(flist, **kwargs):
    dfs = []
    for f in flist:    
        df = pd.read_excel(f, **kwargs)
        df['file'] = f
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

fmask = os.path.join('/path/to/excel/files', '*.xlsx')
df = merge_excel_to_df_add_filename(glob.glob(fmask),
                                    skiprows=4,
                                    index_col=None,
                                    na_values=['NA'])
# current pandas rejects a dict of lists here, so pass the function names as a list
g = df.groupby('file')['Hours'].agg(['sum', 'mean']).reset_index()
# rename columns
g.columns = ['File_Name', 'sum of hours', 'average hours']
# write result to Excel file
g.to_excel('result.xlsx', index=False)
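
Because `glob.glob` returns the matched paths including the directory part, the file column above will hold full paths, while the desired output in the question shows bare file names. One possible way to trim them is `os.path.basename`. The sketch below assumes a hypothetical `summary` frame shaped like the grouped result, with invented paths and totals for illustration only.

```python
import os
import pandas as pd

# Hypothetical summary shaped like the grouped result; paths and totals are illustrative.
summary = pd.DataFrame({
    "File_Name": ["/path/to/excel/files/test1.xlsx",
                  "/path/to/excel/files/test2.xlsx"],
    "sum of hours": [4.0, 10.0],
})

# Keep only the final path component of each entry.
summary["File_Name"] = summary["File_Name"].map(os.path.basename)
print(summary)
```

The same mapping could instead be applied when the column is first assigned inside the reading loop, so that the grouping keys are already short names; that only works safely if no two files in different directories share a name.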
