Pandas groupby: aggregate two columns and return the earliest Start Date for one column

Question

I am trying to group a CSV file in Pandas by one column (ID) in order to get the earliest Start Date and the latest End Date. Then I am trying to group by multiple columns in order to get the SUM of a value. For each ID in the second grouped dataframe, I want to present the dates.

I am loading a csv in order to group and aggregate data.

01) First I load the csv

def get_csv():
        #Read csv file
        df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",parse_dates=['Start Date', 'End Date'])

        return df

02) Group and aggregate the data for the columns (ID and Site)

def do_stuff():
     df = get_csv()   
     groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})

which works as expected and I am getting the following (example):

03) Ideally, for the same ID I want to present the earliest date in the Start Date column and the latest one in the End Date column. The aggregation of the value works perfectly. What I want to get is the following:

I do not know how to change my current code above. I have tried this so far:

def do_stuff():
    df = get_csv()
    md = get_csv()

    minStart = md[md['A or B'].str.contains('AAAA')].groupby([md['ID']]).agg({'Start Date': 'min'})

    df['earliestStartDate'] = minStart

    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})

which fails. I also tried changing the above to:

def do_stuff():
    df = get_csv()
    md = get_csv()

    df['earliestStartDate'] = md.loc[ md['ID'] == df['ID'], 'Start Date'].min()

    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})

Ideally, I would just change something in groupedBy instead of having to read the csv twice and aggregate the data twice. Is that possible? If not, what can I change to make the script work? I am trying random things to get more experience with Pandas and Python.

I am guessing I have to create two dataframes here: one to get the grouped data for all the columns needed (and the SUM of the Value), and a second one to get the earliest Start Date and latest End Date for each ID. Then I need to find a way to concatenate the two dataframes. Is that a good approach, or do you think that there is an easier way to achieve that?

UPD: My code where I have created two dataframes (not sure whether this is the right solution) is given below:

#Read csv file
df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
md = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])


#Calculate the Clean Value
df['Clean Cost'] = (df['Value'] - df['Value2']) #.apply(lambda x: round(x,0))

#Get the min/max Dates
minMaxDates = md[md['Random'].str.contains('Y')].groupby([md['ID']]).agg({'Start Date': 'min', 'End Date': 'max'})

#Group by and aggregate (return Earliest Start Date, Latest End Date and SUM of the Values)
groupedBy = df[df['Random'].str.contains('Y')].groupby([df['ID'], df['Site'].fillna('Other')]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum', 'Value2': 'sum', 'Clean Cost': 'sum'})

and if I print the two dataframes, I am getting the following:

and

If I print the df.head(), I am getting the following:

      ID A or B Start Date   End Date  Value  Site  Value2 Random
0  45221   AAAA 2017-12-30 2017-09-30     14  S111       7      Y
1  45221   AAAA 2017-01-15 2017-09-30     15  S222       7      Y
2  85293   BBBB 2017-05-12 2017-07-24     29  S111       3      Y
3  85293   AAAA 2017-03-22 2017-10-14     32  S222       4      Y
4  45221   AAAA 2017-01-15 2017-09-30     30  S222       7      Y

A link to the file is given here: LINK

Answer 1

I think you need transform:

df = pd.read_csv('sampleBionic.csv')
print (df)
      ID A or B  Start Date    End Date  Value  Site  Value2 Random
0  45221   AAAA  12/30/2017  09/30/2017     14  S111       7      Y
1  45221   AAAA  01/15/2017  09/30/2017     15  S222       7      Y
2  85293   BBBB  05/12/2017  07/24/2017     29  S111       3      Y
3  85293   AAAA  03/22/2017  10/14/2017     32  S222       4      Y
4  45221   AAAA  01/15/2017  09/30/2017     30  S222       7      Y

groupedBy = (df[df['A or B'].str.contains('AAAA')]
                            .groupby([df['ID'], df['Site'].fillna('Other'),])
                            .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'}))
print (groupedBy)    
            Start Date    End Date  Value
ID    Site                               
45221 S111  12/30/2017  09/30/2017     14
      S222  01/15/2017  09/30/2017     45
85293 S222  03/22/2017  10/14/2017     32

g = groupedBy.groupby(level=0)              
groupedBy['Start Date'] = g['Start Date'].transform('min') 
groupedBy['End Date'] = g['End Date'].transform('max')
print (groupedBy)
            Start Date    End Date  Value
ID    Site                               
45221 S111  01/15/2017  09/30/2017     14
      S222  01/15/2017  09/30/2017     45
85293 S222  03/22/2017  10/14/2017     32
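One caveat, offered as a hedged aside: the sample above reads the dates as plain strings, so 'min' and 'max' compare text. In MM/DD/YYYY form that gives the chronological answer only while all dates fall in the same year. A minimal sketch of the same transform approach with parsed dates follows; the five sample rows are inlined through io.StringIO (an assumption made so the example runs without the CSV file), and the names filtered, grouped and by_id are illustrative.

```python
import io

import pandas as pd

# Same five sample rows as above, inlined so the sketch runs without a file.
CSV = """ID,A or B,Start Date,End Date,Value,Site,Value2,Random
45221,AAAA,12/30/2017,09/30/2017,14,S111,7,Y
45221,AAAA,01/15/2017,09/30/2017,15,S222,7,Y
85293,BBBB,05/12/2017,07/24/2017,29,S111,3,Y
85293,AAAA,03/22/2017,10/14/2017,32,S222,4,Y
45221,AAAA,01/15/2017,09/30/2017,30,S222,7,Y
"""

# parse_dates turns both columns into datetimes, so min/max are chronological.
df = pd.read_csv(io.StringIO(CSV), parse_dates=['Start Date', 'End Date'])

# Filter first, then group by column names.
filtered = df[df['A or B'].str.contains('AAAA')].copy()
filtered['Site'] = filtered['Site'].fillna('Other')

grouped = filtered.groupby(['ID', 'Site']).agg(
    {'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})

# Broadcast the per-ID earliest start and latest end to every (ID, Site) row.
by_id = grouped.groupby(level='ID')
grouped['Start Date'] = by_id['Start Date'].transform('min')
grouped['End Date'] = by_id['End Date'].transform('max')

print(grouped)
```

Filtering into a separate frame before grouping also avoids passing Series from the unfiltered df as group keys, which relies on index alignment.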

Answer 2

I have managed to create a script that does what I want. I will paste the answer in case somebody needs it in the future. Jezrael's answer worked fine too. So, considering that the original csv is like this:

my script is:

import pandas as pd


def get_csv():
    #Read csv file and keep only the rows where 'A or B' contains 'AAAA'
    df = pd.read_csv('myFile.csv', encoding="ISO-8859-1", parse_dates=['Start Date', 'End Date'])
    df = df[df['A or B'].str.contains('AAAA')]

    return df

def do_stuff():
    df = get_csv()

    #Get the min Start Date, max End date, sum of the Value and Value2 and calculate the Net Cost
    varA      = 'ID'
    dfGrouped = df.groupby(varA, as_index=False).agg({'Start Date': 'min', 'End Date': 'max'}).copy()

    varsToKeep = ['ID', 'Site', 'Random', 'Start Date_grp', 'End Date_grp', 'Value', 'Value2']
    dfTemp = pd.merge(df, dfGrouped, how='inner', on='ID', suffixes=(' ', '_grp'), copy=True)[varsToKeep]

    dfBreakDown = dfTemp.groupby(['ID', 'Site', 'Random', 'Start Date_grp',
        'End Date_grp']).sum()

    #Calculate the Net Cost
    dfTemp['Net Cost'] = (dfTemp['Value'] - dfTemp['Value2'])

    groupedBy = dfTemp.groupby(['ID', 'Site', 'Random']).agg({'Start Date_grp': 'min', 'End Date_grp': 'max', 'Value': 'sum', 'Value2': 'sum', 'Net Cost': 'sum'})

    csvoutput(groupedBy)

def csvoutput(df):
    #Csv output (to_csv defaults: comma separator, header and index written)
    df.to_csv('OUT.csv')

if __name__ == "__main__":
        #  start things here
        do_stuff()
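As a possible simplification, and only a sketch rather than a replacement for the script above: the per-ID dates can be attached to each row with groupby(...).transform, which avoids the separate aggregate-and-merge step. The sample rows are inlined through io.StringIO (an assumption so the example runs without myFile.csv), and per_id and summary are illustrative names. The '_grp' column names follow the script above.

```python
import io

import pandas as pd

# Sample rows from the question, inlined so the sketch runs without a file.
CSV = """ID,A or B,Start Date,End Date,Value,Site,Value2,Random
45221,AAAA,12/30/2017,09/30/2017,14,S111,7,Y
45221,AAAA,01/15/2017,09/30/2017,15,S222,7,Y
85293,BBBB,05/12/2017,07/24/2017,29,S111,3,Y
85293,AAAA,03/22/2017,10/14/2017,32,S222,4,Y
45221,AAAA,01/15/2017,09/30/2017,30,S222,7,Y
"""

df = pd.read_csv(io.StringIO(CSV), parse_dates=['Start Date', 'End Date'])
df = df[df['A or B'].str.contains('AAAA')].copy()
df['Site'] = df['Site'].fillna('Other')
df['Net Cost'] = df['Value'] - df['Value2']

# Earliest start and latest end per ID, broadcast back onto every row.
per_id = df.groupby('ID')
df['Start Date_grp'] = per_id['Start Date'].transform('min')
df['End Date_grp'] = per_id['End Date'].transform('max')

# One groupby then yields the per-ID dates plus the per-(ID, Site) sums.
summary = df.groupby(['ID', 'Site', 'Random']).agg({
    'Start Date_grp': 'min',
    'End Date_grp': 'max',
    'Value': 'sum',
    'Value2': 'sum',
    'Net Cost': 'sum',
})

print(summary)
```

Because the '_grp' columns are constant within each ID, taking 'min' or 'max' of them in the final aggregation simply carries the per-ID value through.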


Answer 1: score 2, accepted, 2017-11-02 10:00:06

Answer 2: score 0, 2017-11-02 12:09:08
