
Pandas Dataframe Groupby: Determine Values in 1 Group vs. Another Group

I have a dataframe as follows:

Date        ID
2014-12-31  1
2014-12-31  2
2014-12-31  3
2014-12-31  4
2014-12-31  5
2014-12-31  6
2014-12-31  7
2015-01-01  1
2015-01-01  2
2015-01-01  3
2015-01-01  4
2015-01-01  5
2015-01-02  1
2015-01-02  3
2015-01-02  7
2015-01-02  9
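For reference, the sample frame above can be rebuilt with pandas like this (a minimal sketch; the dates are parsed as datetimes):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2014-12-31"] * 7 + ["2015-01-01"] * 5 + ["2015-01-02"] * 4
    ),
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 1, 3, 7, 9],
})
print(df.head())
```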

What I would like to do is determine the ID(s) that appear on one date but not on another date.

Example 1: The result df would be the IDs exclusive to 2014-12-31 vs. the IDs in 2015-01-01, and the IDs exclusive to 2015-01-01 vs. the IDs in 2015-01-02:

   2015-01-01  6 
   2015-01-01  7
   2015-01-02  2
   2015-01-02  4
   2015-01-02  5

I would like to 'choose' how many days 'back' I compare. For instance, I could set a variable daysback=1 and each day would be compared to the previous day, or daysback=2 and each day would be compared to two days earlier, and so on.

Outside of df.groupby('Date'), I'm not sure where to go with this. Possibly use of diff()?

I'm assuming that the "Date" in your DataFrame is: 1) a date object and 2) not the index.

If those assumptions are wrong, then that changes things a bit.
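If your Date column arrived as strings rather than date objects, one way to match those assumptions is to parse it first (a sketch with hypothetical toy data, not the question's frame):

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    "Date": ["2015-01-01", "2015-01-01", "2015-01-02"],
    "ID": [1, 2, 1],
})

# Parse the strings, then keep plain datetime.date objects so that
# comparisons against datetime.date values behave as assumed below
df["Date"] = pd.to_datetime(df["Date"]).dt.date

print(df["Date"].iloc[0] == datetime.date(2015, 1, 1))  # True
```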

import datetime
from datetime import timedelta

def find_unique_ids(df, date, daysback=1):

    date_new = date
    date_old = date - timedelta(days=daysback)

    ids_new = df[df['Date'] == date_new]['ID']
    ids_old = df[df['Date'] == date_old]['ID']

    # Keep the rows for date_new whose IDs do not appear on date_old
    return df.loc[ids_new[~ids_new.isin(ids_old)].index]

date = datetime.date(2015, 1, 2)
daysback = 1

print(find_unique_ids(df, date, daysback))

Running that produces the following output:

          Date  ID
14  2015-01-02   7
15  2015-01-02   9

If Date is your index field instead, then you need to modify the lookups in the function (df.ix has since been removed from pandas, so df.loc is used here) and return the filtered Series directly:

ids_new = df.loc[date_new, 'ID']
ids_old = df.loc[date_old, 'ID']

return ids_new[~ids_new.isin(ids_old)]

Output:

Date
2015-01-02    7
2015-01-02    9
Name: ID, dtype: int64

EDIT:

This is kind of dirty, but it should accomplish what you want. I added inline comments that explain what is going on. There are probably cleaner and more efficient ways to go about this if it's something you're going to run regularly or across massive amounts of data.

import pandas as pd
from datetime import timedelta

def find_unique_ids(df, daysback):

    # We need Date and ID to both be either fields or index fields -- no mix/match.
    df = df.reset_index()

    # Calculate DateComp by adding our daysback value as a timedelta
    df['DateComp'] = df['Date'].apply(lambda dc: dc + timedelta(days=daysback))

    # Join df back onto itself, SQL-style LEFT OUTER.
    df2 = pd.merge(df, df, left_on=['DateComp', 'ID'], right_on=['Date', 'ID'], how='left')

    # Boolean series of IDs that are missing from the right table
    missing_ids = df2['Date_y'].isnull()

    # Boolean series of valid DateComp values.
    # DateComp is the "future" date that we're comparing against. Without this
    # step, all records on the last Date value would be flagged as unique IDs.
    valid_dates = df2['DateComp_x'].isin(df['Date'].unique())

    # Use those to find missing IDs on valid dates. Create a new output DataFrame.
    output = df2[valid_dates & missing_ids][['DateComp_x', 'ID']]

    # Rename columns of output and return
    output.columns = ['Date', 'ID']
    return output

Test output:

         Date  ID
5  2015-01-01   6
6  2015-01-01   7
8  2015-01-02   2
10 2015-01-02   4
11 2015-01-02   5
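On the "probably cleaner ways" note: one possible variant of the same self-merge uses merge's `indicator=True` to flag the non-matching rows directly (a sketch under the same assumptions, not the answer's original code; the function name is made up here):

```python
import pandas as pd
from datetime import timedelta

def find_unique_ids_indicator(df, daysback=1):
    # Shift each row's Date forward to the date it will be compared against
    left = df.assign(DateComp=df["Date"] + timedelta(days=daysback))
    merged = pd.merge(
        left, df,
        left_on=["DateComp", "ID"], right_on=["Date", "ID"],
        how="left", indicator=True,
    )
    # "left_only" marks IDs present daysback days earlier but missing on DateComp;
    # restrict to DateComp values that actually occur in the data
    out = merged[
        (merged["_merge"] == "left_only")
        & merged["DateComp"].isin(df["Date"].unique())
    ][["DateComp", "ID"]]
    out.columns = ["Date", "ID"]
    return out.reset_index(drop=True)

# Same sample data as the question
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2014-12-31"] * 7 + ["2015-01-01"] * 5 + ["2015-01-02"] * 4
    ),
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 1, 3, 7, 9],
})
print(find_unique_ids_indicator(df, daysback=1))
```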

EDIT:

missing_ids = df2[df2['Date_y'].isnull()]  # gives the whole necessary DataFrame

Another way, by applying list as the aggregation:

df
Out[146]: 
          Date  Unnamed: 2
0   2014-12-31           1
1   2014-12-31           2
2   2014-12-31           3
3   2014-12-31           4
4   2014-12-31           5
5   2014-12-31           6
6   2014-12-31           7
7   2015-01-01           1
8   2015-01-01           2
9   2015-01-01           3
10  2015-01-01           4
11  2015-01-01           5
12  2015-01-02           1
13  2015-01-02           3
14  2015-01-02           7
15  2015-01-02           9

abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)

abbs
Out[142]: 
Date
2014-12-31    [1, 2, 3, 4, 5, 6, 7]
2015-01-01          [1, 2, 3, 4, 5]
2015-01-02             [1, 3, 7, 9]
Name: Unnamed: 2, dtype: object

abbs.loc['2015-01-01']
Out[143]: [1, 2, 3, 4, 5]

list(set(abbs.loc['2014-12-31']) - set(abbs.loc['2015-01-01']))
Out[145]: [6, 7]

As a function:

def uid(df, date1, date2):
    abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
    return list(set(abbs.loc[date1]) - set(abbs.loc[date2]))


uid(df,'2015-01-01','2015-01-02')
Out[162]: [2, 4, 5]

You could write a function and use date instead of str :)
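Following that suggestion, here is a sketch that accepts a datetime.date plus a daysback offset instead of two strings (it assumes the ID column is named 'ID' as in the question, rather than 'Unnamed: 2' as above; the function name is made up):

```python
import pandas as pd
from datetime import date, timedelta

def uid_by_date(df, when, daysback=1):
    groups = df.groupby("Date")["ID"].apply(list)
    # Timestamps index the groupby result reliably when Date is datetime64
    new = pd.Timestamp(when)
    old = pd.Timestamp(when - timedelta(days=daysback))
    # IDs seen daysback days before `when` that no longer appear on `when`
    return sorted(set(groups.loc[old]) - set(groups.loc[new]))

# Same sample data as the question
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2014-12-31"] * 7 + ["2015-01-01"] * 5 + ["2015-01-02"] * 4
    ),
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 1, 3, 7, 9],
})
print(uid_by_date(df, date(2015, 1, 2)))  # [2, 4, 5]
```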
