
Create a new column based on Grouping of similar values in another column in pandas

Hi, I have an event data frame with datetimes, event ids and sensor ids. I would like to group events that happen within one hour per sensor and, if possible, tag them with the group count.

Original Data Frame:

         sensor_id  event_id      time
    0    A         e1            2017-02-14 05:30:00      
    1    A         e2            2017-02-14 05:45:00 
    2    A         e3            2017-02-14 08:30:00 
    3    B         e3            2017-02-14 05:20:00 
    4    B         e4            2017-02-14 05:30:00 
    5    B         e6            2017-02-14 05:45:00 
    6    C         e1            2017-02-14 05:30:00 
    7    C         e3            2017-02-14 07:30:00 
    8    C         e7            2017-02-14 09:35:00 

Desired Result:

         sensor_id  event_id      time                  group 
    0    A         e1            2017-02-14 05:30:00      1
    1    A         e2            2017-02-14 05:45:00      1
    2    A         e3            2017-02-14 08:30:00      2
    3    B         e3            2017-02-14 05:20:00      1
    4    B         e4            2017-02-14 05:30:00      1
    5    B         e6            2017-02-14 05:45:00      1
    6    C         e1            2017-02-14 05:30:00      1
    7    C         e3            2017-02-14 07:30:00      2
    8    C         e7            2017-02-14 09:35:00      3

I understand that I should group by sensor and event, and then by time using a timedelta of 1 hour, but I have no clue how to do the rest. Any tips will be appreciated.

I think you need a double groupby. This assumes the frame is sorted by sensor_id, because the result is assigned back by position; if it is not sorted, sort it first. I.e.:

df['new'] = df.groupby('sensor_id').apply(lambda x: x.groupby(x['time'].dt.hour).ngroup() + 1).values

Output:

  sensor_id event_id                time  new
0         A       e1 2017-02-14 05:30:00    1
1         A       e2 2017-02-14 05:45:00    1
2         A       e3 2017-02-14 08:30:00    2
3         B       e3 2017-02-14 05:20:00    1
4         B       e4 2017-02-14 05:30:00    1
5         B       e6 2017-02-14 05:45:00    1
6         C       e1 2017-02-14 05:30:00    1
7         C       e3 2017-02-14 07:30:00    2
8         C       e7 2017-02-14 09:35:00    3
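
One thing to watch with `dt.hour`: it keeps only the hour of day, so 05:30 on two different dates would end up with the same label. A minimal sketch of a variant that keeps the date, assuming clock-hour bins are what is wanted, is to floor each timestamp to the hour and dense-rank the bins per sensor. The `hour_bin` name is only illustrative, and the frame is rebuilt from the question so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sensor_id": list("AAABBBCCC"),
    "event_id": ["e1", "e2", "e3", "e3", "e4", "e6", "e1", "e3", "e7"],
    "time": pd.to_datetime([
        "2017-02-14 05:30:00", "2017-02-14 05:45:00", "2017-02-14 08:30:00",
        "2017-02-14 05:20:00", "2017-02-14 05:30:00", "2017-02-14 05:45:00",
        "2017-02-14 05:30:00", "2017-02-14 07:30:00", "2017-02-14 09:35:00",
    ]),
})

# Floor each timestamp to the start of its clock hour. The date is kept,
# so the same hour on two different days gives two different bins.
hour_bin = df["time"].dt.floor("h")

# Dense-rank the bins within each sensor: 1 for the earliest bin, 2 for the next, ...
df["new"] = hour_bin.groupby(df["sensor_id"]).rank(method="dense").astype(int)
print(df)
```

Because the result is aligned on the index rather than assigned through `.values`, this version should not depend on the rows being sorted by sensor_id.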

You can use pd.Grouper (called pd.TimeGrouper in older pandas versions, and since removed) + ngroup to group by time frequency.

df['time'] = pd.to_datetime(df.time)
# pd.Grouper stands in for pd.TimeGrouper, which newer pandas versions no longer provide
df['group'] = df.set_index('time').groupby(['sensor_id',
                    pd.Grouper(freq='1h')], sort=False).ngroup().values

So far, every (sensor_id, hour) combination has its own label, which is most of what we want. The numbering runs on across sensors, though, so we need to restart the group value for each sensor_id, and another groupby call is in order.

# transform keeps the original index, so the result assigns straight back to the frame
df['group'] = df.groupby('sensor_id')['group'].transform(lambda x: x - x.min() + 1)

df

  sensor_id event_id                time  group
0         A       e1 2017-02-14 05:30:00      1
1         A       e2 2017-02-14 05:45:00      1
2         A       e3 2017-02-14 08:30:00      2
3         B       e3 2017-02-14 05:20:00      1
4         B       e4 2017-02-14 05:30:00      1
5         B       e6 2017-02-14 05:45:00      1
6         C       e1 2017-02-14 05:30:00      1
7         C       e3 2017-02-14 07:30:00      2
8         C       e7 2017-02-14 09:35:00      3
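
Both answers above bin events by clock hour, so events at 05:50 and 06:10 would land in different groups even though they are 20 minutes apart. If "within one hour" is instead read as "no more than one hour after the previous event of the same sensor", a gap-based grouping is one alternative. The sketch below is a different technique from the ones in the answers; the one-hour threshold and the helper names `gap` and `new_group` are assumptions made for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sensor_id": list("AAABBBCCC"),
    "event_id": ["e1", "e2", "e3", "e3", "e4", "e6", "e1", "e3", "e7"],
    "time": pd.to_datetime([
        "2017-02-14 05:30:00", "2017-02-14 05:45:00", "2017-02-14 08:30:00",
        "2017-02-14 05:20:00", "2017-02-14 05:30:00", "2017-02-14 05:45:00",
        "2017-02-14 05:30:00", "2017-02-14 07:30:00", "2017-02-14 09:35:00",
    ]),
})

# Gaps are only meaningful in chronological order within each sensor.
df = df.sort_values(["sensor_id", "time"])

# Time elapsed since the previous event of the same sensor (NaT for the first one).
gap = df.groupby("sensor_id")["time"].diff()

# A new group starts at a sensor's first event or after a gap longer than one hour.
new_group = gap.isna() | (gap > pd.Timedelta(hours=1))

# A running count of group starts per sensor gives the 1-based group label.
df["group"] = new_group.astype(int).groupby(df["sensor_id"]).cumsum()
print(df)
```

On the sample data this gives the same labels as the desired result, since no two events of the same sensor straddle an hour boundary while being less than an hour apart.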
