Pandas assign group numbers for each time bin

I have a pandas DataFrame that looks like this:

Key     Name    Val1    Val2    Timestamp
101     A       10      1       01-10-2019 00:20:21
102     A       12      2       01-10-2019 00:20:21
103     B       10      1       01-10-2019 00:20:26
104     C       20      2       01-10-2019 14:40:45
105     B       21      3       02-10-2019 09:04:06
106     D       24      3       02-10-2019 09:04:12
107     A       24      3       02-10-2019 09:04:14
108     E       32      2       02-10-2019 09:04:20
109     A       10      1       02-10-2019 09:04:22
110     B       10      1       02-10-2019 10:40:49

Starting from the earliest timestamp, that is, '01-10-2019 00:20:21', I need to create time bins of 10 seconds each and assign the same group number to all rows whose timestamp falls in the same time bin. The output should look like this:

Key     Name    Val1    Val2    Timestamp               Group
101     A       10      1       01-10-2019 00:20:21     1
102     A       12      2       01-10-2019 00:20:21     1
103     B       10      1       01-10-2019 00:20:26     1
104     C       20      2       01-10-2019 14:40:45     2
105     B       21      3       02-10-2019 09:04:06     3
106     D       24      3       02-10-2019 09:04:12     4
107     A       24      3       02-10-2019 09:04:14     4
108     E       32      2       02-10-2019 09:04:20     4
109     A       10      1       02-10-2019 09:04:22     5
110     B       10      1       02-10-2019 10:40:49     6

The time bins are:

First time bin: '01-10-2019 00:20:21' to '01-10-2019 00:20:30'
Next time bin: '01-10-2019 00:20:31' to '01-10-2019 00:20:40'
Next time bin: '01-10-2019 00:20:41' to '01-10-2019 00:20:50'
Next time bin: '01-10-2019 00:20:51' to '01-10-2019 00:21:00'
Next time bin: '01-10-2019 00:21:01' to '01-10-2019 00:21:10'
and so on.

'Group' is assigned to each row based on these time bins. Group numbers do not have to be consecutive: if no row falls in a time bin, it is fine to skip that group number.

I have generated this using a for loop, but it takes a lot of time when the data is spread across months. Please let me know if this can be done as a pandas operation in a single line of code. Thanks.

Here is an example without a loop. The main approach is to round the seconds down to the start of a fixed range and then use ngroup().

02-10-2019 09:04:12 -> 02-10-2019 09:04:11
02-10-2019 09:04:14 -> 02-10-2019 09:04:11
02-10-2019 09:04:20 -> 02-10-2019 09:04:11
02-10-2019 09:04:21 -> 02-10-2019 09:04:21
02-10-2019 09:04:25 -> 02-10-2019 09:04:21
...

I use a new temporary column to hold the start of the range each row falls into.

import pandas as pd

df = pd.DataFrame.from_dict({
    'Name': ('A', 'A', 'B', 'C', 'B', 'D', 'A', 'E', 'A', 'B'),
    'Val1': (1, 2, 1, 2, 3, 3, 3, 2, 1, 1),
    'Timestamp': (
        '2019-01-10 00:20:21',
        '2019-01-10 00:20:21',
        '2019-01-10 00:20:26',
        '2019-01-10 14:40:45',
        '2019-02-10 09:04:06',
        '2019-02-10 09:04:12',
        '2019-02-10 09:04:14',
        '2019-02-10 09:04:20',
        '2019-02-10 09:04:22',
        '2019-02-10 10:40:49',
    )
})
# convert str to Timestamp
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

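# Your specific ranges; customize as needed.
# Note: these ranges are fixed within each minute (not anchored to the
# earliest timestamp), and second 0 is grouped with seconds 1-10.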
def sec_to_group(x):
    if 0 <= x.second <= 10:
        x = x.replace(second=0)
    elif 11 <= x.second <= 20:
        x = x.replace(second=11)
    elif 21 <= x.second <= 30:
        x = x.replace(second=21)
    elif 31 <= x.second <= 40:
        x = x.replace(second=31)
    elif 41 <= x.second <= 50:
        x = x.replace(second=41)
    elif 51 <= x.second <= 59:
        x = x.replace(second=51)
    return x


# temporary column formated_dt holding the start of each row's range
df['formated_dt'] = df['Timestamp'].apply(sec_to_group)
# number the groups with ngroup(), then drop the temporary column
df['Group'] = df.groupby('formated_dt').ngroup()
df.drop(columns=['formated_dt'], inplace=True)
print(df)

Output:

#  Name  Val1           Timestamp  Group
# 0    A     1 2019-01-10 00:20:21      0  <- ngroup() numbers groups starting from 0
# 1    A     2 2019-01-10 00:20:21      0
# 2    B     1 2019-01-10 00:20:26      0
# 3    C     2 2019-01-10 14:40:45      1
# 4    B     3 2019-02-10 09:04:06      2
# ....
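
If the bins have to start at the earliest timestamp, as the question describes, a vectorized variant might look like the sketch below. It is not part of the code above: it assumes the timestamps are day-first strings ('%d-%m-%Y %H:%M:%S'), computes each row's 10-second bin index relative to the minimum timestamp, and uses a dense rank to turn the bin indices into consecutive group numbers starting at 1. The names start and bin_index are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Key": range(101, 111),
    "Timestamp": [
        "01-10-2019 00:20:21", "01-10-2019 00:20:21", "01-10-2019 00:20:26",
        "01-10-2019 14:40:45", "02-10-2019 09:04:06", "02-10-2019 09:04:12",
        "02-10-2019 09:04:14", "02-10-2019 09:04:20", "02-10-2019 09:04:22",
        "02-10-2019 10:40:49",
    ],
})
# Assumption: the dates in the question are day-first (dd-mm-yyyy).
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%d-%m-%Y %H:%M:%S")

# Index of the 10-second bin each row falls into, counted from the earliest timestamp.
start = df["Timestamp"].min()
bin_index = (df["Timestamp"] - start) // pd.Timedelta(seconds=10)

# Dense rank maps the (sparse) bin indices to consecutive group numbers 1, 2, 3, ...
df["Group"] = bin_index.rank(method="dense").astype(int)
print(df)
```

On the sample data this should reproduce the Group column from the question. If gaps in the numbering are acceptable, bin_index + 1 can be used directly and the rank step dropped.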

You can also try pd.Grouper (which replaced TimeGrouper, removed in recent pandas versions) or resample.
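
A minimal sketch of the grouper-based route, using pd.Grouper. It assumes pandas 1.1 or later, where the origin argument exists; origin="start" anchors the first 10-second bin at the earliest timestamp. The dates are again assumed to be day-first.

```python
import pandas as pd

df = pd.DataFrame({
    "Key": range(101, 111),
    "Timestamp": pd.to_datetime(
        [
            "01-10-2019 00:20:21", "01-10-2019 00:20:21", "01-10-2019 00:20:26",
            "01-10-2019 14:40:45", "02-10-2019 09:04:06", "02-10-2019 09:04:12",
            "02-10-2019 09:04:14", "02-10-2019 09:04:20", "02-10-2019 09:04:22",
            "02-10-2019 10:40:49",
        ],
        format="%d-%m-%Y %H:%M:%S",  # assumption: day-first dates
    ),
})

# 10-second bins, with the first bin starting at the earliest timestamp
# (assumes pandas >= 1.1 for the `origin` argument).
grouper = pd.Grouper(key="Timestamp", freq="10s", origin="start")
df["Group"] = df.groupby(grouper).ngroup()
print(df)
```

The group numbers here may not be consecutive, because bins that contain no rows can still be counted. The question allows that; otherwise pd.factorize(df["Group"])[0] + 1 renumbers them from 1.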

Hope this helps.
