Counting recurring sequences in pandas
I am new to identifying patterns with Python and could use some direction. I have a large data set whose sample I've pasted below:
My objective is to find any sequential pattern of 'foo' 'bar' 'baz', and to count recurring patterns of 'foo' 'bar' 'baz' if the pattern repeats itself multiple times, while grouping by id.
id class_name created_at
0 1 foo 2019-02-08 19:11:04
1 1 bar 2019-02-08 19:11:34
2 1 foo 2019-02-08 19:12:04
3 1 baz 2019-02-08 19:12:35
4 1 bar 2019-02-08 19:13:05
5 1 foo 2019-02-08 19:13:35
6 1 bar 2019-02-08 19:14:04
7 1 baz 2019-02-08 19:14:35
8 1 foo 2019-02-08 19:15:05
9 1 bar 2019-02-08 19:15:35
10 1 baz 2019-02-08 19:16:03
11 2 foo 2019-02-08 19:16:34
12 2 bar 2019-02-08 19:17:07
13 2 foo 2019-02-08 19:17:42
14 2 bar 2019-02-08 19:18:04
15 2 baz 2019-02-08 19:18:34
16 2 baz 2019-02-08 19:19:04
17 2 bar 2019-02-08 19:19:34
18 2 bar 2019-02-08 19:20:04
19 2 foo 2019-02-08 19:20:34
For example, the output from the above dataset would look something like:
id count start_time end_time
1 2 2019-02-08 19:13:35 2019-02-08 19:16:03
2 1 2019-02-08 19:17:42 2019-02-08 19:18:34
The column types are as follows:
id int64
class_name object
created_at datetime64[ns]
dtype: object
What modules would be best suited for this task?
Here is the data:
{'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'class_name': {0: 'foo', 1: 'bar', 2: 'foo', 3: 'baz', 4: 'bar', 5: 'foo', 6: 'bar', 7: 'baz', 8: 'foo', 9: 'bar', 10: 'baz', 11: 'foo', 12: 'bar', 13: 'foo', 14: 'bar', 15: 'baz', 16: 'baz', 17: 'bar', 18: 'bar', 19: 'foo'}, 'created_at': {0: Timestamp('2019-02-08 19:11:04'), 1: Timestamp('2019-02-08 19:11:34'), 2: Timestamp('2019-02-08 19:12:04'), 3: Timestamp('2019-02-08 19:12:35'), 4: Timestamp('2019-02-08 19:13:05'), 5: Timestamp('2019-02-08 19:13:35'), 6: Timestamp('2019-02-08 19:14:04'), 7: Timestamp('2019-02-08 19:14:35'), 8: Timestamp('2019-02-08 19:15:05'), 9: Timestamp('2019-02-08 19:15:35'), 10: Timestamp('2019-02-08 19:16:03'), 11: Timestamp('2019-02-08 19:16:34'), 12: Timestamp('2019-02-08 19:17:07'), 13: Timestamp('2019-02-08 19:17:42'), 14: Timestamp('2019-02-08 19:18:04'), 15: Timestamp('2019-02-08 19:18:34'), 16: Timestamp('2019-02-08 19:19:04'), 17: Timestamp('2019-02-08 19:19:34'), 18: Timestamp('2019-02-08 19:20:04'), 19: Timestamp('2019-02-08 19:20:34')}}
Takes a few steps but ultimately gets there.
Initialize Data:
import pandas as pd
from pandas import Timestamp
import numpy as np
dict_ ={'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'class_name': {0: 'foo', 1: 'bar', 2: 'foo', 3: 'baz', 4: 'bar', 5: 'foo', 6: 'bar', 7: 'baz', 8: 'foo', 9: 'bar', 10: 'baz', 11: 'foo', 12: 'bar', 13: 'foo', 14: 'bar', 15: 'baz', 16: 'baz', 17: 'bar', 18: 'bar', 19: 'foo'}, 'created_at': {0: Timestamp('2019-02-08 19:11:04'), 1: Timestamp('2019-02-08 19:11:34'), 2: Timestamp('2019-02-08 19:12:04'), 3: Timestamp('2019-02-08 19:12:35'), 4: Timestamp('2019-02-08 19:13:05'), 5: Timestamp('2019-02-08 19:13:35'), 6: Timestamp('2019-02-08 19:14:04'), 7: Timestamp('2019-02-08 19:14:35'), 8: Timestamp('2019-02-08 19:15:05'), 9: Timestamp('2019-02-08 19:15:35'), 10: Timestamp('2019-02-08 19:16:03'), 11: Timestamp('2019-02-08 19:16:34'), 12: Timestamp('2019-02-08 19:17:07'), 13: Timestamp('2019-02-08 19:17:42'), 14: Timestamp('2019-02-08 19:18:04'), 15: Timestamp('2019-02-08 19:18:34'), 16: Timestamp('2019-02-08 19:19:04'), 17: Timestamp('2019-02-08 19:19:34'), 18: Timestamp('2019-02-08 19:20:04'), 19: Timestamp('2019-02-08 19:20:34')}}
df=pd.DataFrame(dict_)
I shift the end date back two spots so that we have a beginning and an end for each set of three steps. I do this within the groups to maintain continuity:
df['end_time'] = df.groupby('id')['created_at'].shift(-2)
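Because the shift happens within each id group, the last two rows of every group get NaT rather than borrowing timestamps from the next group. A minimal sketch on made-up data (not part of the original answer) shows that:

```python
import pandas as pd

# Two ids, three rows each; shifting by -2 within groups leaves the
# last two rows of each group as NaT instead of leaking timestamps
# across the id boundary.
demo = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'created_at': pd.to_datetime([
        '2019-02-08 19:11:04', '2019-02-08 19:11:34', '2019-02-08 19:12:04',
        '2019-02-08 19:16:34', '2019-02-08 19:17:07', '2019-02-08 19:17:42',
    ]),
})
demo['end_time'] = demo.groupby('id')['created_at'].shift(-2)
print(demo['end_time'].isna().tolist())  # NaT at the tail of each group
```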
To find spots where we have a sequential ['foo', 'bar', 'baz'], I zip together df['class_name'] along with shift(-1) and shift(-2) of class_name:
[[x,y,z] for x,y,z in zip(df['class_name'], df['class_name'].shift(-1), df['class_name'].shift(-2))]
[['foo', 'bar', 'foo'],
['bar', 'foo', 'baz'],
['foo', 'baz', 'bar'],
['baz', 'bar', 'foo'],
['bar', 'foo', 'bar'],
['foo', 'bar', 'baz'],
['bar', 'baz', 'foo'],
['baz', 'foo', 'bar'],
['foo', 'bar', 'baz'],
['bar', 'baz', 'foo'],
['baz', 'foo', 'bar'],
['foo', 'bar', 'foo'],
['bar', 'foo', 'bar'],
['foo', 'bar', 'baz'],
['bar', 'baz', 'baz'],
['baz', 'baz', 'bar'],
['baz', 'bar', 'bar'],
['bar', 'bar', 'foo'],
['bar', 'foo', nan],
['foo', nan, nan]]
I then convert that to a numpy array and compare it with what we're looking for:
matches = np.array([[x,y,z] for x,y,z in zip(df['class_name'], df['class_name'].shift(-1), df['class_name'].shift(-2))]) == ['foo', 'bar', 'baz']
array([[ True, True, False],
[False, False, True],
[ True, False, False],
[False, True, False],
[False, False, False],
[ True, True, True],
[False, False, False],
[False, False, False],
[ True, True, True],
[False, False, False],
[False, False, False],
[ True, True, False],
[False, False, False],
[ True, True, True],
[False, False, True],
[False, False, False],
[False, True, False],
[False, True, False],
[False, False, False],
[ True, False, False]])
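As an aside, the same starting-point test can be computed without materializing the array at all, by AND-ing three shifted comparisons on the Series. This is an alternative sketch, not one of the answer's steps:

```python
import pandas as pd

s = pd.Series(['foo', 'bar', 'foo', 'baz', 'bar', 'foo', 'bar', 'baz'])
# True where a window of three rows starting here reads foo, bar, baz.
# NaN values introduced by the shifts compare as False, so the tail is safe.
starts = (s == 'foo') & (s.shift(-1) == 'bar') & (s.shift(-2) == 'baz')
print(starts.tolist())
```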
Then, to get the subsetting vector, I just .all() compare the array. This will give us the starting points:
vec = [row.all() for row in matches]
[False,
False,
False,
False,
False,
True,
False,
False,
True,
False,
False,
False,
False,
True,
False,
False,
False,
False,
False,
False]
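Equivalently, the list comprehension can be replaced with NumPy's row-wise reduction, matches.all(axis=1). A small self-contained check on a toy array:

```python
import numpy as np

# Broadcasting compares each row against the target triple elementwise.
matches = np.array([['foo', 'bar', 'foo'],
                    ['foo', 'bar', 'baz'],
                    ['bar', 'baz', 'foo']]) == ['foo', 'bar', 'baz']
vec = matches.all(axis=1)  # a row is True only when all three positions match
print(vec.tolist())
```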
Now we subset and inspect:
subset = df.loc[vec]
id class_name created_at end_time
5 1 foo 2019-02-08 19:13:35 2019-02-08 19:14:35
8 1 foo 2019-02-08 19:15:05 2019-02-08 19:16:03
13 2 foo 2019-02-08 19:17:42 2019-02-08 19:18:34
Since we want grouped versions, we can just groupby and agg to get the final result.
subset.groupby('id').agg({'class_name':'count', 'created_at':'min', 'end_time':'max'})
class_name created_at end_time
id
1 2 2019-02-08 19:13:35 2019-02-08 19:16:03
2 1 2019-02-08 19:17:42 2019-02-08 19:18:34
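Putting the steps above together, the whole approach condenses to a few lines. This is just the answer's own logic restated end-to-end (using shifted comparisons in place of the zip; note the class_name shifts here, like the answer's, do not stop at group boundaries, though the end_time shift does):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1]*11 + [2]*9,
    'class_name': ['foo', 'bar', 'foo', 'baz', 'bar', 'foo', 'bar', 'baz',
                   'foo', 'bar', 'baz', 'foo', 'bar', 'foo', 'bar', 'baz',
                   'baz', 'bar', 'bar', 'foo'],
    'created_at': pd.to_datetime([
        '2019-02-08 19:11:04', '2019-02-08 19:11:34', '2019-02-08 19:12:04',
        '2019-02-08 19:12:35', '2019-02-08 19:13:05', '2019-02-08 19:13:35',
        '2019-02-08 19:14:04', '2019-02-08 19:14:35', '2019-02-08 19:15:05',
        '2019-02-08 19:15:35', '2019-02-08 19:16:03', '2019-02-08 19:16:34',
        '2019-02-08 19:17:07', '2019-02-08 19:17:42', '2019-02-08 19:18:04',
        '2019-02-08 19:18:34', '2019-02-08 19:19:04', '2019-02-08 19:19:34',
        '2019-02-08 19:20:04', '2019-02-08 19:20:34']),
})

# End of each 3-row window, computed within each id group.
df['end_time'] = df.groupby('id')['created_at'].shift(-2)
s = df['class_name']
# True at rows that start a foo, bar, baz run.
vec = (s == 'foo') & (s.shift(-1) == 'bar') & (s.shift(-2) == 'baz')
result = df.loc[vec].groupby('id').agg(
    count=('class_name', 'count'),
    start_time=('created_at', 'min'),
    end_time=('end_time', 'max'),
)
print(result)
```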
Adding a sequence-comparison method, you can use rolling().
df['class_name'] = pd.factorize(df['class_name'])[0]

def custom_func(frame):
    frame['match'] = frame['class_name'].rolling(3).apply(lambda x: np.array_equal(x, [0, 1, 2]), raw=True)
    frame['start_time'] = frame['created_at'].shift(2)
    frame = frame[frame['match'] == 1].agg({'match': 'count', 'start_time': 'min', 'created_at': 'max'})
    return frame

df = df.groupby('id').apply(lambda frame: custom_func(frame)).rename(columns={'match': 'count', 'created_at': 'end_time'})
print(df)
count start_time end_time
id
1 2 2019-02-08 19:13:35 2019-02-08 19:16:03
2 1 2019-02-08 19:17:42 2019-02-08 19:18:34
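One caveat worth noting: pd.factorize assigns integer codes in order of first appearance, so comparing against [0, 1, 2] only works because 'foo', 'bar', 'baz' happen to first appear in that order in this data. A small sketch of that behavior:

```python
import pandas as pd

# Codes follow order of first appearance: foo -> 0, bar -> 1, baz -> 2.
codes, uniques = pd.factorize(['foo', 'bar', 'foo', 'baz', 'bar'])
print(codes.tolist())
print(list(uniques))
```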
A simple solution could be:
import re
df.groupby("id").apply(lambda x: len(re.findall("foo bar baz", ' '.join(x['class_name']))))
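This works because re.findall counts non-overlapping occurrences of the joined pattern; it assumes class names contain no spaces (and a name like 'bazooka' would still match 'baz' unless boundaries were added to the pattern). A self-contained sketch on assumed data, not the question's, using the column-only form of the groupby:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'class_name': ['foo', 'bar', 'baz', 'foo', 'bar', 'baz',
                   'foo', 'bar', 'foo'],
})
# Join each group's class names into one string, then count the pattern.
counts = df.groupby('id')['class_name'].apply(
    lambda s: len(re.findall('foo bar baz', ' '.join(s)))
)
print(counts.to_dict())
```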