Frequency of repetitive position in pandas data frame
Hi, I am trying to find repeated positions in the following data frame:
import pandas as pd
data = pd.DataFrame()
data['league'] = ['A','A','A','A','A','A','B','B','B']
data['Team'] = ['X','X','X','Y','Y','Y','Z','Z','Z']
data['week'] = [1,2,3,1,2,3,1,2,3]
data['position'] = [1,1,2,2,2,1,2,3,4]
I want to compare each row's position with the previous row. If it is the same, I will assign 0. If it differs from the previous row, I will assign 1.
My expected outcome will be as follows:
It means I will group by (league, Team and week) and work out the frequency. Can anyone advise how to do that in pandas?
Thanks,
Zep
Use diff and abs with fillna:
data['frequency'] = data['position'].diff().abs().fillna(0,downcast='infer')
print(data)
  league Team  week  position  frequency
0      A    X     1         1          0
1      A    X     2         1          0
2      A    X     3         2          1
3      A    Y     1         2          0
4      A    Y     2         2          0
5      A    Y     3         1          1
6      B    Z     1         2          1
7      B    Z     2         3          1
8      B    Z     3         4          1
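Using groupby on all three keys gives all zeros, because diff then compares within each group rather than across the whole DataFrame, and every (league, Team, week) group here holds a single row with no previous row to compare against.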
data.groupby(['league', 'Team', 'week'])['position'].diff().fillna(0,downcast='infer')
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Name: position, dtype: int64
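To see why the grouped result is all zeros, it can help to count the rows in each group. The following is a minimal sketch that rebuilds the question's frame; the names `sizes` and `grouped_diff` are illustrative and not part of the original answer.

```python
import pandas as pd

# Rebuild the frame from the question
data = pd.DataFrame({
    'league': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Team': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
    'week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'position': [1, 1, 2, 2, 2, 1, 2, 3, 4],
})

# Count the rows in every (league, Team, week) group
sizes = data.groupby(['league', 'Team', 'week']).size()
print(sizes.max())

# With one row per group, diff has no previous row and yields NaN everywhere,
# which fillna(0) then turns into zeros
grouped_diff = data.groupby(['league', 'Team', 'week'])['position'].diff()
print(grouped_diff.isna().all())
```

Since each combination of the three keys occurs only once in this data, the grouped diff never has a pair of rows to compare.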
Use diff, and compare against 0:
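# df is the DataFrame from the question (named data there)
v = df.position.diff()
v[0] = 0  # the first row has no previous row, so treat it as unchanged
df['frequency'] = v.ne(0).astype(int)
print(df)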
  league Team  week  position  frequency
0      A    X     1         1          0
1      A    X     2         1          0
2      A    X     3         2          1
3      A    Y     1         2          0
4      A    Y     2         2          0
5      A    Y     3         1          1
6      B    Z     1         2          1
7      B    Z     2         3          1
8      B    Z     3         4          1
For performance reasons, you should try to avoid a fillna call.
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['frequency'] = df['position'].diff().abs().fillna(0,downcast='infer')
83.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
10.9 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
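The timings above rely on IPython magics. As a rough sketch of how the same comparison could be run as a plain script, the standard-library `timeit` module can be used. The function names `with_fillna` and `with_assignment`, the tiling factor, and the repeat count are assumptions chosen for illustration. The `downcast` argument is left out for simplicity, so the first function returns floats. Absolute numbers will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Tile the question's position column to get a larger test series
df = pd.DataFrame({'position': [1, 1, 2, 2, 2, 1, 2, 3, 4] * 10000})

def with_fillna():
    # First answer's approach: absolute difference, NaN replaced by 0
    return df['position'].diff().abs().fillna(0)

def with_assignment():
    # Second answer's approach: overwrite the leading NaN, then test against 0
    v = df['position'].diff()
    v.iloc[0] = 0
    return v.ne(0).astype(int)

t_fillna = timeit.timeit(with_fillna, number=20)
t_assign = timeit.timeit(with_assignment, number=20)
print(f"fillna: {t_fillna:.4f} s, assignment: {t_assign:.4f} s")
```

Note that the two functions are not interchangeable in general: the first returns the size of each jump, while the second returns a 0/1 flag. They agree only on which rows changed.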
To extend this answer to work in a groupby, use:
import numpy as np
v = df.groupby(['league', 'Team', 'week']).position.diff()
v[np.isnan(v)] = 0  # the first row of each group has no previous row
df['frequency'] = v.ne(0).astype(int)
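Because week is part of the grouping key, each group in the sample data has one row and the flag is 0 everywhere. If the intent is instead to restart the comparison for each team, one possible variation is to group on league and Team only. That choice of keys is an assumption here, not something the question or answers confirm. A minimal sketch:

```python
import pandas as pd

# Rebuild the frame from the question
df = pd.DataFrame({
    'league': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Team': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
    'week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'position': [1, 1, 2, 2, 2, 1, 2, 3, 4],
})

# Difference from the previous row within each (league, Team) group;
# the first row of every group becomes NaN
v = df.groupby(['league', 'Team'])['position'].diff()

# Flag a change only where a previous row exists and the difference is non-zero
df['frequency'] = (v.notna() & v.ne(0)).astype(int)
print(df)
```

With this grouping, row 6 (the first row for team Z) is flagged 0 rather than 1, since it is no longer compared with the last row of team Y.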