
Frequency of repetitive position in pandas data frame

Hi, I am trying to find repeated positions in the following data frame:

import pandas as pd

data = pd.DataFrame()
data['league'] = ['A','A','A','A','A','A','B','B','B']
data['Team'] = ['X','X','X','Y','Y','Y','Z','Z','Z']
data['week'] = [1,2,3,1,2,3,1,2,3]
data['position'] = [1,1,2,2,2,1,2,3,4]

I want to compare the position in each row with the position in the previous row. If it is the same, I will assign 0. If it is different from the previous row, I will assign 1.

My expected outcome is as follows:

[image: expected output table, not available]

That is, I want to group by (league, Team and week) and work out the frequency. Can anyone advise how to do that in pandas?

Thanks,

Zep

Use diff and abs with fillna:

data['frequency'] = data['position'].diff().abs().fillna(0,downcast='infer')

print(data)
  league Team  week  position  frequency
0      A    X     1         1          0
1      A    X     2         1          0
2      A    X     3         2          1
3      A    Y     1         2          0
4      A    Y     2         2          0
5      A    Y     3         1          1
6      B    Z     1         2          1
7      B    Z     2         3          1
8      B    Z     3         4          1
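
Note that diff().abs() returns the size of the change, not a 0/1 flag. It matches the output above only because consecutive positions in the sample never differ by more than 1. A minimal sketch of the difference, using a made-up frame (the `toy` data and the `change_size` column are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data: the position jumps from 1 to 4, a change of 3.
toy = pd.DataFrame({"position": [1, 1, 4, 4, 2]})

# diff() is NaN for the first row, which has no previous row.
step = toy["position"].diff()

# Magnitude of the change, as produced by the diff/abs/fillna approach.
toy["change_size"] = step.abs().fillna(0).astype(int)

# Collapse the magnitude into a flag: 1 when the position changed, else 0.
toy["frequency"] = (toy["change_size"] != 0).astype(int)

print(toy)
```

If the positions can move by more than one place between rows, the flag column is probably what the question is after.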

Using groupby gives all zeros, because you are then comparing within groups rather than across the whole DataFrame, and each (league, Team, week) group here contains only one row.

data.groupby(['league', 'Team', 'week'])['position'].diff().fillna(0,downcast='infer')

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
Name: position, dtype: int64
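
The all-zero result can be checked directly. The sketch below rebuilds the question's data and inspects the group sizes; the names `group_sizes` and `within` are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "league": ["A"] * 6 + ["B"] * 3,
    "Team": ["X"] * 3 + ["Y"] * 3 + ["Z"] * 3,
    "week": [1, 2, 3] * 3,
    "position": [1, 1, 2, 2, 2, 1, 2, 3, 4],
})

# Every (league, Team, week) combination occurs exactly once here.
group_sizes = data.groupby(["league", "Team", "week"]).size()
print(group_sizes.max())

# With one row per group, diff() has no previous row to compare with,
# so every value is NaN before fillna turns it into 0.
within = data.groupby(["league", "Team", "week"])["position"].diff()
print(within.isna().all())
```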

Use diff, and compare against 0:

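df = data  # the question's DataFrame, called df in this answer
v = df.position.diff()
v[0] = 0  # the first row has no previous row, so diff() gives NaN there
df['frequency'] = v.ne(0).astype(int)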

print(df)
  league Team  week  position  frequency
0      A    X     1         1          0
1      A    X     2         1          0
2      A    X     3         2          1
3      A    Y     1         2          0
4      A    Y     2         2          0
5      A    Y     3         1          1
6      B    Z     1         2          1
7      B    Z     2         3          1
8      B    Z     3         4          1
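
One caveat: `v[0] = 0` addresses the row by the index label 0, so it assumes the default integer index. A hedged variant that sets the first row by position instead, shown on a made-up frame whose index starts at 10 (the data here is hypothetical):

```python
import pandas as pd

# Hypothetical frame whose index labels do not start at 0.
df = pd.DataFrame({"position": [1, 1, 2, 2, 1]}, index=[10, 11, 12, 13, 14])

v = df["position"].diff()

# Set the first element by position rather than by label.
v.iloc[0] = 0

# 1 where the position differs from the previous row, 0 otherwise.
df["frequency"] = v.ne(0).astype(int)

print(df)
```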

For performance reasons, you should try to avoid a fillna call.

df = pd.concat([df] * 100000, ignore_index=True)

%timeit df['frequency'] = df['position'].diff().abs().fillna(0,downcast='infer')

%%timeit
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)

83.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
10.9 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

To extend this answer to work in a groupby, use:

import numpy as np

v = df.groupby(['league', 'Team', 'week']).position.diff()
v[np.isnan(v)] = 0  # the first row of each group has no previous row

df['frequency'] = v.ne(0).astype(int)
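
With the sample data, grouping on all three keys leaves one row per group, so the flag is 0 everywhere. If the intent is to compare each team's position from week to week, one possible reading is to group by league and Team only. That grouping is an assumption, not something the question confirms, and it differs from the whole-frame result at row 6, where team Z's first week is no longer compared with team Y's last week.

```python
import pandas as pd

data = pd.DataFrame({
    "league": ["A"] * 6 + ["B"] * 3,
    "Team": ["X"] * 3 + ["Y"] * 3 + ["Z"] * 3,
    "week": [1, 2, 3] * 3,
    "position": [1, 1, 2, 2, 2, 1, 2, 3, 4],
})

# Assumed grouping: one group per team, rows already sorted by week.
v = data.groupby(["league", "Team"])["position"].diff()

# The first week of each team is NaN (no previous week), so flag it as 0.
data["frequency"] = (v.ne(0) & v.notna()).astype(int)

print(data)
```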
