I have a DataFrame like the one below:
Time col1 col2 col3
2 a x 10
3 b y 11
1 a x 10
6 c z 12
20 c x 13
23 a y 24
14 c x 13
16 b y 11
...
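I want to add a column whose value for each row depends on the other rows: the number of rows (including the row itself) that have the same col1, col2 and col3 and whose Time is within 10 of that row's Time. This is the desired output: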
Time col1 col2 col3 cumVal
2 a x 10 2
3 b y 11 1
1 a x 10 2
6 c z 12 1
20 c x 13 2
23 a y 24 1
14 c x 13 2
16 b y 11 1
...
Here is my attempt:
    df['cumVal'] = 0
    for index, row in df.iterrows():
        min1 = row['Time'] - 10
        max1 = row['Time'] + 10
        # rows that share this row's col1, col2 and col3
        ndf = df[(df.col1 == row.col1) & (df.col2 == row.col2) & (df.col3 == row.col3)]
        # write with df.loc; df.iloc[index]['cumVal'] = ... would only modify a temporary copy
        df.loc[index, 'cumVal'] = len(ndf.query('@min1 <= Time <= @max1'))
It is very slow, though. Could anybody change my code to make it faster?
You can use groupby on 'col1', 'col2' and 'col3' and, in a transform per group, use np.subtract.outer to calculate all the pairwise differences between the values in the 'Time' column of that group. Taking np.abs of those differences, comparing with <= 10 and applying np.sum on axis=0 then gives, for each value, how many values in its group are within +/- 10.
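    import numpy as np

    # to_numpy() passes plain arrays, since recent pandas versions
    # do not support ufunc.outer on a Series
    df['cumVal'] = (df.groupby(['col1', 'col2', 'col3'])['Time']
                      .transform(lambda x: (np.abs(np.subtract.outer(x.to_numpy(), x.to_numpy())) <= 10).sum(0)))
    print(df)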
Time col1 col2 col3 cumVal
0 2.0 a x 10.0 2.0
1 3.0 b y 11.0 1.0
2 1.0 a x 10.0 2.0
3 6.0 c z 12.0 1.0
4 20.0 c x 13.0 2.0
5 23.0 a y 24.0 1.0
6 14.0 c x 13.0 2.0
7 16.0 b y 11.0 1.0
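To make the outer-difference step more concrete, here is a minimal sketch of what happens inside a single group, using the Time values of the (c, x, 13) rows from the sample data. The variable names are illustrative only.

```python
import numpy as np

# Time values of the rows where col1 == 'c', col2 == 'x', col3 == 13
times = np.array([20, 14])

# diffs[i, j] = times[i] - times[j] for every pair of rows in the group
diffs = np.subtract.outer(times, times)

# True where two rows are at most 10 apart (every row matches itself)
within = np.abs(diffs) <= 10

# Column sums: for each row, the number of rows inside its +/- 10 window
counts = within.sum(axis=0)
print(counts)  # [2 2]
```

Note that the difference matrix has n x n entries for a group of n rows, so memory use grows quadratically with group size.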
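This should give better performance, because the Time window and the three column matches are applied in a single boolean mask instead of a separate query call:
    df['cumVal'] = 0
    for index, row in df.iterrows():
        min1 = row['Time'] - 10
        max1 = row['Time'] + 10
        # one mask: Time window (inclusive, as in the question) plus matching col1, col2, col3
        ndf = df[(df.Time >= min1) & (df.Time <= max1) & (df.col1 == row.col1)
                 & (df.col2 == row.col2) & (df.col3 == row.col3)]
        # write with df.loc; df.iloc[index]['cumVal'] = ... would only modify a temporary copy
        df.loc[index, 'cumVal'] = len(ndf)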
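If individual groups are large, both a row-by-row loop and a full pairwise comparison may become expensive. As an alternative that is not taken from the answers above, the following sketch sorts each group's Time values and uses np.searchsorted to count the values in each window. The helper name count_within and its window parameter are hypothetical, and the DataFrame is rebuilt from the sample data in the question.

```python
import numpy as np
import pandas as pd

def count_within(times, window=10):
    """For each value, count the values in the same group within +/- window (inclusive)."""
    t = times.to_numpy()
    s = np.sort(t)
    # index of the first sorted value >= t - window
    lo = np.searchsorted(s, t - window, side='left')
    # index just past the last sorted value <= t + window
    hi = np.searchsorted(s, t + window, side='right')
    return pd.Series(hi - lo, index=times.index)

# Sample data from the question
df = pd.DataFrame({
    'Time': [2, 3, 1, 6, 20, 23, 14, 16],
    'col1': list('abaccacb'),
    'col2': list('xyxzxyxy'),
    'col3': [10, 11, 10, 12, 13, 24, 13, 11],
})

df['cumVal'] = df.groupby(['col1', 'col2', 'col3'])['Time'].transform(count_within)
print(df)
```

On the sample data this reproduces the cumVal column from the question. Sorting costs O(n log n) per group and avoids building an n x n matrix, which may matter when a group has many rows.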