Python: Counting cumulative occurrences of values in a pandas series
I have a DataFrame that looks like this:
fruit
0 orange
1 orange
2 orange
3 pear
4 orange
5 apple
6 apple
7 pear
8 pear
9 orange
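For reference, the example frame can be reconstructed like this (a minimal sketch, assuming pandas is imported as pd):
import pandas as pd

# rebuild the example data shown above
df = pd.DataFrame({'fruit': ['orange', 'orange', 'orange', 'pear', 'orange',
                             'apple', 'apple', 'pear', 'pear', 'orange']})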
I want to add a column that counts the cumulative occurrences of each value, i.e.
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
At the moment I'm doing it like this:
df['cum_count'] = [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
... which is fine for 10 rows, but takes a really long time when I'm trying to do the same thing with a few million rows. Is there a more efficient way to do this?
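Note that newer pandas versions have removed Series.iteritems; Series.items is the equivalent, so the same baseline would be written as:
df['cum_count'] = [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.items()]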
You could use groupby and cumcount:
df['cum_count'] = df.groupby('fruit').cumcount() + 1
In [16]: df
Out[16]:
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
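For context, cumcount numbers the rows within each group starting from 0, which is why the answer adds 1. A quick sketch of the 0-based result on the same frame:
df.groupby('fruit').cumcount()
# 0    0
# 1    1
# 2    2
# 3    0
# 4    3
# 5    0
# 6    1
# 7    1
# 8    2
# 9    4
# dtype: int64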
Timing
In [8]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
100 loops, best of 3: 3.76 ms per loop
In [9]: %timeit df.groupby('fruit').cumcount() + 1
1000 loops, best of 3: 926 µs per loop
So it's about 4 times faster.
It may be better to use groupby with cumcount and specify the column, because it is more efficient:
df['cum_count'] = df.groupby('fruit')['fruit'].cumcount() + 1
print(df)
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
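For what it's worth, selecting ['fruit'] only changes which data the groupby carries along; both forms return the same counts on this frame:
# the column-selected and whole-frame versions agree here
(df.groupby('fruit')['fruit'].cumcount() == df.groupby('fruit').cumcount()).all()  # True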
Comparing at len(df) = 10, my solution is the fastest:
In [3]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
The slowest run took 11.67 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 299 µs per loop
In [4]: %timeit df.groupby('fruit').cumcount() + 1
The slowest run took 12.78 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 921 µs per loop
In [5]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
The slowest run took 4.47 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.72 ms per loop
Comparing at len(df) = 10k:
In [7]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 845 µs per loop
In [8]: %timeit df.groupby('fruit').cumcount() + 1
The slowest run took 5.59 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 1.59 ms per loop
In [9]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
1 loops, best of 3: 5.12 s per loop
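To check the claim at the scale from the question, a rough sketch of a larger benchmark (assuming numpy is available; the row count and fruit labels are just for illustration):
import numpy as np
import pandas as pd

# a few million rows of random fruit labels
n = 2000000
df = pd.DataFrame({'fruit': np.random.choice(['orange', 'pear', 'apple'], size=n)})

# the vectorised groupby/cumcount approach stays fast at this size,
# while the per-row list comprehension from the question is O(n^2)
df['cum_count'] = df.groupby('fruit')['fruit'].cumcount() + 1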