Python: Counting cumulative occurrences of values in a pandas series
I have a DataFrame that looks like this:
fruit
0 orange
1 orange
2 orange
3 pear
4 orange
5 apple
6 apple
7 pear
8 pear
9 orange
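For reference, the example frame can be reconstructed like this (a minimal sketch, assuming pandas is imported as pd):
import pandas as pd

# rebuild the example data shown above
df = pd.DataFrame({'fruit': ['orange', 'orange', 'orange', 'pear', 'orange',
                             'apple', 'apple', 'pear', 'pear', 'orange']})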
I want to add a column that counts the cumulative occurrences of each value, i.e.
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
At the moment I'm doing it like this:
df['cum_count'] = [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
... which is fine for 10 rows, but takes a really long time when I'm trying to do the same thing with a few million rows. Is there a more efficient way to do this?
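Note that newer pandas versions have removed Series.iteritems; Series.items is the equivalent, so the same baseline would be written as:
df['cum_count'] = [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.items()]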
You could use groupby and cumcount:
df['cum_count'] = df.groupby('fruit').cumcount() + 1
In [16]: df
Out[16]:
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
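For context, cumcount numbers the rows within each group starting from 0, which is why the answer adds 1. A quick sketch of the 0-based result on the same frame:
df.groupby('fruit').cumcount()
# 0    0
# 1    1
# 2    2
# 3    0
# 4    3
# 5    0
# 6    1
# 7    1
# 8    2
# 9    4
# dtype: int64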
Timing
In [8]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
100 loops, best of 3: 3.76 ms per loop
In [9]: %timeit df.groupby('fruit').cumcount() + 1
1000 loops, best of 3: 926 µs per loop
So it's about 4 times faster.
It may be better to use groupby with cumcount and specify the column, because it is more efficient:
df['cum_count'] = df.groupby('fruit')['fruit'].cumcount() + 1
print(df)
fruit cum_count
0 orange 1
1 orange 2
2 orange 3
3 pear 1
4 orange 4
5 apple 1
6 apple 2
7 pear 2
8 pear 3
9 orange 5
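For what it's worth, selecting ['fruit'] only changes which data the groupby carries along; both forms return the same counts on this frame:
# the column-selected and whole-frame versions agree here
(df.groupby('fruit')['fruit'].cumcount() == df.groupby('fruit').cumcount()).all()  # True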
Comparing at len(df) = 10, my solution is the fastest:
In [3]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
The slowest run took 11.67 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 299 µs per loop
In [4]: %timeit df.groupby('fruit').cumcount() + 1
The slowest run took 12.78 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 921 µs per loop
In [5]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
The slowest run took 4.47 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.72 ms per loop
Comparing at len(df) = 10k:
In [7]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 845 µs per loop
In [8]: %timeit df.groupby('fruit').cumcount() + 1
The slowest run took 5.59 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 1.59 ms per loop
In [9]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
1 loops, best of 3: 5.12 s per loop
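To check the claim at the scale from the question, a rough sketch of a larger benchmark (assuming numpy is available; the row count and fruit labels are just for illustration):
import numpy as np
import pandas as pd

# a few million rows of random fruit labels
n = 2000000
df = pd.DataFrame({'fruit': np.random.choice(['orange', 'pear', 'apple'], size=n)})

# the vectorised groupby/cumcount approach stays fast at this size,
# while the per-row list comprehension from the question is O(n^2)
df['cum_count'] = df.groupby('fruit')['fruit'].cumcount() + 1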