I want to count the observations within each subject in a pandas DataFrame

Question

I am quite new to pandas and to Python in general.

I have a hierarchical data set with several subjects, each of whom has some number of observations. The full df is about half a million rows.

I want to calculate the observation number...

## toy problem

from pandas import Series, DataFrame

d = {'one': Series(['a', 'a', 'a', 'b', 'b', 'b'], index=[0, 1, 2, 3, 4, 5]),
     'two': Series([1.1, 2.5, 3.3, 2.5, 3.3, 9.5], index=[0, 1, 2, 3, 4, 5])}
df = DataFrame(d)

# For each subject, print the within-subject observation number j
for i in df.one.unique():
    for j in range(0, len(df[df.one == i])):
        print(j)

So I want to store j in a new column, one value per row. I have no problem calculating j, but I cannot figure out how to assign it. I have tried using iloc, which is incredibly slow, and writing to a list and then joining that to the df, which is also really slow (currently running for over 30 minutes and counting...). I understand that Python is best with vectorised problems, but I cannot think of a vectorised solution for this case.

What is the best way to do this? It is really easy and quick in R. I am currently migrating to Python and pandas on the expectation that it is faster, but that doesn't appear to be the case here.

Any advice, please?

Answer 1

You could use the GroupBy.cumcount method, which numbers the rows within each group starting from 0:

In [14]: df['j'] = df.groupby('one').cumcount()

In [15]: df
Out[15]: 
  one  two  j
0   a  1.1  0
1   a  2.5  1
2   a  3.3  2
3   b  2.5  0
4   b  3.3  1
5   b  9.5  2
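
For reference, here is a self-contained sketch of the same idea. It is not part of the original answer. The interleaved row order and the extra column names `j_from_1` and `n_obs` are assumptions added for illustration. It suggests that cumcount does not need the rows to be sorted by subject, that adding 1 gives R-style numbering from 1, and that a per-subject total can be broadcast back to every row with transform.

```python
import pandas as pd

# Toy data similar to the question, but with the subjects interleaved
# (an assumption made here to show that row order does not matter).
df = pd.DataFrame({
    "one": ["a", "b", "a", "b", "a", "b"],
    "two": [1.1, 2.5, 2.5, 3.3, 3.3, 9.5],
})

# 0-based observation number within each subject.
df["j"] = df.groupby("one").cumcount()

# 1-based variant (hypothetical column name).
df["j_from_1"] = df.groupby("one").cumcount() + 1

# Number of non-missing 'two' values per subject, repeated on every row
# of that subject (hypothetical column name).
df["n_obs"] = df.groupby("one")["two"].transform("count")

print(df)
```

Both calls are vectorised over the whole frame, so there is no Python-level loop over rows or subjects.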


Solution 1
2 2014-11-14 15:38:47
