I want to count the number of observations within each subject in a pandas DataFrame
I am quite new to using pandas and Python in general.
I have a hierarchical data set with several subjects, each of whom has some number of observations. The total df is about half a million rows.
I want to calculate the observation number...
## toy problem
from pandas import DataFrame, Series

d = {'one': Series(['a', 'a', 'a', 'b', 'b', 'b'], index=[0, 1, 2, 3, 4, 5]),
     'two': Series([1.1, 2.5, 3.3, 2.5, 3.3, 9.5], index=[0, 1, 2, 3, 4, 5])}
df = DataFrame(d)
# for each subject, j runs over the observation numbers 0, 1, 2, ...
for i in df.one.unique():
    for j in range(0, len(df[df.one == i])):
        print(j)
So I want to assign j to a column for each row. I have no problem calculating j, but I cannot figure out how to assign it. I have tried using iloc, which is incredibly slow, and writing to a list and then joining it to the df, which is also really slow (currently running for over 30 minutes and counting...). I understand that Python is best with vectorised problems, but I cannot think of a vectorised solution for this case.
What is the best way to do this? It is really easy and quick in R. I am currently migrating to Python and pandas in the expectation that it is faster, but that doesn't appear to be the case here.
Any advice please?
You could use the GroupBy.cumcount method:
In [14]: df['j'] = df.groupby('one').cumcount()

In [15]: df
Out[15]:
  one  two  j
0   a  1.1  0
1   a  2.5  1
2   a  3.3  2
3   b  2.5  0
4   b  3.3  1
5   b  9.5  2
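
For reference, here is a self-contained sketch of the same idea that can be pasted into a fresh session. It rebuilds the toy frame from the question and numbers the rows within each group with cumcount. The extra columns `j_from_1` and `n_obs` are illustrative names that are not in the original post: they assume you might also want a 1-based observation number or the total number of observations per subject, which `transform('size')` should give you.

```python
import pandas as pd

# Toy data from the question: two subjects, three observations each.
df = pd.DataFrame({
    "one": ["a", "a", "a", "b", "b", "b"],
    "two": [1.1, 2.5, 3.3, 2.5, 3.3, 9.5],
})

# 0-based running observation number within each subject.
df["j"] = df.groupby("one").cumcount()

# Hypothetical extras (not part of the original answer):
# a 1-based observation number ...
df["j_from_1"] = df.groupby("one").cumcount() + 1
# ... and the total number of observations for the row's subject.
df["n_obs"] = df.groupby("one")["two"].transform("size")

print(df)
```

Both calls work on whole groups at once, so there is no Python-level loop over rows, which is what made the iloc approach slow.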