Counting occurrences of IDs in pandas dataframe
I have a few dataframes, a few thousand rows each, that look similar to this:
heifers_df
id y ins
200316157 123 2004121
200316157 456 2004121
200316157 789 2004121
200519776 456 2007234
200519776 789 2007234
200812334 123 2010333
200812334 789 2010333
200812334 345 2010333
200812334 567 2010333
I want to use Python (pandas or numpy?) to count the occurrences of each ID: both the total number of occurrences (T) and the running number of each occurrence (No):
heifers_df
id y ins T No
200316157 123 2004121 3 1
200316157 456 2004121 3 2
200316157 789 2004121 3 3
200519776 456 2007234 2 1
200519776 789 2007234 2 2
200812334 123 2010333 4 1
200812334 789 2010333 4 2
200812334 345 2010333 4 3
200812334 567 2010333 4 4
I've gotten help with this problem in Fortran (Counting frequency of variables in text data in Fortran), but now I'm trying to accomplish the same in Python.
Based on the Fortran code and my beginner knowledge of Python and pandas, this is what I've tried doing with the first dataframe:
i1 = 0
# set i0, i1
# i0: line where specific user id starts
# i1: line where specific user id ends
for i in range(len(heifers_df)) :
i0 = i1 + 1
same_id = True
while same_id == True :
heifers_df.loc[
heifers_df["id"[i]] != heifers_df["id"[i0]], #How do I reference each row within the column?
same_id ] = False
i1 = i
heifers_df["T"] = i1-i0+1
heifers_df["No"] = i-i0+1
But when I run this I get an error:
.... heifers_df["id"[i]] != heifers_df["id"[i0]],
KeyError: 'i'
Am I going in the wrong direction with this?
I've tried to search for similar problems and I've seen group-by and counting operations, but I haven't seen one that glues the result to the IDs in question and numbers each occurrence. Any help would be much appreciated.
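On the KeyError itself: the expression "id"[i] indexes the string "id" rather than the column, so with i = 0 it evaluates to "i", and pandas then looks for a column named i. A minimal sketch of reading one row's value from a column by position, assuming a small stand-in frame (the three rows below are only an illustration, not the full data):

```python
import pandas as pd

# Hypothetical three-row stand-in for heifers_df (illustration only)
heifers_df = pd.DataFrame({"id": [200316157, 200316157, 200519776]})

i = 0

# "id"[i] indexes the string "id" itself, giving the label "i",
# which is why heifers_df["id"[i]] raises KeyError: 'i'
bad_label = "id"[i]

# Select the column first, then the row by position with .iloc
first_id = heifers_df["id"].iloc[i]

# Compare one row's id with the next row's id
same_as_next = heifers_df["id"].iloc[i] == heifers_df["id"].iloc[i + 1]
```

Row-by-row loops like this are rarely needed in pandas, but this is the indexing form the loop was reaching for.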
IIUC, and provided the rows for each unique id can be sorted into contiguous blocks, this should work:
df['T'] = df.groupby('id')['id'].transform('count')
df['No'] = df.groupby('id')['id'].cumcount() + 1
df
Output:
id y ins T No
0 200316157 123 2004121 3 1
1 200316157 456 2004121 3 2
2 200316157 789 2004121 3 3
3 200519776 456 2007234 2 1
4 200519776 789 2007234 2 2
5 200812334 123 2010333 4 1
6 200812334 789 2010333 4 2
7 200812334 345 2010333 4 3
8 200812334 567 2010333 4 4
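For anyone who wants to reproduce this end to end, here is a self-contained sketch that rebuilds the sample frame from the question and applies the same two groupby calls. The name heifers_df comes from the question; the way the frame is constructed here is only an assumption about how the data is loaded.

```python
import pandas as pd

# Sample data copied from the question
heifers_df = pd.DataFrame({
    "id": [200316157] * 3 + [200519776] * 2 + [200812334] * 4,
    "y": [123, 456, 789, 456, 789, 123, 789, 345, 567],
    "ins": [2004121] * 3 + [2007234] * 2 + [2010333] * 4,
})

# Group the id column by its own values
groups = heifers_df.groupby("id")["id"]

# T: size of each id's group, broadcast back onto every row of that group
heifers_df["T"] = groups.transform("count")

# No: 1-based running number of each row within its id group
heifers_df["No"] = groups.cumcount() + 1

print(heifers_df)
```

cumcount numbers rows in the order they appear in the frame, so if the numbering should follow some other column, sort by that column first.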