
Counting occurrences of IDs in pandas dataframe

I have a few dataframes, a few thousand rows each, that look similar to this:

heifers_df

       id   y     ins               
200316157 123 2004121 
200316157 456 2004121 
200316157 789 2004121 
200519776 456 2007234 
200519776 789 2007234 
200812334 123 2010333 
200812334 789 2010333 
200812334 345 2010333 
200812334 567 2010333 

I want to use Python (pandas or NumPy?) to count the occurrences of each ID: both the total number of occurrences (T) and the running number of each occurrence (No):

heifers_df

       id    y      ins  T  No          
200316157  123  2004121  3   1
200316157  456  2004121  3   2
200316157  789  2004121  3   3
200519776  456  2007234  2   1
200519776  789  2007234  2   2
200812334  123  2010333  4   1
200812334  789  2010333  4   2
200812334  345  2010333  4   3
200812334  567  2010333  4   4

I've gotten help with this problem in Fortran (see "Counting frequency of variables in text data in Fortran"), but now I'm trying to accomplish the same in Python.

Based on the Fortran code and my beginner knowledge of Python and pandas, this is what I've tried doing with the first dataframe:

i1 = 0
# set i0, i1
#  i0: line where specific user id starts
#  i1: line where specific user id ends
for i in range(len(heifers_df)) :
    i0 = i1 + 1
    same_id = True
    while same_id == True :
        heifers_df.loc[
            heifers_df["id"[i]] != heifers_df["id"[i0]],     #How do I reference each row within the column?
            same_id ] = False
    i1 = i
    heifers_df["T"] = i1-i0+1
    heifers_df["No"] = i-i0+1

But when I run this I get an error:

....  heifers_df["id"[i]] != heifers_df["id"[i0]],
     KeyError: 'i'

Am I going in the wrong direction with this?

I've tried to search for similar problems, and I've seen group-by and counting operations, but I haven't seen one that attaches the result to the IDs in question and numbers each occurrence. Any help would be much appreciated.

IIUC, and assuming all rows for each unique id can be sorted into contiguous blocks, you can use groupby with transform and cumcount:

df['T'] = df.groupby('id')['id'].transform('count')
df['No'] = df.groupby('id')['id'].cumcount() + 1
df

Output:

          id    y      ins  T  No
0  200316157  123  2004121  3   1
1  200316157  456  2004121  3   2
2  200316157  789  2004121  3   3
3  200519776  456  2007234  2   1
4  200519776  789  2007234  2   2
5  200812334  123  2010333  4   1
6  200812334  789  2010333  4   2
7  200812334  345  2010333  4   3
8  200812334  567  2010333  4   4
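
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question under the name heifers_df. The by_id and first_id variables are just illustrative names, not anything from the original post.

```python
import pandas as pd

# Sample data copied from the question.
heifers_df = pd.DataFrame({
    "id":  [200316157, 200316157, 200316157, 200519776, 200519776,
            200812334, 200812334, 200812334, 200812334],
    "y":   [123, 456, 789, 456, 789, 123, 789, 345, 567],
    "ins": [2004121, 2004121, 2004121, 2007234, 2007234,
            2010333, 2010333, 2010333, 2010333],
})

# Group the id column by its own values (by_id is an illustrative name).
by_id = heifers_df.groupby("id")["id"]

# T: size of each id's group, broadcast back onto every row of that group.
heifers_df["T"] = by_id.transform("count")

# No: 1-based running position of each row within its id group.
heifers_df["No"] = by_id.cumcount() + 1

# Reading one cell by row position, which is what the loop in the
# question was trying to do with heifers_df["id"[i]].
first_id = heifers_df["id"].iloc[0]

print(heifers_df)
```

On the KeyError in the question: "id"[i] indexes the string "id" itself, so with i = 0 it evaluates to "i", and pandas then looks for a column named "i". To read a single value by position, heifers_df["id"].iloc[i] is the usual form, although with groupby no explicit row loop should be needed here.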
