
Count number of unique rows in pandas

I want to number the unique rows in a pandas DataFrame and add the result as a new column, count_index, as in the example below. Put another way, duplicate rows should share the same index value.

import pandas as pd
df = {'A': [ 8,8,9,9,9,12,12,13,15,15,15],
      'B': [ 1,1,2,2,2,11,11,3,4,4,4],
      'C': [ 10,10,20,20,20,101,101,30,40,40,40],
      'D': [81,81,92,92,92,121,121,134,150,150,150]}
df = pd.DataFrame(df)

print(df.groupby(['A','B','C','D']).size())
#####################################################
# input
   A   B    C    D
   8   1   10   81
   8   1   10   81
   9   2   20   92
   9   2   20   92
   9   2   20   92
  12  11  101  121
  12  11  101  121
  13   3   30  134
  15   4   40  150
  15   4   40  150
  15   4   40  150
#####################################################
# expected output
   A   B    C    D  Count_index
   8   1   10   81            1
   8   1   10   81            1
   9   2   20   92            2
   9   2   20   92            2
   9   2   20   92            2
  12  11  101  121            3
  12  11  101  121            3
  13   3   30  134            4
  15   4   40  150            5
  15   4   40  150            5
  15   4   40  150            5

You can do this by inverting the result of .duplicated(), which leaves True only at the first occurrence of each row. A cumulative sum over that mask then keeps a running count of the unique rows encountered so far.

df['count_index'] = (~df.duplicated(keep="first")).cumsum()

print(df)
     A   B    C    D  count_index
0    8   1   10   81            1
1    8   1   10   81            1
2    9   2   20   92            2
3    9   2   20   92            2
4    9   2   20   92            2
5   12  11  101  121            3
6   12  11  101  121            3
7   13   3   30  134            4
8   15   4   40  150            5
9   15   4   40  150            5
10  15   4   40  150            5
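
To see why this works, the intermediate steps can be broken out as below. This is a minimal sketch on the question's data; the names is_dup, is_first, cols and group_label are illustrative only and not part of the original answer. The last line shows a different technique, groupby(..., sort=False).ngroup(), which labels each distinct row by order of first appearance. It is included on the assumption that duplicates might not always be adjacent.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [8, 8, 9, 9, 9, 12, 12, 13, 15, 15, 15],
    'B': [1, 1, 2, 2, 2, 11, 11, 3, 4, 4, 4],
    'C': [10, 10, 20, 20, 20, 101, 101, 30, 40, 40, 40],
    'D': [81, 81, 92, 92, 92, 121, 121, 134, 150, 150, 150],
})

# True for every row that repeats an earlier row
is_dup = df.duplicated(keep="first")
# True only at the first occurrence of each distinct row
is_first = ~is_dup
# Running total of first occurrences seen so far
count_index = is_first.cumsum()

# Alternative technique (assumption: one label per distinct row is wanted,
# even when its duplicates are not adjacent)
cols = ['A', 'B', 'C', 'D']
group_label = df.groupby(cols, sort=False).ngroup() + 1
```

On the question's data both give the same labels. They differ when a row reappears later: for rows x, y, x the cumulative sum yields 1, 2, 2, while ngroup yields 1, 2, 1.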


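You can also compare each row with the one above it, using either diff().ne(0) or df.ne(df.shift()), and take the cumulative sum of the rows that changed. A row is new if any of its columns differs from the previous row, so reduce with any(axis=1) rather than all(axis=1). This assumes duplicate rows are adjacent, as in the example, and diff() requires numeric columns.

df.diff().ne(0).any(axis=1).cumsum()

or

df.ne(df.shift()).any(axis=1).cumsum()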

Output:

0     1
1     1
2     2
3     2
4     2
5     3
6     3
7     4
8     5
9     5
10    5
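
The choice of row-wise reduction matters once only some columns change between rows. The sketch below uses a small hypothetical frame (small is not from the question) to compare any(axis=1) with all(axis=1) on the shift-based comparison.

```python
import pandas as pd

# Hypothetical data: between rows 1 and 2 only column 'B' changes,
# and between rows 2 and 3 only column 'A' changes
small = pd.DataFrame({'A': [8, 8, 8, 9], 'B': [1, 1, 2, 2]})

# Element-wise comparison with the previous row; the first row is
# compared against NaN, so it is always marked as changed
changed = small.ne(small.shift())

# any: a new row starts when at least one column changes
run_any = changed.any(axis=1).cumsum()
# all: a new row starts only when every column changes at once
run_all = changed.all(axis=1).cumsum()
```

Here run_any is 1, 1, 2, 3 while run_all stays at 1, 1, 1, 1, because no row after the first changes in both columns. On the question's data every column changes together, so both reductions happen to agree.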
