How to count repeated elements in a DataFrame and give each one a running count number

Question

import pandas as pd

data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
                     'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'], 
           'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
                       'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12']}
df1 = pd.DataFrame(data)
df1

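For the sample code above, I want to count the repeated items in the 'Sample' column within the same location group, and assign each one a repeat number in a new 'Repeat Number' column. For example, location group A contains 4 S1 entries, so the first S1 should get repeat number 1, the second S1 repeat number 2, and so on. Location group B contains 3 S1 entries, so the first S1 gets repeat number 1, the second S1 gets 2, and so on.

The ideal result should look like this: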

data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
                     'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'], 
           'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
                       'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12'],
           'Repeat Number':['1', '2', '3' ,'4' ,'1' ,'2' ,'1' ,'2', '1', '1', '1', '2',
                            '1', '2', '3' ,'1' ,'2' ,'3' ,'1' ,'1', '2', '1', '2', '1']}
df1 = pd.DataFrame(data)
df1

Answer 1

We can try GroupBy.cumcount.

blocks = df1['Sample'].ne(df1['Sample'].shift()).cumsum()
df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1)
# if you want str type
#df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1).astype(str)

blocks increments by one every time Sample differs from the previous row:

print(blocks)

0      1
1      1
2      1
3      1
4      2
5      2
6      3
7      3
8      4
9      5
10     6
11     7
12     8
13     8
14     8
15     9
16     9
17     9
18    10
19    11
20    11
21    12
22    12
23    13
Name: Sample, dtype: int64
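
As a minimal sketch of how the block labels are built, the steps can be taken one at a time on a toy Series. The data and the names `sample`, `is_new_block`, and `repeat` are made up for illustration and are not part of the answer.

```python
import pandas as pd

# Toy data, assumed for illustration only.
sample = pd.Series(["S1", "S1", "S2", "S1"], name="Sample")

# True wherever a value differs from the row above it.
# The first row is compared with NaN, so it always starts a block.
is_new_block = sample.ne(sample.shift())

# A running total of the True flags gives one label per consecutive run.
blocks = is_new_block.cumsum()

# cumcount numbers the rows inside each run from 0, so add 1.
repeat = sample.groupby(blocks).cumcount().add(1)

print(blocks.tolist())   # [1, 1, 2, 3]
print(repeat.tolist())   # [1, 2, 1, 1]
```

Note that the last S1 starts again at 1: it is not adjacent to the earlier S1 rows, so it falls into a new block.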

Another option is to group by the first letter of Location together with Sample:

location_blocks = df1['Location'].str[0]
df1['Repeat Number'] = df1.groupby([location_blocks, 'Sample']).cumcount().add(1)

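Output (this table comes from the first, consecutive-block approach; row 11 gets 1 rather than 2 because the two Negative rows in group A are not adjacent)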

print(df1)

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              1
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
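
The two snippets in this answer do not always agree. A small comparison, using only the Negative / Positive / Negative rows from group A of the question, is sketched below. The variable names `by_block` and `by_group` are illustrative.

```python
import pandas as pd

# Rows 9-11 of the question's data.
df = pd.DataFrame({
    "Sample": ["Negative", "Positive", "Negative"],
    "Location": ["A10", "A11", "A12"],
})

# Approach 1: number rows within runs of identical consecutive samples.
blocks = df["Sample"].ne(df["Sample"].shift()).cumsum()
by_block = df.groupby(blocks).cumcount().add(1)

# Approach 2: number rows within each (location letter, sample) pair.
by_group = df.groupby([df["Location"].str[0], "Sample"]).cumcount().add(1)

print(by_block.tolist())  # [1, 1, 1]
print(by_group.tolist())  # [1, 1, 2]
```

Only the second approach gives the second Negative the repeat number 2 that the question asks for.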

Answer 2

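@ansev's (original) answer only works when the Sample column is already sorted relative to the Location column, because it compares Sample with Sample.shift().

If that is not the case, you should either call sort_values first, or group by the Sample column together with the result of df1['Location'].str.extract('(^[A-Z])'):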

df1['Repeat Number'] = df1.groupby(['Sample', df1['Location'].str.extract('(^[A-Z])')[0]]).cumcount() + 1
print(df1)

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
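
The sort_values alternative mentioned in this answer is not shown, so here is one possible sketch of it. The toy data, the helper columns `group` and `pos`, and the name `ordered` are assumptions made for illustration; `pos` is added so that ties keep their original order without relying on the sort algorithm.

```python
import pandas as pd

# Toy data in which equal samples are not adjacent.
df = pd.DataFrame({
    "Sample": ["S1", "Negative", "S1", "Negative", "S1"],
    "Location": ["A1", "A2", "A3", "A4", "B1"],
})

# Sort by location letter, then sample, then original position.
ordered = df.assign(
    group=df["Location"].str[0],
    pos=range(len(df)),
).sort_values(["group", "Sample", "pos"])

# After sorting, a new block starts whenever the group or the sample changes.
new_block = (
    ordered["group"].ne(ordered["group"].shift())
    | ordered["Sample"].ne(ordered["Sample"].shift())
)

# The result keeps the original index labels, so assignment realigns it.
df["Repeat Number"] = ordered.groupby(new_block.cumsum()).cumcount().add(1)

print(df["Repeat Number"].tolist())  # [1, 1, 2, 2, 1]
```

Grouping directly on the two keys, as in the answer's own code, reaches the same result without the sort.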

Answer 3

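Here is working code that maintains a dictionary and updates the counts. It only works for single-character groups (i.e. A, B, a, b ... Z, z, etc.).

Code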

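import pandas as pd

# Running count per key; reset this to {} before re-running on new data.
dictionary = {}

def countdict(s, l):
    # Key on the sample name plus the first character of the location.
    key = s + l[0]
    dictionary[key] = dictionary.get(key, 0) + 1
    return dictionary[key]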

data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
                     'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'], 
           'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
                       'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12']}
df1 = pd.DataFrame(data)

df1['Repeat Number']=df1.apply(lambda vals: countdict(*vals), axis=1)
df1

Output

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
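
The dictionary counter above depends on module-level state and on a one-character group. A possible variant that keeps the counts local and accepts multi-letter prefixes is sketched below. The function name `repeat_numbers` is hypothetical, and the sketch assumes every location starts with at least one letter.

```python
import re
from collections import defaultdict

import pandas as pd


def repeat_numbers(samples, locations):
    """Return a running count per (sample, location prefix) pair."""
    counts = defaultdict(int)  # local state, so repeated calls start fresh
    result = []
    for sample, location in zip(samples, locations):
        # Leading letters of the location, e.g. 'A' from 'A10', 'AA' from 'AA1'.
        # Assumes a match exists; a location without leading letters would fail here.
        prefix = re.match(r"[A-Za-z]+", location).group(0)
        counts[(sample, prefix)] += 1
        result.append(counts[(sample, prefix)])
    return result


df = pd.DataFrame({
    "Sample": ["S1", "S1", "S1", "S1"],
    "Location": ["A1", "AA1", "A2", "AA2"],
})
df["Repeat Number"] = repeat_numbers(df["Sample"], df["Location"])
print(df["Repeat Number"].tolist())  # [1, 1, 2, 2]
```

Using a tuple key instead of string concatenation also avoids accidental collisions between different sample and prefix combinations.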

Answer 4

Here is an approach that uses .factorize() and .groupby().rank(). I created temporary columns to simplify the groupby() statement.

# pull 'A' or 'B' out of the Location column
df1['location_group'] = df1['Location'].str.extract(r'([A-Za-z]+)')

# convert Sample to integer
df1['x'] = df1['Sample'].factorize()[0]

# use .rank(method='first') so that every entry has a unique number
df1['Repeat Number'] = (
    df1.groupby(['location_group', 'Sample'])['x'].rank(method='first')
    .astype(int))

# clean up
df1 = df1.drop(columns=['location_group', 'x'])

# show results
print(df1)


      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
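
To see why method='first' matters here, a small sketch with made-up integer codes compares it with the default ranking. Inside each group every value is identical, so the default method assigns the same averaged rank to all rows, while method='first' breaks ties by order of appearance.

```python
import pandas as pd

# Illustrative integer codes, similar to what factorize() would return.
codes = pd.Series([0, 0, 1, 0, 1])

# Ties broken by position: a running count within each group.
first = codes.groupby(codes).rank(method="first").astype(int)

# Default method='average': every tied row shares one rank.
average = codes.groupby(codes).rank()

print(first.tolist())    # [1, 2, 1, 3, 2]
print(average.tolist())  # [2.0, 2.0, 1.5, 2.0, 1.5]
```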

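Then I named the expected result df2 and verified:

# .all must be called; the bare attribute is always truthy.
# The question stores the expected Repeat Number as strings, so cast before comparing.
assert (df1['Repeat Number'].astype(str) == df2['Repeat Number']).all()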
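
For a stricter check of the whole frame, pandas.testing.assert_frame_equal can be used. The sketch below uses small made-up frames named `result` and `expected`, and casts the expected repeat numbers from strings to integers first, since the question stores them as strings.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up frames standing in for the computed and the expected result.
result = pd.DataFrame({
    "Sample": ["S1", "S1"],
    "Repeat Number": pd.Series([1, 2], dtype="int64"),
})
expected = pd.DataFrame({
    "Sample": ["S1", "S1"],
    "Repeat Number": ["1", "2"],
})

# Align the dtype before comparing; otherwise the comparison fails on str vs int.
expected["Repeat Number"] = expected["Repeat Number"].astype("int64")

# Raises AssertionError with a detailed message if anything differs.
assert_frame_equal(result, expected)
```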


Answer 1: 2, 2020-08-30 13:44:58
Answer 2: 2, accepted, 2020-08-30 13:56:25
Answer 3: 0, 2020-08-30 13:53:28
Answer 4: 0, 2020-08-30 18:07:25