How to count the number of repeated elements in a DataFrame and give each one a count number

import pandas as pd

data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
                     'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'], 
           'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
                       'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12']}
df1 = pd.DataFrame(data)
df1

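For the sample code above, I want to count the repeated items in the "Sample" column within the same location group, and assign each one a repeat number in a new "Repeat Number" column. For example, location group A contains four S1 entries; I want the first S1 to get repeat number 1, the second S1 to get repeat number 2, and so on. Location group B contains three S1 entries; the first S1 gets repeat number 1, the second gets 2, and so on.

The desired result should look like this: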

data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
                      'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'], 
            'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
                        'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12'],
       'Repeat Number':['1', '2', '3' ,'4' ,'1' ,'2' ,'1' ,'2', '1', '1', '1', '2',
                      '1', '2', '3' ,'1' ,'2' ,'3' ,'1' ,'1', '2', '1', '2', '1',]}
df1 = pd.DataFrame(data)
df1

We can try using GroupBy.cumcount:

blocks = df1['Sample'].ne(df1['Sample'].shift()).cumsum()
df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1)
# if you want str type
#df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1).astype(str)

blocks increases by one each time Sample differs from the previous row:

print(blocks)

0      1
1      1
2      1
3      1
4      2
5      2
6      3
7      3
8      4
9      5
10     6
11     7
12     8
13     8
14     8
15     9
16     9
17     9
18    10
19    11
20    11
21    12
22    12
23    13
Name: Sample, dtype: int64
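
To make the intermediate steps easier to follow, here is a minimal sketch on a small hypothetical series (the series `s` and the names `changed` and `repeat` are assumptions for illustration, not part of the question's data):

```python
import pandas as pd

# Hypothetical toy data: S1 appears in two separate runs.
s = pd.Series(['S1', 'S1', 'S2', 'S1', 'S1'])

# True wherever the value differs from the previous row
# (the first row is compared with NaN, so it is always True).
changed = s.ne(s.shift())

# Running total of the change flags gives one id per consecutive run.
blocks = changed.cumsum()

# Number the rows inside each run, starting at 1.
repeat = s.groupby(blocks).cumcount().add(1)

print(pd.DataFrame({'s': s, 'changed': changed, 'blocks': blocks, 'repeat': repeat}))
```

Note that the second run of S1 starts again at 1: this approach numbers consecutive runs, not all occurrences of a value.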

Another option is:

location_blocks = df1['Location'].str[0]
df1['Repeat Number'] = df1.groupby([location_blocks, 'Sample']).cumcount().add(1)

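Output of the first (consecutive-run) approach. Row 11 (Negative at A12) gets 1 here because it is not adjacent to row 9; the location_blocks version gives 2 for that row, matching the desired result.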

print(df1)

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              1
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1

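@ansev's (original) answer only works when the Sample column is already sorted with respect to the Location column, because it compares Sample with Sample.shift().

If that is not the case, you should either call sort_values first, or group by the Sample column together with the result of df1['Location'].str.extract('(^[A-Z])'):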

df1['Repeat Number'] = df1.groupby(['Sample', df1['Location'].str.extract('(^[A-Z])')[0]]).cumcount() + 1
print(df1)

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
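
The difference between the two approaches shows up as soon as the rows are not sorted. The following sketch uses a small hypothetical frame (the data and the names `by_run` and `by_group` are assumptions for illustration) in which S1 is interrupted by S2 inside group A and then continues across the A/B boundary:

```python
import pandas as pd

# Hypothetical unsorted data.
df = pd.DataFrame({
    'Sample':   ['S1', 'S2', 'S1', 'S1', 'S2'],
    'Location': ['A1', 'A2', 'A3', 'B1', 'B2'],
})

# Consecutive-run numbering: depends on row order.
blocks = df['Sample'].ne(df['Sample'].shift()).cumsum()
by_run = df.groupby(blocks).cumcount().add(1)

# Grouping by Sample and the leading letter of Location: independent of row order.
group_letter = df['Location'].str.extract(r'(^[A-Z])')[0]
by_group = df.groupby(['Sample', group_letter]).cumcount().add(1)

print(df.assign(by_run=by_run, by_group=by_group))
```

Here `by_run` restarts at A3 (the run was interrupted) and keeps counting into B1 (the run crosses groups), while `by_group` numbers each (Sample, letter) pair on its own.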

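Here is working code that maintains a dictionary and updates the counts. It only works for single-character groups (i.e. A, B, a, b ... Z, z, etc.).

Code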

import pandas as pd

# running count per (sample + group letter) key
dictionary={}

def countdict(s, l):
    l=l[0]
    if dictionary.get(s+l, 0):
        dictionary[s+l]=dictionary[s+l]+1
        return dictionary[s+l]
    else:
        dictionary[s+l]=1
        return 1

data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
                     'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'], 
           'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
                       'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12']}
df1 = pd.DataFrame(data)

df1['Repeat Number']=df1.apply(lambda vals: countdict(*vals), axis=1)
df1

Output

      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1
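
One thing to watch with a module-level dictionary is that it keeps its counts between runs, so applying the function a second time continues from the old totals. A possible variation, sketched below under the same single-letter-group assumption, keeps the counter local to a helper function (the name `running_counts` and the toy frame are hypothetical, not from the answer above):

```python
from collections import defaultdict

import pandas as pd


def running_counts(samples, locations):
    """Return the 1-based occurrence number of each (sample, group letter) pair."""
    seen = defaultdict(int)          # fresh counter on every call
    numbers = []
    for sample, location in zip(samples, locations):
        key = (sample, location[0])  # group is the first character of Location
        seen[key] += 1
        numbers.append(seen[key])
    return numbers


# Hypothetical toy data.
df = pd.DataFrame({
    'Sample':   ['S1', 'S1', 'S2', 'S1'],
    'Location': ['A1', 'A2', 'A3', 'B1'],
})
df['Repeat Number'] = running_counts(df['Sample'], df['Location'])
print(df)
```

Calling `running_counts` again on the same columns gives the same numbers, since no state survives between calls.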

Here is an approach using .factorize(), .groupby(), and .rank(). I created temporary columns to simplify the groupby() statement.

# pull 'A' or 'B' out of the Location column
df1['location_group'] = df1['Location'].str.extract(r'([A-Za-z]+)')

# convert Sample to integer
df1['x'] = df1['Sample'].factorize()[0]

# use .rank(method='first') so that every entry has a unique number
df1['Repeat Number'] = (
    df1.groupby(['location_group', 'Sample'])['x'].rank(method='first')
    .astype(int))

# clean up
df1 = df1.drop(columns=['location_group', 'x'])

# show results
print(df1)


      Sample Location  Repeat Number
0         S1       A1              1
1         S1       A2              2
2         S1       A3              3
3         S1       A4              4
4         S2       A5              1
5         S2       A6              2
6         S3       A7              1
7         S3       A8              2
8         S4       A9              1
9   Negative      A10              1
10  Positive      A11              1
11  Negative      A12              2
12        S1       B1              1
13        S1       B2              2
14        S1       B3              3
15        S2       B4              1
16        S2       B5              2
17        S2       B6              3
18        S3       B7              1
19        S4       B8              1
20        S4       B9              2
21  Positive      B10              1
22  Positive      B11              2
23  Negative      B12              1

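Then I named the expected result df2 and verified:

# 'Repeat Number' is stored as strings in the expected frame, so cast it before comparing
assert (df1 == df2.astype({'Repeat Number': int})).all().all()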
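
A comparison of this kind can also be written with pandas.testing.assert_frame_equal, which reports the first mismatch instead of a bare AssertionError. The sketch below uses a small hypothetical expected frame (the names `expected` and `result` and the three-row data are assumptions for illustration) whose "Repeat Number" column holds strings, as in the question:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical expected result, with Repeat Number stored as strings.
expected = pd.DataFrame({
    'Sample':        ['S1', 'S1', 'S2'],
    'Location':      ['A1', 'A2', 'A3'],
    'Repeat Number': ['1', '2', '1'],
})

# Recompute the repeat numbers from the first two columns.
result = expected[['Sample', 'Location']].copy()
letter = result['Location'].str[0]
result['Repeat Number'] = result.groupby([letter, 'Sample']).cumcount().add(1)

# Cast the expected strings to integers so that dtypes match, then compare.
assert_frame_equal(result, expected.astype({'Repeat Number': 'int64'}))
```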
