How to count repeated elements in a DataFrame and give each one a count number
import pandas as pd

data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'],
'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12']}
df1 = pd.DataFrame(data)
df1
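For the sample code above, I want to count the repeated items in the "Sample" column within the same location group and assign each one a repeat number in a new "Repeat Number" column. For example, location group A contains four S1 entries; I want the first S1 to get repeat number 1, the second S1 to get repeat number 2, and so on. Location group B contains three S1 entries; the first S1 gets repeat number 1, the second gets 2, and so on.
The desired result should look like this: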
data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'],
'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12'],
'Repeat Number':['1', '2', '3' ,'4' ,'1' ,'2' ,'1' ,'2', '1', '1', '1', '2',
'1', '2', '3' ,'1' ,'2' ,'3' ,'1' ,'1', '2', '1', '2', '1',]}
df1 = pd.DataFrame(data)
df1
We can try using GroupBy.cumcount.
blocks = df1['Sample'].ne(df1['Sample'].shift()).cumsum()
df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1)
# if you want str type
#df1['Repeat Number'] = df1.groupby(blocks).cumcount().add(1).astype(str)
The blocks value increments every time Sample differs from the previous row:
print(blocks)
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 5
10 6
11 7
12 8
13 8
14 8
15 9
16 9
17 9
18 10
19 11
20 11
21 12
22 12
23 13
Name: Sample, dtype: int64
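To make the block-labelling step easier to follow, here is a minimal sketch on a toy Series. The values are invented for illustration and are not taken from the question's data.

```python
import pandas as pd

# Toy data, invented for illustration only
s = pd.Series(['a', 'a', 'b', 'a', 'a', 'a'])

# True wherever the value differs from the previous row
# (the first row compares against NaN, so it is always True)
changed = s.ne(s.shift())

# A running total of the change flags gives one label per consecutive run
blocks = changed.cumsum()

# Number the rows within each run, starting from 1
run_position = s.groupby(blocks).cumcount().add(1)

print(blocks.tolist())
print(run_position.tolist())
```

Note that the second run of 'a' gets a new block label, so its numbering restarts at 1.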
Another option is:
location_blocks = df1['Location'].str[0]
df1['Repeat Number'] = df1.groupby([location_blocks, 'Sample']).cumcount().add(1)
Output
print(df1)
Sample Location Repeat Number
0 S1 A1 1
1 S1 A2 2
2 S1 A3 3
3 S1 A4 4
4 S2 A5 1
5 S2 A6 2
6 S3 A7 1
7 S3 A8 2
8 S4 A9 1
9 Negative A10 1
10 Positive A11 1
11 Negative A12 2
12 S1 B1 1
13 S1 B2 2
14 S1 B3 3
15 S2 B4 1
16 S2 B5 2
17 S2 B6 3
18 S3 B7 1
19 S4 B8 1
20 S4 B9 2
21 Positive B10 1
22 Positive B11 2
23 Negative B12 1
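@ansev's (original) answer only works when the Sample column is already sorted with respect to the Location column, because it compares Sample with Sample.shift().
If that is not the case, you should either sort first with sort_values, or group by the Sample column together with the result of df1['Location'].str.extract('(^[A-Z])'):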
df1['Repeat Number'] = df1.groupby(['Sample', df1['Location'].str.extract('(^[A-Z])')[0]]).cumcount() + 1
print(df1)
Sample Location Repeat Number
0 S1 A1 1
1 S1 A2 2
2 S1 A3 3
3 S1 A4 4
4 S2 A5 1
5 S2 A6 2
6 S3 A7 1
7 S3 A8 2
8 S4 A9 1
9 Negative A10 1
10 Positive A11 1
11 Negative A12 2
12 S1 B1 1
13 S1 B2 2
14 S1 B3 3
15 S2 B4 1
16 S2 B5 2
17 S2 B6 3
18 S3 B7 1
19 S4 B8 1
20 S4 B9 2
21 Positive B10 1
22 Positive B11 2
23 Negative B12 1
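The difference between the two strategies shows up on unsorted input. The sketch below uses a small, deliberately shuffled frame (hypothetical data, not from the question) and compares consecutive-run numbering with group-based numbering.

```python
import pandas as pd

# Hypothetical, deliberately unsorted data (not from the question)
df = pd.DataFrame({
    'Sample':   ['S1', 'S2', 'S1', 'S1', 'S2'],
    'Location': ['A1', 'A2', 'A3', 'B1', 'A4'],
})

# Consecutive-run numbering: restarts whenever Sample changes between rows
blocks = df['Sample'].ne(df['Sample'].shift()).cumsum()
by_runs = df.groupby(blocks).cumcount().add(1)

# Group-based numbering: counts per (Sample, location letter), whatever the row order
letter = df['Location'].str[0]
by_group = df.groupby(['Sample', letter]).cumcount().add(1)

print(by_runs.tolist())
print(by_group.tolist())
```

Here the run-based version misses the second S1 in group A (row 2) and lets a run spill from A3 into B1, while the group-based version counts each (Sample, letter) pair independently.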
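Here is working code that maintains a dictionary and updates the counts. It only works for single-character location groups (e.g. A, B, a, b, ..., Z, z).
Code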
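# running count per (sample + location letter) key
dictionary = {}

def countdict(s, l):
    # keep only the first character of the location, e.g. 'A' from 'A10'
    l = l[0]
    if dictionary.get(s + l, 0):
        dictionary[s + l] = dictionary[s + l] + 1
        return dictionary[s + l]
    else:
        dictionary[s + l] = 1
        return 1
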
data = {'Sample':['S1', 'S1', 'S1' ,'S1' ,'S2' ,'S2' ,'S3' ,'S3', 'S4', 'Negative', 'Positive', 'Negative',
'S1', 'S1', 'S1' ,'S2' ,'S2' ,'S2' ,'S3' ,'S4', 'S4', 'Positive', 'Positive', 'Negative'],
'Location':['A1', 'A2', 'A3' ,'A4' ,'A5' ,'A6' ,'A7' ,'A8', 'A9', 'A10', 'A11', 'A12',
'B1', 'B2', 'B3' ,'B4' ,'B5' ,'B6' ,'B7' ,'B8', 'B9', 'B10', 'B11', 'B12']}
df1 = pd.DataFrame(data)
df1['Repeat Number']=df1.apply(lambda vals: countdict(*vals), axis=1)
df1
Output
Sample Location Repeat Number
0 S1 A1 1
1 S1 A2 2
2 S1 A3 3
3 S1 A4 4
4 S2 A5 1
5 S2 A6 2
6 S3 A7 1
7 S3 A8 2
8 S4 A9 1
9 Negative A10 1
10 Positive A11 1
11 Negative A12 2
12 S1 B1 1
13 S1 B2 2
14 S1 B3 3
15 S2 B4 1
16 S2 B5 2
17 S2 B6 3
18 S3 B7 1
19 S4 B8 1
20 S4 B9 2
21 Positive B10 1
22 Positive B11 2
23 Negative B12 1
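As a variation on the dictionary idea, the sketch below keeps the counter local to a function (so calling it twice does not carry counts over) and uses a regular expression so that multi-letter location prefixes also work. The helper name repeat_numbers, the sample data, and the assumption that every location starts with at least one letter are all hypothetical.

```python
import re
from collections import defaultdict

import pandas as pd


def repeat_numbers(samples, locations):
    """Return 1-based occurrence counts per (sample, location prefix) pair."""
    counts = defaultdict(int)  # local state, so repeated calls start fresh
    result = []
    for sample, location in zip(samples, locations):
        # Leading letters of the location, e.g. 'AB' from 'AB12'
        # (assumes every location starts with at least one letter)
        prefix = re.match(r'[A-Za-z]+', location).group()
        counts[(sample, prefix)] += 1
        result.append(counts[(sample, prefix)])
    return result


# Hypothetical data with a two-letter location group
df = pd.DataFrame({
    'Sample':   ['S1', 'S1', 'S2', 'S1', 'S1'],
    'Location': ['A1', 'A2', 'A3', 'AB1', 'AB2'],
})
df['Repeat Number'] = repeat_numbers(df['Sample'], df['Location'])
print(df)
```

Using a tuple as the key avoids accidental collisions that string concatenation could produce when sample names and prefixes overlap.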
Here is an approach that uses .factorize() and .groupby().rank(). I created temporary columns to simplify the groupby() statement.
# pull 'A' or 'B' out of the Location column
df1['location_group'] = df1['Location'].str.extract(r'([A-Za-z]+)')
# convert Sample to integer
df1['x'] = df1['Sample'].factorize()[0]
# use .rank(method='first') so that every entry has a unique number
df1['Repeat Number'] = (
df1.groupby(['location_group', 'Sample'])['x'].rank(method='first')
.astype(int))
# clean up
df1 = df1.drop(columns=['location_group', 'x'])
# show results
print(df1)
Sample Location Repeat Number
0 S1 A1 1
1 S1 A2 2
2 S1 A3 3
3 S1 A4 4
4 S2 A5 1
5 S2 A6 2
6 S3 A7 1
7 S3 A8 2
8 S4 A9 1
9 Negative A10 1
10 Positive A11 1
11 Negative A12 2
12 S1 B1 1
13 S1 B2 2
14 S1 B3 3
15 S2 B4 1
16 S2 B5 2
17 S2 B6 3
18 S3 B7 1
19 S4 B8 1
20 S4 B9 2
21 Positive B10 1
22 Positive B11 2
23 Negative B12 1
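Because every row in a (group, Sample) bucket shares the same factorized value, rank(method='first') simply numbers rows in order of appearance. The following sketch, on hypothetical data with an invented 'group' column, checks that this matches cumcount() + 1.

```python
import pandas as pd

# Hypothetical data; 'group' stands in for the extracted location letter
df = pd.DataFrame({
    'Sample': ['S1', 'S2', 'S1', 'S1'],
    'group':  ['A',  'A',  'A',  'B'],
})

# Integer code per distinct Sample value
df['x'] = df['Sample'].factorize()[0]

# Ties within a bucket are broken by order of appearance
ranked = df.groupby(['group', 'Sample'])['x'].rank(method='first').astype(int)

# The same numbering obtained directly from cumcount
counted = df.groupby(['group', 'Sample']).cumcount().add(1)

print(ranked.tolist())
print(counted.tolist())
```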
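I then named the expected result df2 and verified that the two frames match. Note that .all has to be called (a bare method reference is always truthy), and the expected Repeat Number column holds strings, so both frames are cast to str before comparing:
assert (df1.astype(str) == df2.astype(str)).all().all()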
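For a stricter check, pandas.testing.assert_frame_equal compares values, dtypes, and labels, and reports where the frames differ. The sketch below uses small hypothetical stand-ins named computed and expected rather than the full frames from this thread.

```python
import pandas as pd

# Hypothetical stand-ins for the computed and expected frames
computed = pd.DataFrame({'Sample': ['S1', 'S1'], 'Repeat Number': [1, 2]})
expected = pd.DataFrame({'Sample': ['S1', 'S1'], 'Repeat Number': ['1', '2']})

# The expected column holds strings, so cast it to int before comparing
expected_int = expected.astype({'Repeat Number': int})

# Element-wise comparison: .all() per column, then .all() across columns
all_equal = (computed == expected_int).all().all()

# Raises AssertionError with a description of the difference on any mismatch
pd.testing.assert_frame_equal(computed, expected_int)
```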