Remove groups if a column contains more than x values from a list
Hello, I have a list of elements such as:
list_element=['Elephant','Monkey','Cow','Human','Bird','Snail','Snake','Donkey','Baboon','Orang-Outan']
and a dataframe
name value
G1 Gr.1:4282399-4282564(+):Elephant
G1 SEQAHAHHE
G1 Zr.2:4282387-428245(-):Monkey
G1 GrA.2:42845-428289(+):Monkey
G1 QYEH897EH.3
G1 GrA2S2_ED:42845-4282789(+):Cow
G1 UDDKDDH6
G1 YDDIJBDIB778
G2 Gr.1:423663-4282542(-):Elephant
G2 Gr7E:423609-4282552(+):Elephant
G2 UEHHEE88E8E.2
G2 AP_UUD1_CU_OK-lQGGQ
G2 GrEH:423663-4282542(+):Baboon
G2 Gr7JE:42356-428257(+):Snail
G2 AP_UUD1_CU_OK-lQ8900
G2 ASGSG_E553:423663-4282542(-):Human
G3 GrA98_OK:42845-42867(+):Bird
G3 AGGAGA5567
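I keep G1 because it contains at most 3 distinct elements from the list (Monkey, Elephant and Cow).
I remove G2 because it contains more than 3 distinct elements (Elephant, Human, Snail and Baboon).
I keep G3 because it contains at most 3 distinct elements (Bird).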
As you can see, the elements appear in the values that contain '):',
and the expected output would be:
name value
G1 Gr.1:4282399-4282564(+):Elephant
G1 SEQAHAHHE
G1 Zr.2:4282387-428245(-):Monkey
G1 GrA.2:42845-428289(+):Monkey
G1 QYEH897EH.3
G1 GrA2S2_ED:42845-4282789(+):Cow
G1 UDDKDDH6
G1 YDDIJBDIB778
G3 GrA98_OK:42845-42867(+):Bird
G3 AGGAGA5567
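Thank you for your help
You can use .str.extract to pull the element out of each value, then groupby().transform('nunique') to count the unique elements in each group: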
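# Extract the first list element found in each value (NaN when none matches),
# then count the distinct elements per group and broadcast that count to every row
s = (df['value'].str.extract('({})'.format('|'.join(list_element)))[0]
       .groupby(df['name'])
       .transform('nunique'))
# Keep only the rows whose group has at most 3 distinct elements
df[s <= 3]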
Output:
name value
0 G1 Gr.1:4282399-4282564(+):Elephant
1 G1 SEQAHAHHE
2 G1 Zr.2:4282387-428245(-):Monkey
3 G1 GrA.2:42845-428289(+):Monkey
4 G1 QYEH897EH.3
5 G1 GrA2S2_ED:42845-4282789(+):Cow
6 G1 UDDKDDH6
7 G1 YDDIJBDIB778
16 G3 GrA98_OK:42845-42867(+):Bird
17 G3 AGGAGA5567
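For anyone who wants to run the .str.extract approach above end to end, here is a minimal self-contained sketch built from the sample data in the question. The names `raw`, `pattern`, `elements`, `n_unique` and `result` are only illustrative, and wrapping each element in `re.escape` is an extra precaution I am assuming is wanted in case an element ever contains a regex metacharacter; it is not part of the answer as posted.

```python
import re
import pandas as pd

list_element = ['Elephant', 'Monkey', 'Cow', 'Human', 'Bird', 'Snail',
                'Snake', 'Donkey', 'Baboon', 'Orang-Outan']

# Sample data copied from the question: one "name value" pair per line
raw = """\
G1 Gr.1:4282399-4282564(+):Elephant
G1 SEQAHAHHE
G1 Zr.2:4282387-428245(-):Monkey
G1 GrA.2:42845-428289(+):Monkey
G1 QYEH897EH.3
G1 GrA2S2_ED:42845-4282789(+):Cow
G1 UDDKDDH6
G1 YDDIJBDIB778
G2 Gr.1:423663-4282542(-):Elephant
G2 Gr7E:423609-4282552(+):Elephant
G2 UEHHEE88E8E.2
G2 AP_UUD1_CU_OK-lQGGQ
G2 GrEH:423663-4282542(+):Baboon
G2 Gr7JE:42356-428257(+):Snail
G2 AP_UUD1_CU_OK-lQ8900
G2 ASGSG_E553:423663-4282542(-):Human
G3 GrA98_OK:42845-42867(+):Bird
G3 AGGAGA5567"""
df = pd.DataFrame([line.split(' ', 1) for line in raw.splitlines()],
                  columns=['name', 'value'])

# One capture group that matches any element of the list (escaped for safety)
pattern = '({})'.format('|'.join(re.escape(e) for e in list_element))

# First matching element per row; rows without a match become NaN
elements = df['value'].str.extract(pattern)[0]

# Distinct elements per group, repeated on every row of that group (NaN is ignored)
n_unique = elements.groupby(df['name']).transform('nunique')

result = df[n_unique <= 3]
print(result)
```

Because `transform` returns a Series aligned with `df`, the comparison `n_unique <= 3` can be used directly as a boolean mask, and the original row index is preserved.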
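# Keep a group only if, among its values containing ':', the text after the
# last ':' takes at most 3 distinct values
df = df.groupby('name').filter(lambda x: len(set(x[x['value'].str.contains(':')]['value'].str.split(':').str[-1].values)) <= 3)
print(df)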
Prints:
name value
0 G1 Gr.1:4282399-4282564(+):Elephant
1 G1 SEQAHAHHE
2 G1 Zr.2:4282387-428245(-):Monkey
3 G1 GrA.2:42845-428289(+):Monkey
4 G1 QYEH897EH.3
5 G1 GrA2S2_ED:42845-4282789(+):Cow
6 G1 UDDKDDH6
7 G1 YDDIJBDIB778
16 G3 GrA98_OK:42845-42867(+):Bird
17 G3 AGGAGA5567
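The groupby().filter one-liner above can be easier to follow when unpacked into a small helper. This is only a sketch: `count_suffixes` is a hypothetical name, and the data is the sample from the question. Note that this approach counts whatever text follows the last ':' and never consults list_element, so it matches the list-based answer only as long as every such suffix is one of the listed elements, which happens to hold for the sample data.

```python
import pandas as pd

# Sample data copied from the question: one "name value" pair per line
raw = """\
G1 Gr.1:4282399-4282564(+):Elephant
G1 SEQAHAHHE
G1 Zr.2:4282387-428245(-):Monkey
G1 GrA.2:42845-428289(+):Monkey
G1 QYEH897EH.3
G1 GrA2S2_ED:42845-4282789(+):Cow
G1 UDDKDDH6
G1 YDDIJBDIB778
G2 Gr.1:423663-4282542(-):Elephant
G2 Gr7E:423609-4282552(+):Elephant
G2 UEHHEE88E8E.2
G2 AP_UUD1_CU_OK-lQGGQ
G2 GrEH:423663-4282542(+):Baboon
G2 Gr7JE:42356-428257(+):Snail
G2 AP_UUD1_CU_OK-lQ8900
G2 ASGSG_E553:423663-4282542(-):Human
G3 GrA98_OK:42845-42867(+):Bird
G3 AGGAGA5567"""
df = pd.DataFrame([line.split(' ', 1) for line in raw.splitlines()],
                  columns=['name', 'value'])


def count_suffixes(group: pd.DataFrame) -> int:
    """Count distinct texts after the last ':' among values that contain ':'."""
    values = group['value']
    # Literal match on ':' (regex=False), so no pattern interpretation is involved
    with_colon = values[values.str.contains(':', regex=False)]
    suffixes = with_colon.str.split(':').str[-1]
    return suffixes.nunique()


# filter() keeps every row of a group when the callable returns True for it
result = df.groupby('name').filter(lambda g: count_suffixes(g) <= 3)
print(result)
```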