Remove groups if a column contains more than x values from a list

Hello, I have a list of elements such as:

list_element=['Elephant','Monkey','Cow','Human','Bird','Snail','Snake','Donkey','Baboon','Orang-Outan']

and a dataframe:

name  value
G1    Gr.1:4282399-4282564(+):Elephant
G1    SEQAHAHHE
G1    Zr.2:4282387-428245(-):Monkey
G1    GrA.2:42845-428289(+):Monkey
G1    QYEH897EH.3
G1    GrA2S2_ED:42845-4282789(+):Cow
G1    UDDKDDH6
G1    YDDIJBDIB778
G2    Gr.1:423663-4282542(-):Elephant
G2    Gr7E:423609-4282552(+):Elephant
G2    UEHHEE88E8E.2
G2    AP_UUD1_CU_OK-lQGGQ
G2    GrEH:423663-4282542(+):Baboon
G2    Gr7JE:42356-428257(+):Snail
G2    AP_UUD1_CU_OK-lQ8900
G2    ASGSG_E553:423663-4282542(-):Human
G3    GrA98_OK:42845-42867(+):Bird
G3    AGGAGA5567

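I keep G1 because it contains 3 or fewer distinct elements in total (Monkey, Elephant and Cow).

I remove G2 because it contains more than 3 distinct elements (Elephant, Human, Snail and Baboon).

I keep G3 because it contains 3 or fewer distinct elements in total (Bird).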

As you can see, the elements are found in the values that contain '):'.
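To make the rule concrete, here is a minimal sketch that rebuilds the example dataframe and counts the distinct list elements per group. The `rows` list, the `pattern` string and the variable names `element` and `counts` are assumptions made for illustration. Only `list_element` and the `name` and `value` columns come from the question.

```python
import re
import pandas as pd

list_element = ['Elephant', 'Monkey', 'Cow', 'Human', 'Bird',
                'Snail', 'Snake', 'Donkey', 'Baboon', 'Orang-Outan']

# Example data copied from the question
rows = [
    ('G1', 'Gr.1:4282399-4282564(+):Elephant'),
    ('G1', 'SEQAHAHHE'),
    ('G1', 'Zr.2:4282387-428245(-):Monkey'),
    ('G1', 'GrA.2:42845-428289(+):Monkey'),
    ('G1', 'QYEH897EH.3'),
    ('G1', 'GrA2S2_ED:42845-4282789(+):Cow'),
    ('G1', 'UDDKDDH6'),
    ('G1', 'YDDIJBDIB778'),
    ('G2', 'Gr.1:423663-4282542(-):Elephant'),
    ('G2', 'Gr7E:423609-4282552(+):Elephant'),
    ('G2', 'UEHHEE88E8E.2'),
    ('G2', 'AP_UUD1_CU_OK-lQGGQ'),
    ('G2', 'GrEH:423663-4282542(+):Baboon'),
    ('G2', 'Gr7JE:42356-428257(+):Snail'),
    ('G2', 'AP_UUD1_CU_OK-lQ8900'),
    ('G2', 'ASGSG_E553:423663-4282542(-):Human'),
    ('G3', 'GrA98_OK:42845-42867(+):Bird'),
    ('G3', 'AGGAGA5567'),
]
df = pd.DataFrame(rows, columns=['name', 'value'])

# Match "):<element>" at the end of a value; re.escape guards names
# that contain regex metacharacters
pattern = r'\):(' + '|'.join(re.escape(e) for e in list_element) + r')$'

# One element per row, NaN where the value carries no listed element
element = df['value'].str.extract(pattern)[0]

# Number of distinct elements in each group (NaN is not counted)
counts = element.groupby(df['name']).nunique()
print(counts)
```

With this data the counts come out as 3 for G1, 4 for G2 and 1 for G3, which is why only G2 exceeds the limit of 3.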

and the expected output would be:

name  value
G1    Gr.1:4282399-4282564(+):Elephant
G1    SEQAHAHHE
G1    Zr.2:4282387-428245(-):Monkey
G1    GrA.2:42845-428289(+):Monkey
G1    QYEH897EH.3
G1    GrA2S2_ED:42845-4282789(+):Cow
G1    UDDKDDH6
G1    YDDIJBDIB778
G3    GrA98_OK:42845-42867(+):Bird
G3    AGGAGA5567

Thanks for your help.

You can use .str.extract to pull out the element, then groupby().transform('nunique') to count the unique elements in each group:

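# Extract the first list element found in each value (NaN if none),
# then count the distinct elements per group and broadcast the count to every row
s = (df['value'].str.extract('({})'.format('|'.join(list_element)))[0]
    .groupby(df['name'])
    .transform('nunique'))

# Keep only the rows whose group has at most 3 distinct elements
df[s <= 3]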

Output:

   name                             value
0    G1  Gr.1:4282399-4282564(+):Elephant
1    G1                         SEQAHAHHE
2    G1     Zr.2:4282387-428245(-):Monkey
3    G1      GrA.2:42845-428289(+):Monkey
4    G1                       QYEH897EH.3
5    G1    GrA2S2_ED:42845-4282789(+):Cow
6    G1                          UDDKDDH6
7    G1                      YDDIJBDIB778
16   G3      GrA98_OK:42845-42867(+):Bird
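17   G3                        AGGAGA5567

Alternatively, use groupby().filter() and count the distinct labels after the last ':' in each group:

# Keep a group only if its values containing ':' end in at most 3 distinct labels
df = df.groupby('name').filter(
    lambda x: x.loc[x['value'].str.contains(':'), 'value']
               .str.split(':').str[-1].nunique() <= 3
)
print(df)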

Prints:

   name                             value
0    G1  Gr.1:4282399-4282564(+):Elephant
1    G1                         SEQAHAHHE
2    G1     Zr.2:4282387-428245(-):Monkey
3    G1      GrA.2:42845-428289(+):Monkey
4    G1                       QYEH897EH.3
5    G1    GrA2S2_ED:42845-4282789(+):Cow
6    G1                          UDDKDDH6
7    G1                      YDDIJBDIB778
16   G3      GrA98_OK:42845-42867(+):Bird
17   G3                        AGGAGA5567
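
Both answers hard-code the threshold of 3, and the first builds its pattern from unescaped names. As a hedged sketch, the helper below wraps the transform('nunique') approach with a configurable limit and re.escape. The function name `filter_groups`, its parameters and the small `toy` dataframe are hypothetical and not taken from either answer.

```python
import re
import pandas as pd

def filter_groups(df, elements, max_unique, group_col='name', value_col='value'):
    """Keep groups whose values mention at most `max_unique` distinct elements."""
    # Escape each name so regex metacharacters are matched literally
    pattern = '(' + '|'.join(re.escape(e) for e in elements) + ')'
    # First listed element found in each value, NaN when none is present
    found = df[value_col].str.extract(pattern)[0]
    # Distinct elements per group, repeated on every row of that group
    n_unique = found.groupby(df[group_col]).transform('nunique')
    return df[n_unique <= max_unique]

# Made-up data for illustration only
toy = pd.DataFrame({
    'name':  ['A', 'A', 'A', 'B', 'B'],
    'value': ['x:1-2(+):Cow', 'x:3-4(-):Bird', 'NOELEMENT',
              'y:5-6(+):Cow', 'PLAIN'],
})

# Group A mentions two distinct elements, group B only one
result = filter_groups(toy, ['Cow', 'Bird', 'Snail'], max_unique=1)
print(result)
```

Note that this pattern matches an element name anywhere in the value. If names could also occur inside unrelated identifiers, anchoring the pattern to the '):' separator would be a safer choice.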

