Pandas str.contains() not working in some cases

I have a script that fits linear models between pairs of conditions. The dataframe looks like this:

   Accession                                   Sequence variable        value
0     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO    39.300171
1     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO   132.637125
2     O14548                          [R].gLPDQMLYr.[T]     DMSO  1165.245826
3     O14548                          [R].gLPDQMLYr.[T]     DMSO   642.971908
4     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO    83.906058
5     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO   160.718841
6     O14548                          [R].gLPDQMLYr.[T]     DMSO  1240.856710
7     O14548                          [R].gLPDQMLYr.[T]     DMSO   557.508092
8     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO    56.228425
9     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO   302.346775
10    O14548                          [R].gLPDQMLYr.[T]     DMSO  1176.998098
11    O14548                          [R].gLPDQMLYr.[T]     DMSO   766.993819
12    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.387985
13    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.175678
14    O14548                          [R].gLPDQMLYr.[T]     CCCP   885.174420
15    O14548                          [R].gLPDQMLYr.[T]     CCCP   130.458963
16    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.557088
17    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.095801
18    O14548                          [R].gLPDQMLYr.[T]     CCCP   612.171540
19    O14548                          [R].gLPDQMLYr.[T]     CCCP    46.449990
20    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     6.016590
21    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.466220
22    O14548                          [R].gLPDQMLYr.[T]     CCCP   586.392482
23    O14548                          [R].gLPDQMLYr.[T]     CCCP   303.857624
24    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]      C+I    44.627773
25    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]      C+I     0.841494
26    O14548                          [R].gLPDQMLYr.[T]      C+I   632.355914
27    O14548                          [R].gLPDQMLYr.[T]      C+I   162.333292
28    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]      C+I    12.075158
29    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]      C+I   154.253098
30    O14548                          [R].gLPDQMLYr.[T]      C+I   159.767999
31    O14548                          [R].gLPDQMLYr.[T]      C+I  1031.399087
32    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]      C+I   150.724386
33    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]      C+I   260.684163
34    O14548                          [R].gLPDQMLYr.[T]      C+I   141.459156
35    O14548                          [R].gLPDQMLYr.[T]      C+I   262.659208

I now want to fit a linear model for each pair. I get the pairs with the following code:

def tessa(source):
    result = []
    for p1 in range(len(source)):
        for p2 in range(p1 + 1, len(source)):
            result.append([source[p1], source[p2]])
    return result

unique_conditions = list(set(conditions))
pairs = tessa(unique_conditions)
print(pairs)
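
For reference, the pairing above can probably also be expressed with the standard library. This is only a sketch: the `conditions` list here is a hypothetical stand-in built from the three labels shown in the dataframe, and `sorted(set(...))` is used instead of `list(set(...))` so the pair order is reproducible.

```python
from itertools import combinations

# Hypothetical stand-in for the real `conditions` list (one label per sample).
conditions = ["DMSO", "DMSO", "CCCP", "CCCP", "C+I", "C+I"]

# sorted() makes the order deterministic; a plain set has no guaranteed order.
unique_conditions = sorted(set(conditions))

# combinations(..., 2) yields every unordered pair exactly once,
# which is what the nested loops in tessa() produce.
pairs = [list(p) for p in combinations(unique_conditions, 2)]
print(pairs)
```

Because the labels are sorted first, each pair already comes out in sorted order, so a later `pair.sort()` would be a no-op.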

I am looping over the pairs and filtering the dataframe for the conditions:

for pair in pairs:
    pair.sort()
    print(pair)
    print(pair[0], pair[1])
    temp = melted_Peptides[
        melted_Peptides['variable'].str.contains(pair[0])
        | melted_Peptides['variable'].str.contains(pair[1])
    ]
    print(temp)

Here is the problem: it does not filter correctly. The output is:

['C+I', 'CCCP']
C+I CCCP
   Accession                                   Sequence variable       value
12    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP    0.387985
13    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP    0.175678
14    O14548                          [R].gLPDQMLYr.[T]     CCCP  885.174420
15    O14548                          [R].gLPDQMLYr.[T]     CCCP  130.458963
16    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP    0.557088
17    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP    0.095801
18    O14548                          [R].gLPDQMLYr.[T]     CCCP  612.171540
19    O14548                          [R].gLPDQMLYr.[T]     CCCP   46.449990
20    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP    6.016590
21    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP    0.466220
22    O14548                          [R].gLPDQMLYr.[T]     CCCP  586.392482
23    O14548                          [R].gLPDQMLYr.[T]     CCCP  303.857624

For the next comparison it looks okay:

['CCCP', 'DMSO']
CCCP DMSO
   Accession                                   Sequence variable        value
0     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO    39.300171
1     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO   132.637125
2     O14548                          [R].gLPDQMLYr.[T]     DMSO  1165.245826
3     O14548                          [R].gLPDQMLYr.[T]     DMSO   642.971908
4     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO    83.906058
5     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO   160.718841
6     O14548                          [R].gLPDQMLYr.[T]     DMSO  1240.856710
7     O14548                          [R].gLPDQMLYr.[T]     DMSO   557.508092
8     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO    56.228425
9     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO   302.346775
10    O14548                          [R].gLPDQMLYr.[T]     DMSO  1176.998098
11    O14548                          [R].gLPDQMLYr.[T]     DMSO   766.993819
12    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.387985
13    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.175678
14    O14548                          [R].gLPDQMLYr.[T]     CCCP   885.174420
15    O14548                          [R].gLPDQMLYr.[T]     CCCP   130.458963
16    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.557088
17    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.095801
18    O14548                          [R].gLPDQMLYr.[T]     CCCP   612.171540
19    O14548                          [R].gLPDQMLYr.[T]     CCCP    46.449990
20    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     6.016590
21    O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     CCCP     0.466220
22    O14548                          [R].gLPDQMLYr.[T]     CCCP   586.392482
23    O14548                          [R].gLPDQMLYr.[T]     CCCP   303.857624

For the third it looks weird again:

['C+I', 'DMSO']
['C+I', 'DMSO']
C+I DMSO
   Accession                                   Sequence variable        value
0     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO    39.300171
1     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO   132.637125
2     O14548                          [R].gLPDQMLYr.[T]     DMSO  1165.245826
3     O14548                          [R].gLPDQMLYr.[T]     DMSO   642.971908
4     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO    83.906058
5     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO   160.718841
6     O14548                          [R].gLPDQMLYr.[T]     DMSO  1240.856710
7     O14548                          [R].gLPDQMLYr.[T]     DMSO   557.508092
8     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO    56.228425
9     O14548  [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L]     DMSO   302.346775
10    O14548                          [R].gLPDQMLYr.[T]     DMSO  1176.998098
11    O14548                          [R].gLPDQMLYr.[T]     DMSO   766.993819

I am using the same code for approx. 5000 different dataframes and it always works. The conditions are named exactly the same, but somehow it breaks in some cases.

Can anybody please help?

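By default, Series.str.contains treats the pattern as a regular expression. In 'C+I' the + is a quantifier ("one or more C, followed by I"), so the pattern never matches the literal text C+I. You can add the regex=False parameter so that the value is matched as a plain substring instead: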

melted_Peptides['variable'].str.contains(pair[0], regex=False)
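
The difference can be checked on a small example. This is a minimal sketch, assuming a hypothetical Series named `variable` that holds the three labels from the question. It also shows two alternatives that are not part of the original answer: escaping the pattern with `re.escape`, and testing for exact equality with `Series.isin`.

```python
import re

import pandas as pd

# Hypothetical stand-in for melted_Peptides['variable'].
variable = pd.Series(["DMSO", "CCCP", "C+I"])

# Default (regex=True): "+" is a quantifier, so nothing matches here.
as_regex = variable.str.contains("C+I")

# regex=False: "C+I" is matched as a literal substring.
as_literal = variable.str.contains("C+I", regex=False)

# Alternative: stay in regex mode but escape the metacharacter.
as_escaped = variable.str.contains(re.escape("C+I"))

# Alternative: exact label comparison for a whole pair, no pattern matching.
pair = ["C+I", "DMSO"]
exact = variable.isin(pair)

print(as_regex.tolist())    # [False, False, False]
print(as_literal.tolist())  # [False, False, True]
print(as_escaped.tolist())  # [False, False, True]
print(exact.tolist())       # [True, False, True]
```

If the condition names are always complete labels rather than substrings, the `isin` form may be the simpler filter, since it avoids pattern matching altogether and would not pick up a label that merely contains another one.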
