Pandas str.contains() not working in some cases
I have a script that fits linear models between pairs of conditions. The dataframe looks like this:
Accession Sequence variable value
0 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 39.300171
1 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 132.637125
2 O14548 [R].gLPDQMLYr.[T] DMSO 1165.245826
3 O14548 [R].gLPDQMLYr.[T] DMSO 642.971908
4 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 83.906058
5 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 160.718841
6 O14548 [R].gLPDQMLYr.[T] DMSO 1240.856710
7 O14548 [R].gLPDQMLYr.[T] DMSO 557.508092
8 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 56.228425
9 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 302.346775
10 O14548 [R].gLPDQMLYr.[T] DMSO 1176.998098
11 O14548 [R].gLPDQMLYr.[T] DMSO 766.993819
12 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.387985
13 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.175678
14 O14548 [R].gLPDQMLYr.[T] CCCP 885.174420
15 O14548 [R].gLPDQMLYr.[T] CCCP 130.458963
16 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.557088
17 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.095801
18 O14548 [R].gLPDQMLYr.[T] CCCP 612.171540
19 O14548 [R].gLPDQMLYr.[T] CCCP 46.449990
20 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 6.016590
21 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.466220
22 O14548 [R].gLPDQMLYr.[T] CCCP 586.392482
23 O14548 [R].gLPDQMLYr.[T] CCCP 303.857624
24 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] C+I 44.627773
25 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] C+I 0.841494
26 O14548 [R].gLPDQMLYr.[T] C+I 632.355914
27 O14548 [R].gLPDQMLYr.[T] C+I 162.333292
28 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] C+I 12.075158
29 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] C+I 154.253098
30 O14548 [R].gLPDQMLYr.[T] C+I 159.767999
31 O14548 [R].gLPDQMLYr.[T] C+I 1031.399087
32 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] C+I 150.724386
33 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] C+I 260.684163
34 O14548 [R].gLPDQMLYr.[T] C+I 141.459156
35 O14548 [R].gLPDQMLYr.[T] C+I 262.659208
I now want to fit a linear model for each pair. I get the pairs with the following code:
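def tessa(source):
    # Build every unordered pair of items in `source`.
    result = []
    for p1 in range(len(source)):
        for p2 in range(p1 + 1, len(source)):
            result.append([source[p1], source[p2]])
    return result

# `conditions` holds the condition labels (defined elsewhere in the script).
unique_conditions = list(set(conditions))
pairs = tessa(unique_conditions)
print(pairs)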
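The same pairs can also be produced with the standard library. This is only a minimal sketch: the `conditions` list below is a made-up stand-in for the one used above. Sorting the unique values first makes the pair order reproducible, since the iteration order of a set is not guaranteed.

```python
from itertools import combinations

# Hypothetical stand-in for the `conditions` list used above.
conditions = ["DMSO", "DMSO", "CCCP", "C+I", "CCCP", "C+I"]

# sorted(set(...)) gives a deterministic order; a bare set does not.
unique_conditions = sorted(set(conditions))

# combinations() yields each unordered pair exactly once, like tessa().
pairs = [list(pair) for pair in combinations(unique_conditions, 2)]
print(pairs)
```

Because the input is already sorted, each pair comes out in sorted order and the later `pair.sort()` call becomes redundant.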
I loop over the pairs and filter the dataframe by the two conditions:
for pair in pairs:
    pair.sort()
    print(pair)
    print(pair[0], pair[1])
    temp = melted_Peptides[
        (melted_Peptides['variable'].str.contains(pair[0]))
        | (melted_Peptides['variable'].str.contains(pair[1]))
    ]
    print(temp)
Here comes the problem: it does not filter correctly. The output for the first pair:
['C+I', 'CCCP']
C+I CCCP
Accession Sequence variable value
12 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.387985
13 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.175678
14 O14548 [R].gLPDQMLYr.[T] CCCP 885.174420
15 O14548 [R].gLPDQMLYr.[T] CCCP 130.458963
16 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.557088
17 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.095801
18 O14548 [R].gLPDQMLYr.[T] CCCP 612.171540
19 O14548 [R].gLPDQMLYr.[T] CCCP 46.449990
20 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 6.016590
21 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.466220
22 O14548 [R].gLPDQMLYr.[T] CCCP 586.392482
23 O14548 [R].gLPDQMLYr.[T] CCCP 303.857624
For the next comparison it looks okay:
['CCCP', 'DMSO']
CCCP DMSO
Accession Sequence variable value
0 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 39.300171
1 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 132.637125
2 O14548 [R].gLPDQMLYr.[T] DMSO 1165.245826
3 O14548 [R].gLPDQMLYr.[T] DMSO 642.971908
4 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 83.906058
5 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 160.718841
6 O14548 [R].gLPDQMLYr.[T] DMSO 1240.856710
7 O14548 [R].gLPDQMLYr.[T] DMSO 557.508092
8 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 56.228425
9 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 302.346775
10 O14548 [R].gLPDQMLYr.[T] DMSO 1176.998098
11 O14548 [R].gLPDQMLYr.[T] DMSO 766.993819
12 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.387985
13 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.175678
14 O14548 [R].gLPDQMLYr.[T] CCCP 885.174420
15 O14548 [R].gLPDQMLYr.[T] CCCP 130.458963
16 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.557088
17 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.095801
18 O14548 [R].gLPDQMLYr.[T] CCCP 612.171540
19 O14548 [R].gLPDQMLYr.[T] CCCP 46.449990
20 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 6.016590
21 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] CCCP 0.466220
22 O14548 [R].gLPDQMLYr.[T] CCCP 586.392482
23 O14548 [R].gLPDQMLYr.[T] CCCP 303.857624
For the third it looks wrong again:
['C+I', 'DMSO']
C+I DMSO
Accession Sequence variable value
0 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 39.300171
1 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 132.637125
2 O14548 [R].gLPDQMLYr.[T] DMSO 1165.245826
3 O14548 [R].gLPDQMLYr.[T] DMSO 642.971908
4 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 83.906058
5 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 160.718841
6 O14548 [R].gLPDQMLYr.[T] DMSO 1240.856710
7 O14548 [R].gLPDQMLYr.[T] DMSO 557.508092
8 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 56.228425
9 O14548 [K].lAGAWASEAYSPQGLkPVVSTEAPPIIFATPTk.[L] DMSO 302.346775
10 O14548 [R].gLPDQMLYr.[T] DMSO 1176.998098
11 O14548 [R].gLPDQMLYr.[T] DMSO 766.993819
I am using the same code for approximately 5000 different dataframes and it always works. The conditions are named exactly the same, but somehow it breaks in some cases.
Can anybody please help?
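You can add the regex=False parameter so that Series.str.contains does not interpret the value as a regular expression. By default the pattern is treated as a regex, and in 'C+I' the + is a quantifier, so the pattern looks for one or more C followed directly by I and never matches the literal text C+I: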
melted_Peptides['variable'].str.contains(pair[0], regex=False)
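The difference can be reproduced on a small example. The sketch below assumes a toy dataframe named `df` with made-up values, not the original `melted_Peptides`. It also shows two alternatives that are not part of the answer above: escaping the pattern with `re.escape`, and using `Series.isin`, which tests for exact equality rather than substring containment and may fit better when the condition names are known exactly.

```python
import re

import pandas as pd

# Toy data standing in for melted_Peptides; the values are made up.
df = pd.DataFrame({
    "variable": ["DMSO", "CCCP", "C+I", "C+I"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Default regex=True: "C+I" means "one or more C, then I", so the
# literal text "C+I" does not match.
as_regex = df["variable"].str.contains("C+I")

# regex=False treats the pattern as a plain substring.
as_literal = df["variable"].str.contains("C+I", regex=False)

# re.escape() keeps regex mode but neutralises the special characters.
escaped = df["variable"].str.contains(re.escape("C+I"))

# isin() keeps rows whose label equals one of the two conditions exactly.
pair = ["C+I", "DMSO"]
temp = df[df["variable"].isin(pair)]

print(as_regex.tolist())
print(as_literal.tolist())
print(temp)
```

This would also explain why the same loop works on most dataframes: the filter only goes wrong when a condition name happens to contain a regex metacharacter such as +.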