Regex Python findall. Making things nonredundant
I'm trying to write a function that finds the sequence 'ATG' in a string and then moves along the string in units of 3 until it finds a 'TAA', 'TAG', or 'TGA' (ATG-xxx-xxx-TAA|TAG|TGA).
To do this, I wrote this line of code (where fdna is the input sequence):
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
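As a quick check of how this pattern walks the string in triplets, here is a minimal sketch. The input string is made up for illustration and does not come from the question; it contains an out-of-frame 'TAG' that the pattern should step over.

```python
import re

# Hypothetical input, not from the question: the 'TAG' at index 4 is out of
# frame relative to the 'ATG' at index 0, so it should not end the match.
fdna = 'ATGATAGCCTGA'

# ATG, then as few whole triplets as possible, then an in-frame stop codon.
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)', fdna)
print(ORF_sequences)  # ['ATGATAGCCTGA']
```

Because every group in the pattern is non-capturing, findall returns the full matched strings rather than group contents.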
I then wanted to add 3 requirements:
To implement this, I changed my code to:
ORF_sequence_finder = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)
Instead of enforcing all of these limits at once, I want requirement 1 (length greater than or equal to 30 characters) combined with EITHER requirement 2 (A|GxxATGxxx) OR requirement 3 (ATGGxx), or both.
If I split the above line into two patterns and append their results to a list, the matches come out of order and contain repeats.
Here are a few examples of the different cases:
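The sketch below reproduces that problem under some assumptions: the two patterns (named req2 and req3 here) are a guess at how the line was split, and the input is a made-up string holding two ORFs, the first satisfying only requirement 3 and the second satisfying both.

```python
import re

STOP = r'(?:TAA|TAG|TGA)'
# Assumed split of the combined pattern; these names are hypothetical.
req2 = re.compile(r'[AG]..ATG(?:...){7,}?' + STOP)
req3 = re.compile(r'ATGG..(?:...){7,}?' + STOP)

# Made-up input: the first ORF meets requirement 3 only, the second meets both.
first = 'ATCC' + 'ATG' + 'G' * 30 + 'TAA'
second = 'AGCC' + 'ATG' + 'G' * 30 + 'TGA'
fdna = first + 'TTT' + second

hits = []
hits.extend(req2.findall(fdna))  # finds only the second ORF (with its prefix)
hits.extend(req3.findall(fdna))  # finds both ORFs, the second one again

for hit in hits:
    print(hit)
```

The second ORF is listed before the first and appears twice, once with its 'GCC' prefix and once without, which is the out-of-order, repeated output described above.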
sequence1 = 'AGCCATGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGAAAA'
sequence2 = 'ATCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'
sequence3 = 'AGCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'
sequence4 = 'ATGGGGTGA'
sequence1 = 'A**G**CC*ATG*TGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'
sequence1 would be accepted because it follows requirement 2 (A|GxxATGxxx) and its length is >= 30.
sequence2 = 'ATCC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TAG*'
sequence2 would be accepted because it follows requirement 3 (ATGGxx) and its length is >= 30.
sequence3 = 'A**G**CC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'
sequence3 would be accepted because it fulfills both requirements 2 and 3 while also having >= 30 characters.
sequence4 = 'ATGGGGTGA'
sequence4 would NOT be accepted because its length is not >= 30 and it does not follow requirement 2 or requirement 3.
So basically, I want it to accept sequences that follow requirement 2, requirement 3, or both, while also satisfying requirement 1.
How can I split this up without adding duplicates (in cases where both occur) and without the results getting out of order?
If the possible [AG].. should be included in the length requirement, you can use:
r'(?x) (?: [AG].. ATG | ATG G.. ) (?:...){7,}? (?:TAA|TAG|TGA)'
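A minimal sketch of how this single pattern could be used with findall. The helper name find_orfs and the test strings are assumptions for illustration; the strings are modeled on the cases in the question but built explicitly so their lengths are unambiguous.

```python
import re

# The pattern from the answer: either alternative consumes 6 characters,
# then at least 7 triplets and a stop codon, for 30 characters minimum.
ORF_PATTERN = re.compile(
    r'(?x) (?: [AG].. ATG | ATG G.. ) (?:...){7,}? (?:TAA|TAG|TGA)'
)

def find_orfs(fdna):
    """Return all non-overlapping matches, in left-to-right order."""
    return ORF_PATTERN.findall(fdna)

# Hypothetical inputs modeled on the question's cases.
req2_only = 'AGCC' + 'ATG' + 'TGG' + 'G' * 27 + 'TGA' + 'AAA'
req3_only = 'ATCC' + 'ATG' + 'G' * 30 + 'TAG'
both_reqs = 'AGCC' + 'ATG' + 'G' * 30 + 'TAG'
too_short = 'ATGGGGTGA'

for name, seq in [('req2_only', req2_only), ('req3_only', req3_only),
                  ('both_reqs', both_reqs), ('too_short', too_short)]:
    print(name, find_orfs(seq))
```

Since one scan handles both alternatives, a sequence that meets both requirements is reported once, and matches come back in the order they occur. When the [AG].. alternative matches, its three leading characters are part of the returned string.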
Or, if you don't want to include [AG].. in the match, you could use lookarounds:
r'(?x) ATG (?: (?<=[AG].. ATG) | (?=G) ) (?:...){8,}? (?:TAA|TAG|TGA)'
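A minimal sketch of the lookaround variant in use. As before, the helper name find_orfs_from_atg and the test strings are assumptions for illustration rather than part of the original answer.

```python
import re

# The lookaround pattern from the answer: the match always starts at ATG.
# After ATG, either the six characters just consumed must look like
# [AG]..ATG (lookbehind), or the next character must be G (lookahead).
# ATG + at least 8 triplets + stop codon gives 30 characters minimum.
ORF_LOOKAROUND = re.compile(
    r'(?x) ATG (?: (?<=[AG].. ATG) | (?=G) ) (?:...){8,}? (?:TAA|TAG|TGA)'
)

def find_orfs_from_atg(fdna):
    """Return matches that begin at ATG, without any [AG].. prefix."""
    return ORF_LOOKAROUND.findall(fdna)

# Hypothetical inputs modeled on the question's cases.
req2_only = 'AGCC' + 'ATG' + 'TGG' + 'G' * 27 + 'TGA' + 'AAA'
req3_only = 'ATCC' + 'ATG' + 'G' * 30 + 'TAG'
neither = 'TTCC' + 'ATG' + 'C' * 30 + 'TAA'
too_short = 'ATGGGGTGA'

for name, seq in [('req2_only', req2_only), ('req3_only', req3_only),
                  ('neither', neither), ('too_short', too_short)]:
    print(name, find_orfs_from_atg(seq))
```

Because the lookbehind consumes nothing, every returned string starts with ATG, so the same ORF looks identical whether requirement 2, requirement 3, or both were what admitted it.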