
Regex Python findall. Making things nonredundant

I'm trying to write a function that finds the sequence 'ATG' in a string and then moves along the string in units of 3 until it finds a 'TAA', 'TAG', or 'TGA' (ATG-xxx-xxx-TAA|TAG|TGA).

To do this, I wrote this line of code (where fdna is the input sequence):

ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
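
As a quick check of what this pattern returns, here is a minimal sketch. The input below is a made-up toy sequence (not data from the question), assembled from short pieces so the reading frame is easy to follow.

```python
import re

# Hypothetical toy input, built piece by piece so the codons are visible.
fdna = "CC" + "ATG" + "AAA" + "CCC" + "TAG" + "GG" + "ATG" + "TTT" + "TGA" + "CC"

# ATG, then as few 3-letter codons as possible (lazy *?), then a stop codon.
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)', fdna)

print(ORF_sequences)  # ['ATGAAACCCTAG', 'ATGTTTTGA']
```

Because every group in the pattern is non-capturing, findall returns the whole matched text, and the lazy quantifier makes each match end at the first in-frame stop codon.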

I then wanted to add 3 requirements:

  1. Total length must be at least 30
  2. There must be either an A or a G, followed by any two characters, immediately before the ATG (A|GxxATGxxx)
  3. The next place after the ATG must be a G (ATGGxx)

To implement these, I changed my code to:

ORF_sequence_finder = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)

Instead of requiring all of these limits together, I want requirement 1 (length greater than or equal to 30 characters) plus EITHER requirement 2 (A|GxxATGxxx) OR requirement 3 (ATGGxx), OR both.

If I split the above line into two separate patterns and append their results to one list, the results get out of order and contain repeats.

Here are a few examples of the different cases:

sequence1 = 'AGCCATGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGAAAA'
sequence2 = 'ATCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'
sequence3 = 'AGCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'    
sequence4 = 'ATGGGGTGA'

sequence1 = 'A**G**CC*ATG*TGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'

sequence1 would be accepted because it follows requirement 2 (A|GxxATGxxx) and its length is >= 30.

sequence2 = 'ATCC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TAG*'

sequence2 would be accepted because it follows requirement 3 (ATGGxx) and its length is >= 30.

sequence3 = 'A**G**CC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'

sequence3 would be accepted because it fulfills both requirements 2 and 3 while also having >= 30 characters.

sequence4 = 'ATGGGGTGA'

sequence4 would NOT be accepted because its length is not >= 30 and it does not follow requirement 2 or requirement 3.

So basically, I want it to accept sequences that follow requirement 2, requirement 3, or both, while also satisfying requirement 1.

How can I split this up without adding duplicates (in cases where both requirements are met) and without getting the results out of order?
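
For readers who do want to keep two separate patterns, one possible approach is to record where each ORF starts and merge on that position. The following is only a sketch under stated assumptions: the names req2, req3, and find_orfs are hypothetical, the length of 30 is counted from the ATG through the stop codon, and the test string is made up for illustration.

```python
import re

STOP = r"(?:TAA|TAG|TGA)"
# In both patterns group 1 captures the ORF itself (ATG through the stop codon),
# so a hit found by both patterns shares the same start offset.
req2 = re.compile(r"[AG]..(ATG(?:...){8,}?" + STOP + ")")
req3 = re.compile(r"(ATGG..(?:...){7,}?" + STOP + ")")

def find_orfs(fdna):
    """Return ORFs found by either pattern, in sequence order, without repeats."""
    hits = {}
    for pattern in (req2, req3):
        for m in pattern.finditer(fdna):
            hits[m.start(1)] = m.group(1)  # same start offset overwrites, no duplicate
    return [hits[pos] for pos in sorted(hits)]

# Hypothetical input: a requirement-3-only ORF, a requirement-2-only ORF, then one with both.
a = "ATCC" + "ATG" + "G" * 30 + "TAG"
c = "AGCC" + "ATG" + "T" + "G" * 29 + "TGA"
b = "AGCC" + "ATG" + "G" * 30 + "TAA"
fdna = a + c + b

naive = req2.findall(fdna) + req3.findall(fdna)  # appended lists: unordered, with a repeat
merged = find_orfs(fdna)
print(len(naive), len(merged))  # 4 3
```

Keying the dictionary on the start offset of the ATG is what removes the repeat, and sorting the keys restores the order in which the ORFs appear in the input.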

If the optional leading [AG].. should count toward the length requirement, you can use:

r'(?x) (?: [AG].. ATG | ATG G.. )  (?:...){7,}? (?:TAA|TAG|TGA)'
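
To see how this single pattern behaves on the four kinds of input described in the question, here is a small sketch. The sample strings are hypothetical stand-ins built with string repetition (so their lengths are explicit), not the exact sequences from the question.

```python
import re

# Pattern copied from the line above; (?x) makes the spaces purely cosmetic.
pattern = re.compile(
    r'(?x) (?: [AG].. ATG | ATG G.. )  (?:...){7,}? (?:TAA|TAG|TGA)'
)

# Hypothetical samples, one per case.
samples = {
    "req2_only": "AGCC" + "ATG" + "T" + "G" * 29 + "TGA" + "AAA",
    "req3_only": "ATCC" + "ATG" + "G" * 30 + "TAG",
    "both":      "AGCC" + "ATG" + "G" * 30 + "TAG",
    "too_short": "ATGGGGTGA",
}

matches = {name: pattern.findall(seq) for name, seq in samples.items()}
for name, found in matches.items():
    print(name, [len(m) for m in found])
# req2_only [39]
# req3_only [36]
# both [39]
# too_short []
```

When the [AG].. alternative applies, the three leading characters are part of the returned match, which is why those matches are three characters longer. The "both" case yields a single match because the alternation is tried once per position.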

Or, if you don't want to include [AG].. in the match, you could use lookarounds:

r'(?x) ATG (?: (?<=[AG].. ATG) | (?=G) ) (?:...){8,}? (?:TAA|TAG|TGA)'
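
The lookaround variant can be exercised the same way. In this sketch the sample strings are again hypothetical, and finditer is used so that the start offset of each match is visible alongside its length.

```python
import re

# Pattern copied from the line above. The lookbehind checks the three characters
# before ATG without consuming them; the lookahead checks for a G after ATG.
pattern = re.compile(
    r'(?x) ATG (?: (?<=[AG].. ATG) | (?=G) ) (?:...){8,}? (?:TAA|TAG|TGA)'
)

# Hypothetical samples, one per case.
samples = {
    "req2_only": "AGCC" + "ATG" + "T" + "G" * 29 + "TGA" + "AAA",
    "req3_only": "ATCC" + "ATG" + "G" * 30 + "TAG",
    "both":      "AGCC" + "ATG" + "G" * 30 + "TAG",
    "too_short": "ATGGGGTGA",
}

# For each sample, collect (start offset, match length) pairs.
spans = {
    name: [(m.start(), len(m.group())) for m in pattern.finditer(seq)]
    for name, seq in samples.items()
}
for name, found in spans.items():
    print(name, found)
# req2_only [(4, 36)]
# req3_only [(4, 36)]
# both [(4, 36)]
# too_short []
```

Every match now begins at the ATG itself, so the minimum of eight codons between ATG and the stop codon gives a total length of at least 30.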
