Regex Python findall: making things non-redundant

Question

I'm trying to write a function that finds the sequence 'ATG' in a string and then moves along the string in units of 3 until it finds a 'TAA', 'TAG', or 'TGA' (ATG-xxx-xxx-TAA|TAG|TGA).

To do this, I wrote this line of code (where fdna is the input sequence):

ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
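
As a rough illustration of how this pattern behaves, the sketch below runs it on a short made-up `fdna` value (the sequence is an assumption for demonstration, not data from the question). Because the lazy `(?:...)*?` only ever consumes whole triplets, the out-of-frame `TAA` inside the input is skipped and the match ends at the first stop codon that is in frame with the ATG.

```python
import re

# Hypothetical input: ATG, two codons, then an in-frame TGA.
# The codons CTA + ACC also contain an out-of-frame "TAA" (cTAAcc).
fdna = "CC" + "ATG" + "CTA" + "ACC" + "TGA" + "GG"

# ATG, then as few whole codons as possible, then a stop codon.
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)', fdna)

print(ORF_sequences)
```

The printed list should contain a single match that runs from the ATG through the in-frame TGA.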

I then wanted to add 3 requirements:

1. Total length must be at least 30
2. Three positions before the ATG (that is, with two characters in between) there must be either an A or a G (A|GxxATGxxx)
3. The position immediately after the ATG must be a G (ATGGxx)

To implement this, I changed my code to:

ORF_sequence_finder = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)
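
To make the behaviour of this stricter pattern concrete, here is a small sketch on three made-up inputs. The names `strict_pattern`, `filler`, `only_req2`, `only_req3`, and `both` are assumptions introduced for illustration, and the inputs are built from G codons rather than copied from the question. It suggests that this pattern only matches when requirements 2 and 3 hold at the same time.

```python
import re

# Same pattern as above: [AG] three positions before ATG AND a G right after it.
strict_pattern = re.compile(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)')

filler = "GGG" * 8  # 24 bases containing no stop codon (hypothetical filler)

only_req2 = "AGCC" + "ATG" + "TGG" + filler + "TGA"  # G three before ATG, T after ATG
only_req3 = "ATCC" + "ATG" + "GGG" + filler + "TAG"  # T three before ATG, G after ATG
both = "AGCC" + "ATG" + "GGG" + filler + "TAG"       # satisfies both requirements

for name, seq in [("only_req2", only_req2), ("only_req3", only_req3), ("both", both)]:
    matches = strict_pattern.findall(seq)
    print(name, len(matches))
```

Only the `both` input produces a match here; the other two are rejected even though each satisfies one of the requirements.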

What I want, instead of enforcing all of these limits at once, is to always enforce requirement 1 (length greater than or equal to 30 characters) and then require EITHER requirement 2 (A|GxxATGxxx) OR requirement 3 (ATGGxx), or both.

If I split the above line into two patterns and append their results to a list, the matches come out of order and contain repeats.

Here are a few examples of the different cases:

sequence1 = 'AGCCATGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGAAAA'
sequence2 = 'ATCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'
sequence3 = 'AGCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'    
sequence4 = 'ATGGGGTGA'

sequence1 = 'A**G**CC*ATG*TGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'

sequence1 would be accepted because it meets requirement 2 (A|GxxATGxxx) and its length is >= 30.

sequence2 = 'ATCC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TAG*'

sequence2 would be accepted because it meets requirement 3 (ATGGxx) and its length is >= 30.

sequence3 = 'A**G**CC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'

sequence3 would be accepted because it fulfills both requirements 2 and 3 while also having >= 30 characters.

sequence4 = 'ATGGGGTGA'

sequence4 would NOT be accepted because it is shorter than 30 characters. It does not meet requirement 2 either, and although it starts with ATGG (requirement 3), that alone is not enough.

So basically, I want it to accept sequences that satisfy requirement 1 together with requirement 2, requirement 3, or both.

How can I split this up without adding duplicates (in cases where both requirements are met) and without the results getting out of order?

Answer 1

If the optional leading [AG].. should count toward the length requirement, you can use:

r'(?x) (?: [AG].. ATG | ATG G.. )  (?:...){7,}? (?:TAA|TAG|TGA)'
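
A minimal sketch of how this pattern could be tried out is shown below. The names `either_pattern`, `filler`, `only_req2`, `only_req3`, `both`, and `combined` are hypothetical and the test inputs are made up from G codons; only `'ATGGGGTGA'` is taken from the question (sequence4). With a single alternation, one `findall` pass should return each ORF once and in the order it appears.

```python
import re

# (?x) turns on verbose mode, so the spaces inside the pattern are ignored.
either_pattern = re.compile(
    r'(?x) (?: [AG].. ATG | ATG G.. )  (?:...){7,}? (?:TAA|TAG|TGA)'
)

filler = "GGG" * 8  # 24 bases containing no stop codon (hypothetical filler)

only_req2 = "AGCC" + "ATG" + "TGG" + filler + "TGA"  # requirement 2 only
only_req3 = "ATCC" + "ATG" + "GGG" + filler + "TAG"  # requirement 3 only
both = "AGCC" + "ATG" + "GGG" + filler + "TAG"       # requirements 2 and 3
too_short = "ATGGGGTGA"                              # sequence4 from the question

# One pass over a concatenation of the three accepted cases.
combined = only_req2 + only_req3 + both
for match in either_pattern.findall(combined):
    print(len(match), match[:9])

print(either_pattern.findall(too_short))
```

When the `[AG]..` prefix is present, the match starts at that prefix, so the matched strings for requirement 2 are three characters longer than those that only satisfy requirement 3.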

Or, if you don't want to include [AG].. in the match, you could use lookarounds:

r'(?x) ATG (?: (?<=[AG].. ATG) | (?=G) ) (?:...){8,}? (?:TAA|TAG|TGA)'
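
The lookaround variant can be exercised in the same way. The sketch below uses hypothetical names (`lookaround_pattern`, `filler`, `only_req2`, `only_req3`, `both`, `too_short`) and made-up inputs built from G codons. Since the lookbehind `(?<=[AG].. ATG)` and the lookahead `(?=G)` consume no characters, every match should start at the ATG itself, and `{8,}` then enforces the minimum length of 30 (3 + 24 + 3).

```python
import re

# After ATG, require either [AG] three positions before it (lookbehind)
# or a G immediately after it (lookahead). Neither check consumes input.
lookaround_pattern = re.compile(
    r'(?x) ATG (?: (?<=[AG].. ATG) | (?=G) ) (?:...){8,}? (?:TAA|TAG|TGA)'
)

filler = "GGG" * 8  # 24 bases containing no stop codon (hypothetical filler)

only_req2 = "AGCC" + "ATG" + "TGG" + filler + "TGA"  # requirement 2 only
only_req3 = "ATCC" + "ATG" + "GGG" + filler + "TAG"  # requirement 3 only
both = "AGCC" + "ATG" + "GGG" + filler + "TAG"       # requirements 2 and 3
too_short = "ATG" + "GGG" * 7 + "TAA"                # 27 bases, below the minimum

for name, seq in [("only_req2", only_req2), ("only_req3", only_req3),
                  ("both", both), ("too_short", too_short)]:
    matches = lookaround_pattern.findall(seq)
    print(name, [m[:6] for m in matches], [len(m) for m in matches])
```

Note that Python's `re` module requires lookbehind patterns to have a fixed width, which `[AG]..ATG` satisfies (six characters once the verbose-mode spaces are ignored).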

Solution 1
1, accepted, 2013-03-14 05:10:37
