
Regex for matching a specific pattern only if it doesn't match another pattern

I need to write a regex that finds genetic sequences, and I am stuck on one specific problem. After the start codon ATG come other codons, each three nucleotides long, and the match ends with one of three possible stop codons: TAA, TAG or TGA. What if the stop codon comes immediately after the start codon (ATG)? My current regex works when there are intermediate codons between the start and stop codons, but when there are none it matches all of the sequence after the start codon. I know why it does that, but I have no idea how to change it to work the way I want.

My regex should look for AGGAGG (exactly this pattern), then A, C, G or T (4 to 12 times), then ATG (exactly this pattern), then A, C, G or T in triples (for example ACG, TGC, etc.; any number of them) until it reaches TAA, TAG or TGA. The match should end there, and the search should then start again after it.

Example of a good match:

XXXXXXXXXXXXXXXXXXXXXXXXX   XXXXXXXXXXXXXXXX
AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT

There are two matches in the sequence: from 0 to 25 and from 28 to 44.

My current regex (don't mind the first two brackets):

$seq =~ /(AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3,3}){0,}(TAA|TAG|TGA)/ig

The problem here comes from the default use of greedy quantifiers.

With (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*(TAA|TAG|TGA), the 4th group ([ACTG]{3})* matches as many codons as possible, and only then is the 5th group considered (backtracking if needed).
Your sequence contains TAGTAG. The greedy quantifier causes the first TAG to be captured in group 4 and the second one to be captured as the ending group.

You can use a lazy quantifier instead: (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*?(TAA|TAG|TGA) (note the added question mark, which makes the quantifier lazy).
That way, the first TAG encountered is treated as the ending group.
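
To check the difference on the example sequence from the question, here is a minimal sketch that runs both patterns and records where each match starts and ends (end offset exclusive). The helper name match_spans is hypothetical, introduced only for this illustration, and the capture groups are made non-capturing because only the offsets are needed here.

```perl
use strict;
use warnings;

# Example sequence copied from the question.
my $seq = 'AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT';

# Hypothetical helper: returns "start-end" (end exclusive) for every
# non-overlapping match of the given pattern in the given string.
sub match_spans {
    my ($re, $s) = @_;
    my @spans;
    while ($s =~ /$re/g) {
        push @spans, join('-', $-[0], $+[0]);
    }
    return @spans;
}

# Identical patterns except for the quantifier on the codon group.
my $greedy = qr/AGGAGG[ACGT]{4,12}ATG(?:[ACGT]{3})*(?:TAA|TAG|TGA)/i;
my $lazy   = qr/AGGAGG[ACGT]{4,12}ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)/i;

my @greedy_spans = match_spans($greedy, $seq);
my @lazy_spans   = match_spans($lazy,   $seq);

print "greedy: @greedy_spans\n";   # greedy: 0-28 28-47
print "lazy:   @lazy_spans\n";     # lazy:   0-25 28-44
```

The lazy spans agree with the two matches the question expects, while the greedy version swallows the first TAG of each TAGTAG pair into the codon group.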


According to the pattern you gave, you could have overlapping matches. The following will find all matches, including overlapping ones:

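# Package variable that the code block inside the regex pushes to.
local our @matches;
$seq =~ /
   (
   ( AGGAGG )
   ( [ACGT]{4,12} )
   ( ATG )
   ( (?: (?! TAA|TAG|TGA ) [ACTG]{3} )* )   # codons up to, but not including, the first stop codon
   ( TAA|TAG|TGA )
   )
   (?{ push @matches, [ $-[1], $1, $2, $3, $4, $5, $6 ] })   # record the start offset and all captures
   (?!)                                                       # always fail, so the engine backtracks and tries the next possibility
/xg;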
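
As a possible alternative, the same stop-codon guard can be wrapped in a capturing lookahead and run in an ordinary while loop. This is only a sketch under stated assumptions: the test sequence below is made up for illustration (it is not from the question), and the names $gene and @hits are hypothetical. Unlike the code above, this variant reports at most one match per start offset, whereas the (?{ ... })(?!) version may also report several matches from the same start when more than one spacer length leads to an ATG.

```perl
use strict;
use warnings;

# Made-up sequence (an assumption, not from the question): the AGGAGG
# motifs at offsets 0 and 3 share a single ATG ... TAA reading frame.
my $seq = 'AGGAGGAGGAGGCCATGAAATAA';

my $gene = qr/
    AGGAGG                                   # fixed leading motif
    [ACGT]{4,12}                             # spacer of 4 to 12 bases
    ATG                                      # start codon
    (?: (?! TAA | TAG | TGA ) [ACGT]{3} )*   # codons that are not stop codons
    (?: TAA | TAG | TGA )                    # first in-frame stop codon
/x;

my @hits;
# The lookahead consumes nothing, so the next search begins one base
# later and can find a match that overlaps an earlier one.
while ($seq =~ /(?=($gene))/g) {
    push @hits, [ $-[0], $1 ];   # start offset and matched text
}

printf "%d: %s\n", @$_ for @hits;
# 0: AGGAGGAGGAGGCCATGAAATAA
# 3: AGGAGGAGGCCATGAAATAA
```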

An essential Perl regex feature, as opposed to the plain regex syntax of tools like grep, is the lazy quantifier: a ? following the * or + quantifier. It matches zero or more (for *) or one or more (for +) occurrences of the token preceding the quantifier, taking the shortest match possible.

$seq =~ /((AGGAGG)([ACGT]{4,12})(ATG)([ACGT]{3})*?(TAA|TAG|TGA))/igx
