Python Regex Findall Lookahead
I've created a function that searches a protein string for an open reading frame. Here it is:
import re
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

def orf_finder(seq, format):
    record = SeqIO.read(seq, format)  # reads in the sequence and tells Biopython what format it is
    string = []  # creates an empty list
    for i in range(3):
        string.append(record.seq[i:])  # collects the three forward reading frames
    protein_string = []  # creates an empty list
    protein_string.append([str(i.translate()) for i in string])  # translates each frame and stores the three protein strings as one nested list
    regex = re.compile('M''[A-Z]'+r'*')  # intended pattern: methionine, followed by any amino acids and ending with a stop codon
    res = max(regex.findall(str(protein_string)), key=len)  # res is a string of the longest translated ORF in the sequence
    print "The longest ORF (translated) is:\n\n",res,"\n"
    print "The first blast result for this protein is:\n"
    blast_records = NCBIXML.parse(NCBIWWW.qblast("blastp", "nr", res))  # blasts the sequence and puts the results into a 'record object'
    blast_record = blast_records.next()
    counter = 0  # used to output only the first blast hit: once it is printed, counter equals 1 and nothing more is printed
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if counter < 1:  # mechanism for stopping output after the first hit
                print 'Sequence:', alignment.title
                print 'Length:', alignment.length
                print 'E value:', hsp.expect
                print 'Query:',hsp.query[0:]
                print 'Match:',hsp.match[0:]
                counter = 1
The only issue is that I don't think my regex, re.compile('M''[A-Z]'+r'*'), finds overlapping matches. I know that a lookahead clause, ?=, might solve my problem, but I can't seem to implement it without getting an error.
Does anyone know how I can get it to work?
The code above uses Biopython to read in the DNA sequence, translate it, and then search for a protein reading frame: a sequence starting with 'M' and ending with '*'.
re.compile(r"M[A-Z]+\*")
This assumes that the string you are searching for starts with 'M', is followed by one or more uppercase letters A-Z, and ends with '*'.
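For the overlapping-match part of the question, one possible approach is to wrap the pattern in a zero-width lookahead with a capture group. The sketch below is only an illustration: the protein string is made up for the example, and the names original, escaped and overlapping are hypothetical, not from the question. It also shows that the two adjacent string literals in the question's pattern are concatenated by Python into 'M[A-Z]*', where '*' acts as a quantifier rather than a literal stop-codon character.

```python
import re

# Hypothetical translated frame, made up for illustration; '*' marks a stop codon.
protein = "AMKMLV*GGMQ*"

# The pattern from the question: adjacent string literals are concatenated,
# so this compiles 'M[A-Z]*', where '*' is a quantifier, not a literal '*'.
original = re.compile('M''[A-Z]'+r'*')

# The pattern from the answer, with an escaped (literal) '*'.
escaped = re.compile(r"M[A-Z]+\*")

# Assumed approach for overlapping matches: a zero-width lookahead wrapping a
# capture group. The lookahead consumes no characters, so the scan can start a
# new match at every position, and findall returns the captured group.
overlapping = re.compile(r"(?=(M[A-Z]+\*))")

print(original.pattern)
print(original.findall(protein))
print(escaped.findall(protein))
print(overlapping.findall(protein))

# Longest candidate ORF among the overlapping matches.
print(max(overlapping.findall(protein), key=len))
```

On this sample, the pattern from the question returns ['MKMLV', 'MQ'] without the stop codons, the escaped pattern returns ['MKMLV*', 'MQ*'], and the lookahead version additionally reports the nested 'MLV*'. The single-argument print calls behave the same way under the Python 2 used in the question and under Python 3.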