How to extract start and end sites based on capital letters in a sequence?
I would like to extract the start and end positions of the part of each site that is written in capital letters. Counting the sequence length with the code below does not return the positions accurately. The start site given in the P-match result that I need to process refers to the first letter of the sequence, but the start site I actually need is the first capital letter that occurs in each site. How can I retrieve the accurate start and end sites? Can anyone help me?
Text file A.txt:
Scanning sequence ID: BEST1_HUMAN
150 (-) 1.000 0.997 GGAAAggccc R05891
354 (+) 0.988 0.981 gtgtAGACAtt R06227
V$CREL_01c-RelV$EVI1_05Evi-1
Scanning sequence ID: 4F2_HUMAN
365 (+) 1.000 1.000 gggacCTACA R05884
789 (-) 1.000 1.000 gcgCGAAA R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F
Expected output:
Sequence ID start end
BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN 370 375
4F2_HUMAN 792 797
File B.txt:
Scanning sequence ID: hg17_ct_ER_ER_142
512 (-) 0.988 0.981 taTAGCTaagc Evi-1 R06227
V$EVI1_05
Scanning sequence ID: hg17_ct_ER_ER_1
213 (-) 1.000 0.989 aggggcaggGGTCA COUP-TF, HNF-4 R07445
V$COUP_01
Expected output:
hg17_ct_ER_ER_142 514 519
hg17_ct_ER_ER_1 222 227
Example code:
output_file = open('output.bed','w')
with open('A.txt') as f:
    text = f.read()
    chunks = text.split('Scanning sequence ID:')
    for chunk in chunks:
        if chunk:
            lines = chunk.split('\n')
            sequence_id = lines[0].strip()
            for line in lines:
                if line.startswith(' '):
                    start = int(line.split()[0].strip())
                    sequence = line.split()[-2].strip()
                    stop = start + len(sequence)
                    #print sequence_id, start, stop
                    seq='%s\t%i\t%i\n' % \
                        (sequence_id,start,stop)
                    output_file.write(seq)
output_file.close()
This code will get the label and start values:
import re
# Raw string so the regex escapes are not interpreted by Python first
p = r"Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+).*?(?P<start_value>\d+)"
with open("A.txt", "r") as f:
    s = f.read()
re.findall(p, s, re.DOTALL)
Sample output:
[('BEST1_HUMAN', '150'), ('4F2_HUMAN', '365')]
Then there's the calculation of the second number (the "end site"). In the code in the opening post I see: sequence = line.split()[-2].strip(); stop = start + len(sequence). Hence I would conclude that you want to increment the start value by the string length of the second-to-last column (GGAAAggccc etc.).
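One caveat about that column choice: the data lines do not all have the same number of fields, so counting from the right does not always land on the sequence. The sketch below is only a quick check on two data lines copied from A.txt; the variable names are illustrative and not part of the original code.

```python
# Two data lines copied from A.txt in the question.
lines = [
    "150 (-) 1.000 0.997 GGAAAggccc R05891",
    "789 (-) 1.000 1.000 gcgCGAAA R05828; R05834; R05835; R05838; R05839",
]

picked = []
for line in lines:
    fields = line.split()
    from_right = fields[-2]  # the column used in the opening post
    from_left = fields[4]    # fifth field counted from the left
    picked.append((from_right, from_left))
    print(from_right, from_left)
```

On the second line, counting from the right picks up one of the trailing identifiers instead of the sequence, whereas the fifth field from the left is the sequence on both lines.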
I can capture that column as well, using the following modified regexp:
p = r"Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+).*?(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)"
re.findall(p, s, re.DOTALL)
Sample output:
[('BEST1_HUMAN', '150', 'GGAAAggccc'), ('4F2_HUMAN', '365', 'gggacCTACA')]
Now we want to handle the situation where one label has more than one data line. For this, we need to drop re.findall and iterate over the lines instead:
import re

with open("A.txt", "r") as f:
    lines = f.readlines()

# Header line: captures the sequence ID
label_ptrn = re.compile(r"^Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+)$")
# Data line: start value, three skipped columns, then the sequence
line_ptrn = re.compile(r"^\s+(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+).*$")
# First run of capital letters inside the sequence
inner_ptrn = re.compile("[A-Z]+")

all_matches = []
for line in lines:
    m = label_ptrn.match(line)
    if m:
        label = m.groupdict().get("label")
        continue
    m = line_ptrn.match(line)
    if m:
        start = m.groupdict().get("start_value")
        sequence = m.groupdict().get("sequence")
        mi = inner_ptrn.search(sequence)
        if not mi:
            continue
        # span() gives the offsets of the capital run within the sequence
        span = mi.span()
        all_matches.append((label, int(start)+span[0], int(start)+span[1]))
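The offset arithmetic in the loop above can be checked in isolation. The following is a minimal sketch: `capital_span` is a hypothetical helper name introduced here for illustration, and the inputs are the two BEST1_HUMAN lines from A.txt.

```python
import re

def capital_span(start, sequence):
    """Return (start, end) of the first run of capitals, shifted by `start`.

    Hypothetical helper for illustration; returns None when the sequence
    contains no capital letters.
    """
    m = re.search(r"[A-Z]+", sequence)
    if m is None:
        return None
    # m.start() and m.end() are offsets within the sequence string
    return start + m.start(), start + m.end()

print(capital_span(150, "GGAAAggccc"))   # (150, 155)
print(capital_span(354, "gtgtAGACAtt"))  # (358, 363)
```

Both results agree with the expected output given in the question.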
Then you can write the matches to a file as follows:
# Text mode ("w"), since formatted strings are written
with open("output.bed", "w") as f:
    for m in all_matches:
        f.write('%s\t%i\t%i\n' % m)
Sample output:
BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN 370 375
4F2_HUMAN 792 797
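The label pattern above only accepts labels of the form `XXX_YYY` in capitals and digits, so it would skip the labels in B.txt (for example hg17_ct_ER_ER_142). A possible variant, sketched under assumptions that are not part of the original answer: any run of non-space characters counts as a label, leading whitespace on data lines is optional, and the input is passed as a string. The names `parse_sites`, `LABEL`, `DATA` and `CAPS` are hypothetical.

```python
import re

# Assumed relaxations (not in the original answer): any non-space label,
# optional leading whitespace on data lines.
LABEL = re.compile(r"^Scanning sequence ID:\s*(?P<label>\S+)\s*$")
DATA = re.compile(r"^\s*(?P<start>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)")
CAPS = re.compile(r"[A-Z]+")

def parse_sites(text):
    """Yield (label, start, end) for the first capital run on each data line."""
    label = None
    for line in text.splitlines():
        m = LABEL.match(line)
        if m:
            label = m.group("label")
            continue
        m = DATA.match(line)
        if m and label is not None:
            caps = CAPS.search(m.group("sequence"))
            if caps:
                start = int(m.group("start"))
                yield label, start + caps.start(), start + caps.end()

# Contents of B.txt as shown in the question.
b_txt = """Scanning sequence ID: hg17_ct_ER_ER_142
512 (-) 0.988 0.981 taTAGCTaagc Evi-1 R06227
V$EVI1_05
Scanning sequence ID: hg17_ct_ER_ER_1
213 (-) 1.000 0.989 aggggcaggGGTCA COUP-TF, HNF-4 R07445
V$COUP_01
"""

sites = list(parse_sites(b_txt))
for site in sites:
    print("%s\t%i\t%i" % site)
```

For the B.txt sample this yields 514 to 519 and 222 to 227, which is the expected output stated in the question. Like the loop above, it reports only the first run of capitals on each line.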
I think the problem is solved ;)