How to extract start and end sites based on capital letter in a sequence?
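I want to extract the start and end sites of the capital-letter stretch in each sequence. The code below computes the end from the length of the whole sequence, so it does not return the correct site information. The start given in the P-Match result refers to the first letter of the sequence, but the start I actually need is the first capital letter in each site. How can I retrieve the exact start and end? Can anyone help me?
Text file A.txt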
Scanning sequence ID: BEST1_HUMAN
150 (-) 1.000 0.997 GGAAAggccc R05891
354 (+) 0.988 0.981 gtgtAGACAtt R06227
V$CREL_01c-RelV$EVI1_05Evi-1
Scanning sequence ID: 4F2_HUMAN
365 (+) 1.000 1.000 gggacCTACA R05884
789 (-) 1.000 1.000 gcgCGAAA R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F
Expected output:
Sequence ID  Start  End
BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN 370 375
4F2_HUMAN 792 797
File B.txt
Scanning sequence ID: hg17_ct_ER_ER_142
512 (-) 0.988 0.981 taTAGCTaagc Evi-1 R06227
V$EVI1_05
Scanning sequence ID: hg17_ct_ER_ER_1
213 (-) 1.000 0.989 aggggcaggGGTCA COUP-TF, HNF-4 R07445
V$COUP_01
Expected output:
hg17_ct_ER_ER_142 514 519
hg17_ct_ER_ER_1 222 227
Sample code:
output_file = open('output.bed','w')
with open('A.txt') as f:
    text = f.read()
    chunks = text.split('Scanning sequence ID:')
    for chunk in chunks:
        if chunk:
            lines = chunk.split('\n')
            sequence_id = lines[0].strip()
            for line in lines:
                if line.startswith(' '):
                    start = int(line.split()[0].strip())
                    sequence = line.split()[-2].strip()
                    stop = start + len(sequence)
                    #print sequence_id, start, stop
                    seq='%s\t%i\t%i\n' % \
                        (sequence_id,start,stop)
                    output_file.write(seq)
output_file.close()
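This code gets the label and the start value:
import re
# Raw string so that \s and \d reach the regex engine unchanged
p = r"Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+).*?(?P<start_value>\d+)"
with open("A.txt", "r") as f:
    s = f.read()
print(re.findall(p, s, re.DOTALL))
Sample output: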
[('BEST1_HUMAN', '150'), ('4F2_HUMAN', '365')]
Next comes the calculation of the second number (the "end site"). In the code from the original post I see: sequence = line.split()[-2].strip(); stop = start + len(sequence). So I conclude that you want to add the length of the string in the second-to-last column (GGAAAggccc and so on) to the start value.
I can also capture that column with the following modified regexp:
p = r"Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+).*?(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)"
print(re.findall(p, s, re.DOTALL))
Sample output:
[('BEST1_HUMAN', '150', 'GGAAAggccc'), ('4F2_HUMAN', '365', 'gggacCTACA')]
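With the sequence column captured, the end site the question asks for depends on where the capital letters sit inside it, not on the length of the whole string. The arithmetic can be sketched as below; `uppercase_span` is a hypothetical helper name used only for illustration, and it assumes each sequence contains a single run of capital letters.

```python
import re

def uppercase_span(start, sequence):
    """Return (start, end) of the first run of capital letters,
    shifted by the start position reported for the sequence."""
    m = re.search(r"[A-Z]+", sequence)
    if m is None:
        return None  # no capital letters in this sequence
    return start + m.start(), start + m.end()

print(uppercase_span(150, "GGAAAggccc"))   # (150, 155)
print(uppercase_span(354, "gtgtAGACAtt"))  # (358, 363)
print(uppercase_span(365, "gggacCTACA"))   # (370, 375)
```

For comparison, start + len(sequence) on the first row would give 160, which is the mismatch described in the question.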
Now we want to handle the case where one label has several data lines. For that we need to drop re.findall and iterate over the lines instead:
import re

with open("A.txt", "r") as f:
    lines = f.readlines()

# Header line carrying the sequence label
label_ptrn = re.compile(r"^Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+)$")
# Data line: start, strand, two scores, then the sequence (leading whitespace optional)
line_ptrn = re.compile(r"^\s*(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+).*$")
# First run of capital letters inside the sequence
inner_ptrn = re.compile(r"[A-Z]+")

all_matches = []
label = None
for line in lines:
    m = label_ptrn.match(line)
    if m:
        label = m.group("label")
        continue
    m = line_ptrn.match(line)
    if m:
        start = int(m.group("start_value"))
        sequence = m.group("sequence")
        mi = inner_ptrn.search(sequence)
        if not mi:
            continue
        span = mi.span()
        # Offset both ends of the capital-letter run by the reported start
        all_matches.append((label, start + span[0], start + span[1]))
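One caveat about the label pattern above: it only accepts labels of the form UPPER_UPPER, so the headers in B.txt (for example hg17_ct_ER_ER_142) would not match it. A looser pattern that takes any run of non-whitespace characters as the label is one possible adjustment; the names `strict` and `loose` below are illustrative only, and whether `\S+` is too permissive depends on the real input files.

```python
import re

# Pattern as used above: upper-case letters/digits, one underscore, more of the same
strict = re.compile(r"^Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+)$")
# Assumed alternative: any non-whitespace run is taken as the label
loose = re.compile(r"^Scanning sequence ID:\s*(?P<label>\S+)\s*$")

for header in ("Scanning sequence ID: BEST1_HUMAN",
               "Scanning sequence ID: hg17_ct_ER_ER_142"):
    strict_hit = strict.match(header) is not None
    loose_label = loose.match(header).group("label")
    print(header, "| strict matches:", strict_hit, "| loose label:", loose_label)
```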
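Then you can write the matches out as follows:
# Text mode ("w"): the rows are str, which a binary-mode file would reject on Python 3
with open("output.bed", "w") as f:
    for m in all_matches:
        f.write('%s\t%i\t%i\n' % m)
Sample output: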
BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN 370 375
4F2_HUMAN 792 797
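The same steps can also be wrapped in a function that works on text already in memory, which makes it easy to check against the B.txt sample from the question. This is only a sketch: `extract_sites`, `HEADER`, `DATA` and `UPPER` are made-up names, the label is assumed to be any non-whitespace run, and the sequence is assumed to be the fifth whitespace-separated column.

```python
import re

HEADER = re.compile(r"^Scanning sequence ID:\s*(\S+)")
# start, strand, two scores, then the sequence column
DATA = re.compile(r"^\s*(\d+)\s+\S+\s+\S+\s+\S+\s+(\S+)")
UPPER = re.compile(r"[A-Z]+")

def extract_sites(text):
    """Return (label, start, end) for the capital-letter run of each data line."""
    label = None
    rows = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            label = header.group(1)
            continue
        data = DATA.match(line)
        if data and label is not None:
            run = UPPER.search(data.group(2))
            if run:
                start = int(data.group(1))
                rows.append((label, start + run.start(), start + run.end()))
    return rows

sample_b = """Scanning sequence ID: hg17_ct_ER_ER_142
 512 (-) 0.988 0.981 taTAGCTaagc Evi-1 R06227
V$EVI1_05
Scanning sequence ID: hg17_ct_ER_ER_1
 213 (-) 1.000 0.989 aggggcaggGGTCA COUP-TF, HNF-4 R07445
V$COUP_01
"""

for row in extract_sites(sample_b):
    print("%s\t%d\t%d" % row)
```

On this sample the rows come out as 514 519 and 222 227, which agrees with the expected output listed for B.txt.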
I think that solves the problem ;)