How to extract the start and end sites based on the capital letters in a sequence?

Question

I would like to extract the start and end sites of the part of each sequence that is written in capital letters. Counting the sequence length with the code below does not return the positions accurately. The start site given in the P-match result is based on the first letter of the sequence, but the start site I actually need is the position of the first capital letter in each site. How can I retrieve the accurate start and end sites? Can anyone help me?

Text file A.txt

Scanning sequence ID:   BEST1_HUMAN

          150 (-)  1.000  0.997  GGAAAggccc                                   R05891
          354 (+)  0.988  0.981  gtgtAGACAtt                                  R06227
V$CREL_01c-RelV$EVI1_05Evi-1

Scanning sequence ID:   4F2_HUMAN

          365 (+)  1.000  1.000  gggacCTACA                                   R05884
           789 (-)  1.000  1.000  gcgCGAAA                                       R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F

Expected output:

Sequence ID start end

BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN   370 375
4F2_HUMAN   792 797

File B.txt

Scanning sequence ID: hg17_ct_ER_ER_142

              512 (-)  0.988  0.981  taTAGCTaagc                        Evi-1          R06227
V$EVI1_05

Scanning sequence ID: hg17_ct_ER_ER_1

              213 (-)  1.000  0.989  aggggcaggGGTCA                     COUP-TF, HNF-4 R07445
V$COUP_01

Expected output:

hg17_ct_ER_ER_142 514 519
hg17_ct_ER_ER_1 222 227

Example code:

output_file = open('output.bed','w')
with open('A.txt') as f:
    text = f.read()
    chunks = text.split('Scanning sequence ID:')
    for chunk in chunks:
        if chunk:
            lines = chunk.split('\n')
            sequence_id = lines[0].strip()
            for line in lines:
                if line.startswith('              '):
                    start = int(line.split()[0].strip())
                    sequence = line.split()[-2].strip()
                    stop = start + len(sequence)
                    #print sequence_id, start, stop
                    seq='%s\t%i\t%i\n' % \
                         (sequence_id,start,stop)
                    output_file.write(seq)
output_file.close()

Answer 1

This code will get the label and start values:

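import re

# Raw string so the backslashes reach the regex engine unchanged.
p = r"Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+).*?(?P<start_value>\d+)"

with open("A.txt", "r") as f:
    s = f.read()

# re.DOTALL lets ".*?" run across line breaks to the first data line.
print(re.findall(p, s, re.DOTALL))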

Sample output:

[('BEST1_HUMAN', '150'), ('4F2_HUMAN', '365')]

Then there's the calculation of the second number (the "end site"). In the code in the opening post I see: sequence = line.split()[-2].strip(); stop = start + len(sequence). Hence I would conclude that you want to increment the start value by the string length of the second-to-last column (GGAAAggccc etc.).

I can capture that column as well, using the following modified regexp:

p = r"Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+).*?(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)"
print(re.findall(p, s, re.DOTALL))

Sample output:

[('BEST1_HUMAN', '150', 'GGAAAggccc'), ('4F2_HUMAN', '365', 'gggacCTACA')]
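Counting columns from the left, as the regexp does, seems safer than the line.split()[-2] used in the question, because the trailing columns vary in number between rows. A small sketch using one data line copied from A.txt (the variable names are only illustrative):

```python
# One data line from A.txt whose last column holds several identifiers.
line = "           789 (-)  1.000  1.000  gcgCGAAA                                       R05828; R05834; R05835; R05838; R05839"

fields = line.split()

# Second-to-last field is an identifier here, not the sequence.
second_to_last = fields[-2]

# The sequence is always the fifth whitespace-separated field (index 4).
sequence = fields[4]

print(second_to_last)
print(sequence)
```

The same issue appears in B.txt, where an extra factor-name column sits between the sequence and the identifier.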

Now we want to handle the situation where one label has more than one data line. For this, we need to drop re.findall and iterate over the lines instead. The loop also searches each sequence for its first run of capital letters and offsets the start value by that run's span:

import re

with open("A.txt", "r") as f:
    lines = f.readlines()

# Raw strings keep the regex escapes consistent and avoid invalid-escape warnings.
label_ptrn = re.compile(r"^Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+)$")
line_ptrn = re.compile(r"^\s+(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+).*$")
inner_ptrn = re.compile(r"[A-Z]+")  # first run of capital letters

all_matches = []
label = None
for line in lines:
    m = label_ptrn.match(line)
    if m:
        # Header line: remember the ID for the data lines that follow.
        label = m.group("label")
        continue
    m = line_ptrn.match(line)
    if m:
        start = int(m.group("start_value"))
        sequence = m.group("sequence")
        mi = inner_ptrn.search(sequence)
        if not mi:
            continue  # no capital letters on this line
        span = mi.span()
        # Offset the reported start by the position of the capital run.
        all_matches.append((label, start + span[0], start + span[1]))
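The offset arithmetic in the loop can be checked in isolation. The following is a minimal sketch, assuming the start value refers to the first letter of the sequence as the question states; the helper name uppercase_span is hypothetical and not part of the original answer.

```python
import re

def uppercase_span(start, sequence):
    """Return (start, end) of the first run of capitals, shifted by `start`.

    Hypothetical helper for illustration; returns None when the sequence
    contains no capital letters.
    """
    match = re.search(r"[A-Z]+", sequence)
    if match is None:
        return None
    begin, end = match.span()  # 0-based offsets, end is exclusive
    return start + begin, start + end

# Rows taken from A.txt in the question.
print(uppercase_span(150, "GGAAAggccc"))   # (150, 155)
print(uppercase_span(354, "gtgtAGACAtt"))  # (358, 363)
print(uppercase_span(365, "gggacCTACA"))   # (370, 375)
```

Because span() returns an exclusive end offset, the end value is the position just past the last capital letter, which is what the expected output in the question uses.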

Then you can write the matches to the output file as follows:

# Text mode ("w"): the formatted rows are str, so binary mode would fail on Python 3.
with open("output.bed", "w") as f:
    for m in all_matches:
        f.write('%s\t%i\t%i\n' % m)

Sample output:

BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN   370 375
4F2_HUMAN   792 797

I think the problem is solved ;)
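One caveat: the label pattern in the answer only accepts IDs made of capitals and digits with a single underscore, so the IDs in B.txt (for example hg17_ct_ER_ER_142) would not match it. The following is a sketch of a more permissive variant, assuming the ID is simply the non-space token after the header text; the function name parse_pmatch and the constant names are illustrative only.

```python
import re

# Assumption: the ID is whatever non-space token follows the header text.
LABEL = re.compile(r"^Scanning sequence ID:\s*(?P<label>\S+)\s*$")
# Data line: position, strand such as (+) or (-), two scores, then the sequence.
ROW = re.compile(r"^\s+(?P<start>\d+)\s+\(\S\)\s+\S+\s+\S+\s+(?P<sequence>[A-Za-z]+)")
UPPER = re.compile(r"[A-Z]+")

def parse_pmatch(text):
    """Return (label, start, end) tuples for the capital run on each data line."""
    results = []
    label = None
    for line in text.splitlines():
        m = LABEL.match(line)
        if m:
            label = m.group("label")
            continue
        m = ROW.match(line)
        if m and label is not None:
            hit = UPPER.search(m.group("sequence"))
            if hit:
                start = int(m.group("start"))
                results.append((label, start + hit.start(), start + hit.end()))
    return results

# Contents of B.txt as given in the question.
sample = """Scanning sequence ID: hg17_ct_ER_ER_142

              512 (-)  0.988  0.981  taTAGCTaagc                        Evi-1          R06227
V$EVI1_05

Scanning sequence ID: hg17_ct_ER_ER_1

              213 (-)  1.000  0.989  aggggcaggGGTCA                     COUP-TF, HNF-4 R07445
V$COUP_01
"""

for row in parse_pmatch(sample):
    print("%s\t%d\t%d" % row)
```

On the B.txt sample this yields the two rows listed in the question's expected output (514 to 519 and 222 to 227).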

Answer 1 (2 votes, accepted, 2015-06-19 07:16:05)
