How to extract start and end sites based on capital letters in a sequence?

I would like to extract the start and end sites of the part of each sequence that is written in capital letters. Counting the sequence length with the code below does not return the correct positions. The start site reported in the P-Match result refers to the first letter of the sequence, but the start site I actually need is the position of the first capital letter in each site. How can I retrieve the accurate start and end sites? Can anyone help me?

Text file A.txt:

Scanning sequence ID:   BEST1_HUMAN

          150 (-)  1.000  0.997  GGAAAggccc                                   R05891
          354 (+)  0.988  0.981  gtgtAGACAtt                                  R06227
V$CREL_01c-RelV$EVI1_05Evi-1

Scanning sequence ID:   4F2_HUMAN

          365 (+)  1.000  1.000  gggacCTACA                                   R05884
           789 (-)  1.000  1.000  gcgCGAAA                                       R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F

Expected output:

Sequence ID start end

BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN   370 375
4F2_HUMAN   792 797

File B.txt:

Scanning sequence ID: hg17_ct_ER_ER_142

              512 (-)  0.988  0.981  taTAGCTaagc                        Evi-1          R06227
V$EVI1_05

Scanning sequence ID: hg17_ct_ER_ER_1

              213 (-)  1.000  0.989  aggggcaggGGTCA                     COUP-TF, HNF-4 R07445
V$COUP_01

Expected output:

hg17_ct_ER_ER_142 514 519
hg17_ct_ER_ER_1 222 227

Example code:

output_file = open('output.bed','w')
with open('A.txt') as f:
    text = f.read()
    chunks = text.split('Scanning sequence ID:')
    for chunk in chunks:
        if chunk:
            lines = chunk.split('\n')
            sequence_id = lines[0].strip()
            for line in lines:
                if line.startswith('              '):
                    start = int(line.split()[0].strip())
                    sequence = line.split()[-2].strip()
                    stop = start + len(sequence)
                    #print sequence_id, start, stop
                    seq='%s\t%i\t%i\n' % \
                         (sequence_id,start,stop)
                    output_file.write(seq)
output_file.close()

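This code will get the label and start values:

import re

# Raw string, so the regex escapes are not interpreted as string escapes.
p = r"Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+).*?(?P<start_value>\d+)"

with open("A.txt", "r") as f:
    s = f.read()

# re.DOTALL lets ".*?" span the blank line between the header and the data line.
print(re.findall(p, s, re.DOTALL))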

Sample output:

[('BEST1_HUMAN', '150'), ('4F2_HUMAN', '365')]

Then there's the calculation of the second number (the "end site"). In the code in the opening post I see: sequence = line.split()[-2].strip(); stop = start + len(sequence). Hence I would conclude that you want to increment the start value by the string length of the second-to-last column (GGAAAggccc etc.).

I can capture that column as well, using the following modified regexp:

p = r"Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+).*?(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)"
print(re.findall(p, s, re.DOTALL))

Sample output:

[('BEST1_HUMAN', '150', 'GGAAAggccc'), ('4F2_HUMAN', '365', 'gggacCTACA')]
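
As a quick check of what this pattern returns, here is a small sketch run against an inline, trimmed copy of the A.txt data lines (the sample string and its shortened last column are my own illustration, not the real file). Because the pattern is anchored on the "Scanning sequence ID:" header, it can yield only one tuple per header, so the second data line under each label is not captured:

```python
import re

# Trimmed inline stand-in for A.txt (illustration only).
sample = """Scanning sequence ID:   BEST1_HUMAN

          150 (-)  1.000  0.997  GGAAAggccc   R05891
          354 (+)  0.988  0.981  gtgtAGACAtt  R06227

Scanning sequence ID:   4F2_HUMAN

          365 (+)  1.000  1.000  gggacCTACA   R05884
          789 (-)  1.000  1.000  gcgCGAAA     R05828
"""

pattern = (
    r"Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+).*?"
    r"(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)"
)

matches = re.findall(pattern, sample, re.DOTALL)
print(matches)
# Four data lines in the input, but only one match per header.
print(len(matches))  # 2
```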

Now we want to handle the situation where one label has more than one data line. For this, we need to drop re.findall and iterate over the lines instead:

import re

with open("A.txt", "r") as f:
    lines = f.readlines()

# Header line: captures the sequence ID after "Scanning sequence ID:".
label_ptrn = re.compile(r"^Scanning sequence ID:\s*(?P<label>[A-Z0-9]+_[A-Z0-9]+)$")
# Data line: start position, three columns to skip, then the matched sequence.
line_ptrn = re.compile(r"^\s+(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+).*$")
# First run of capital letters inside the sequence.
inner_ptrn = re.compile(r"[A-Z]+")

all_matches = []
for line in lines:
    m = label_ptrn.match(line)
    if m:
        # Remember the current label for the data lines that follow.
        label = m.groupdict().get("label")
        continue
    m = line_ptrn.match(line)
    if m:
        start = m.groupdict().get("start_value")
        sequence = m.groupdict().get("sequence")
        mi = inner_ptrn.search(sequence)
        if not mi:
            continue
        # span() gives the offsets of the capital run within the sequence.
        span = mi.span()
        all_matches.append((label, int(start)+span[0], int(start)+span[1]))
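
To make the arithmetic in the last line concrete, here is a minimal worked example for a single data line, using values copied from A.txt. The variable names are only for illustration. The span of the capital run is 0-based and end-exclusive, so the computed end is one past the last capital letter, which is what the expected output in the question shows:

```python
import re

start = 354               # position reported for the whole sequence
sequence = "gtgtAGACAtt"  # lowercase flanks around an uppercase core

core = re.search(r"[A-Z]+", sequence)   # first run of capital letters
offset_begin, offset_end = core.span()  # (4, 9)

print(start + offset_begin, start + offset_end)  # 358 363
```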

Then you can write the matches to a file as follows:

# Text mode ("w"): the formatted lines are str, not bytes.
with open("output.bed", "w") as f:
    for m in all_matches:
        f.write('%s\t%i\t%i\n' % m)

Sample output:

BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN   370 375
4F2_HUMAN   792 797

I think the problem is solved ;)
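
One caveat: the label pattern above only accepts upper-case IDs of the form XXX_YYY, so the IDs in B.txt (for example hg17_ct_ER_ER_142) would not match it. The following is a sketch of a more permissive variant, assuming the ID is simply the first non-whitespace token after the header text. The function name extract_sites and the constant names are hypothetical, and the B.txt content is inlined as a string for the demonstration:

```python
import re

# Assumption: the sequence ID is any run of non-whitespace characters.
LABEL_RE = re.compile(r"^Scanning sequence ID:\s*(?P<label>\S+)\s*$")
LINE_RE = re.compile(r"^\s+(?P<start>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)")
UPPER_RE = re.compile(r"[A-Z]+")


def extract_sites(text):
    """Return (label, start, end) for the first capital run of each data line."""
    sites = []
    label = None
    for line in text.splitlines():
        m = LABEL_RE.match(line)
        if m:
            label = m.group("label")
            continue
        m = LINE_RE.match(line)
        if m and label is not None:
            run = UPPER_RE.search(m.group("sequence"))
            if run:
                start = int(m.group("start"))
                sites.append((label, start + run.start(), start + run.end()))
    return sites


b_txt = """Scanning sequence ID: hg17_ct_ER_ER_142

              512 (-)  0.988  0.981  taTAGCTaagc                        Evi-1          R06227
V$EVI1_05

Scanning sequence ID: hg17_ct_ER_ER_1

              213 (-)  1.000  0.989  aggggcaggGGTCA                     COUP-TF, HNF-4 R07445
V$COUP_01
"""

for site in extract_sites(b_txt):
    print("%s\t%i\t%i" % site)
```

On the B.txt sample this yields 514/519 and 222/227, matching the expected output in the question. The same function should also handle A.txt, since its IDs are non-whitespace tokens as well.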
