重命名gffile中的名称ID。

Question

我有一个gff文件，看起来像这样：

contig1 loci    gene    452050  453069  15  -   .   ID=dd_g4_1G94;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci    exon    452050  452543  .   -   .   ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci    exon    452153  452543  .   -   .   ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
###

我希望重命名ID名称，从0001开始，这样对于上述基因，该条目为：

contig1 loci    gene    452050  453069  15  -   .   ID=dd_0001;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_0001.1;Parent=dd_0001
contig1 loci    exon    452050  452543  .   -   .   ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_0001.2;Parent=dd_g4_1G94
contig1 loci    exon    452153  452543  .   -   .   ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_0001.2.exon3;Parent=dd_0001.2

上面的示例仅是一个基因输入，但是我希望重命名所有基因及其对应的mRNA /外显子，从ID = dd_0001开始。 任何有关如何执行此操作的提示将不胜感激。

Answer 1

需要打开文件，然后将ID逐行替换。
这是文件I / O和str.replace（）的文档参考。

gff_filename = 'filename.gff'
replace_string = 'dd_g4_1G94'
replace_with = 'dd_0001'

lines = []
with open(gff_filename, 'r') as gff_file:
    for line in gff_file:
        line = line.replace(replace_string, replace_with)
        lines.append(line)

with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)

在Windows 10，Python 3.5.1中进行了测试，此方法有效。

要搜索ID，您应该使用regex 。

import re

gff_filename = 'filename.gff'
replace_with = 'dd_{}'
re_pattern = r'ID=(.*?)[;\.]'

ids  = []
lines = []
with open(gff_filename, 'r') as gff_file:
    file_lines = [line for line in gff_file]

for line in file_lines:
    matches = re.findall(re_pattern, line)
    for found_id in matches:
        if found_id not in ids:
            ids.append(found_id)

for line in file_lines:
    for ID in ids:
        if ID in line:
            id_suffix = str(ids.index(ID)).zfill(4)
            line = line.replace(ID, replace_with.format(id_suffix))
    lines.append(line)

with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)

还有其他方法可以执行此操作，但这是相当可靠的。

重命名gffile中的名称ID。

问题描述

1 个解决方案

解决方案1
1 已采纳 2017-03-23 18:05:11

重命名gffile中的名称ID。

问题描述

1 个解决方案

解决方案1 1 已采纳 2017-03-23 18:05:11

解决方案1
1 已采纳 2017-03-23 18:05:11