Renaming the Name ID in a GFF file
I have a GFF file that looks like this:
contig1 loci gene 452050 453069 15 - . ID=dd_g4_1G94;
contig1 loci mRNA 452050 453069 14 - . ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci exon 452050 452543 . - . ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci exon 452592 453069 . - . ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci mRNA 452153 453069 15 - . ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci exon 452153 452543 . - . ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci exon 452592 452691 . - . ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci exon 452729 453069 . - . ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
###
I would like to rename the IDs, numbering them from 0001, so that for the gene above the entries become:
contig1 loci gene 452050 453069 15 - . ID=dd_0001;
contig1 loci mRNA 452050 453069 14 - . ID=dd_0001.1;Parent=dd_0001
contig1 loci exon 452050 452543 . - . ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci exon 452592 453069 . - . ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci mRNA 452153 453069 15 - . ID=dd_0001.2;Parent=dd_0001
contig1 loci exon 452153 452543 . - . ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci exon 452592 452691 . - . ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci exon 452729 453069 . - . ID=dd_0001.2.exon3;Parent=dd_0001.2
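The example above shows only a single gene, but I would like to rename all genes and their corresponding mRNAs/exons, starting from ID=dd_0001. Any hints on how to do this would be much appreciated.
You need to open the file and replace the ID line by line.
See the Python documentation on file I/O and str.replace() for reference.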
gff_filename = 'filename.gff'
replace_string = 'dd_g4_1G94'
replace_with = 'dd_0001'

lines = []
# Read every line, replacing all occurrences of the old ID (in both ID= and Parent=).
with open(gff_filename, 'r') as gff_file:
    for line in gff_file:
        line = line.replace(replace_string, replace_with)
        lines.append(line)

# Overwrite the original file with the updated lines.
with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)
Tested on Windows 10 with Python 3.5.1; this approach works.
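To see what the replacement does to individual records without touching a file, here is a minimal sketch that applies str.replace() to two of the sample lines held in memory. The helper name rename_id is made up for illustration. Because str.replace() substitutes every occurrence in the string, the ID= and Parent= attributes on a line are updated together.

```python
def rename_id(lines, old_id, new_id):
    """Return copies of the lines with every occurrence of old_id replaced."""
    return [line.replace(old_id, new_id) for line in lines]


# Two records taken from the sample GFF in the question.
sample = [
    "contig1 loci gene 452050 453069 15 - . ID=dd_g4_1G94;\n",
    "contig1 loci mRNA 452050 453069 14 - . ID=dd_g4_1G94.1;Parent=dd_g4_1G94\n",
]

for line in rename_id(sample, "dd_g4_1G94", "dd_0001"):
    print(line, end="")
# prints:
# contig1 loci gene 452050 453069 15 - . ID=dd_0001;
# contig1 loci mRNA 452050 453069 14 - . ID=dd_0001.1;Parent=dd_0001
```

Note that this handles one hard-coded gene ID only, and that a plain substring replacement would also rewrite any longer ID that happens to contain the old one.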
To find all the IDs in the file rather than hard-coding one, you should use a regex.
import re

gff_filename = 'filename.gff'
replace_with = 'dd_{}'
# Capture the base ID: everything after "ID=" up to the first ';' or '.'.
re_pattern = r'ID=(.*?)[;.]'

ids = []
lines = []

with open(gff_filename, 'r') as gff_file:
    file_lines = gff_file.readlines()

# First pass: collect the distinct base IDs in order of appearance.
for line in file_lines:
    matches = re.findall(re_pattern, line)
    for found_id in matches:
        if found_id not in ids:
            ids.append(found_id)

# Second pass: replace each base ID with its zero-padded number.
for line in file_lines:
    for ID in ids:
        if ID in line:
            # +1 so that numbering starts at 0001 rather than 0000.
            id_suffix = str(ids.index(ID) + 1).zfill(4)
            line = line.replace(ID, replace_with.format(id_suffix))
    lines.append(line)

with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)
There are other ways to do this, but this one is fairly robust.
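As one possible alternative, here is a sketch that does the renaming in a single pass with re.sub() and a dictionary, rewriting only the values of the ID= and Parent= attributes instead of replacing substrings anywhere in the line. The function name renumber_gff_ids and its prefix and width parameters are hypothetical, and the sketch assumes that a base gene ID never contains a '.' (which holds for the sample data).

```python
import re

# Value of an ID= or Parent= attribute, up to the first '.', ';' or whitespace.
# Assumption: the base gene ID itself contains no '.'.
ATTR_PATTERN = re.compile(r'\b(ID|Parent)=([^.;\s]+)')


def renumber_gff_ids(lines, prefix='dd_', width=4):
    """Return new lines with each distinct base ID renamed to prefix + counter."""
    mapping = {}  # old base ID -> new base ID, in order of first appearance

    def rename(match):
        key, old_id = match.group(1), match.group(2)
        if old_id not in mapping:
            # The counter starts at 1, so the first gene becomes e.g. dd_0001.
            mapping[old_id] = prefix + str(len(mapping) + 1).zfill(width)
        return '{}={}'.format(key, mapping[old_id])

    renamed = []
    for line in lines:
        if line.startswith('#'):
            # Leave comment and '###' separator lines untouched.
            renamed.append(line)
        else:
            renamed.append(ATTR_PATTERN.sub(rename, line))
    return renamed
```

To use it on a file, read the lines with readlines(), pass them to renumber_gff_ids(), and write the result with writelines(), preferably to a new file so the original is kept. Anchoring on the attribute names avoids the case where one ID is a substring of another, since only the text right after ID= or Parent= is ever rewritten.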