Python: Extract DNA sequence from FASTA file using Bed file

How can I extract DNA sequences from a FASTA file? I tried bedtools and samtools. bedtools getfasta worked well, but for some of my files it returns "warning: chromosome was not found in fasta file", even though the chromosome names in the BED file and the FASTA file are exactly the same. I'm looking for an alternative way to do this task in Python.

Bed file:
chr1:117223140-117223856 3 7
chr1:117223140-117223856 5 9

Fasta file:
>chr1:117223140-117223856
CGCGTGGGCTAGGGGCTAGCCCC

Desired output:
>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC

BioPython is what you want to use:

from collections import defaultdict

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# Read names and positions from the BED file
positions = defaultdict(list)
with open('positions.bed') as f:
    for line in f:
        name, start, stop = line.split()
        positions[name].append((int(start), int(stop)))

# Parse the FASTA file and turn it into a dictionary keyed by record id
records = SeqIO.to_dict(SeqIO.parse('sequences.fasta', 'fasta'))

# Extract the short sequences (start and stop are treated as 1-based, inclusive)
short_seq_records = []
for name in positions:
    for (start, stop) in positions[name]:
        long_seq_record = records[name]
        # Seq objects no longer carry an alphabet (removed in Biopython 1.78),
        # so slice the Seq directly instead of rebuilding it from a string
        short_seq = long_seq_record.seq[start-1:stop]
        short_seq_record = SeqRecord(short_seq, id=name, description='')
        short_seq_records.append(short_seq_record)

# Write the extracted records to a FASTA file
with open('output.fasta', 'w') as f:
    SeqIO.write(short_seq_records, f, 'fasta')
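
If Biopython is not available, the same steps can be sketched with the standard library alone. This is a minimal illustration rather than a replacement for a real FASTA parser. The helper names `read_fasta` and `extract_regions` are made up for this example, the coordinates are assumed to be 1-based and inclusive (as in the desired output above), and the sample data from the question is held in memory instead of being read from files.

```python
import io

def read_fasta(handle):
    """Parse FASTA text into a dict: first word of each header -> sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def extract_regions(fasta_handle, bed_handle):
    """Yield (name, subsequence) per BED line, 1-based inclusive coordinates."""
    sequences = read_fasta(fasta_handle)
    for line in bed_handle:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, start, stop = fields[0], int(fields[1]), int(fields[2])
        if name not in sequences:
            print(f"{name} not found in FASTA")
            continue
        yield name, sequences[name][start - 1:stop]

# Sample data from the question (assumed whitespace-separated BED columns)
fasta_text = ">chr1:117223140-117223856\nCGCGTGGGCTAGGGGCTAGCCCC\n"
bed_text = "chr1:117223140-117223856 3 7\nchr1:117223140-117223856 5 9\n"

results = list(extract_regions(io.StringIO(fasta_text), io.StringIO(bed_text)))
for name, subseq in results:
    print(f">{name}\n{subseq}")
```

For real files, pass open file handles instead of the `io.StringIO` objects.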

Try this:

from Bio import SeqIO

# Load the whole FASTA file into memory as a dictionary keyed by record id
with open("input.fasta") as fasta:
    dict_fasta = {record.id: record for record in SeqIO.parse(fasta, "fasta")}

with open("input.bed") as bed, open("output.fasta", "w") as output:
    for line in bed:
        seq_id, begin, end = line.split()
        if seq_id in dict_fasta:
            # [int(begin)-1:int(end)] if the first base in a chromosome is numbered 1
            # [int(begin):int(end)+1] if the first base in a chromosome is numbered 0
            output.write(dict_fasta[seq_id][int(begin)-1:int(end)].format("fasta"))
        else:
            print(seq_id + " not found")

If the first base in a chromosome is numbered 1, you get:

>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC

If the first base in a chromosome is numbered 0, you get:

>chr1:117223140-117223856
GTGGG
>chr1:117223140-117223856
GGGCT
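
The two numbering schemes can be checked directly on the sample sequence with plain string slicing. The sketch below uses hypothetical helper names and adds a third case for comparison: the standard BED convention, where the start is 0-based and the end is excluded. That third case is not part of the answer above, but it is the interpretation a BED-aware tool would be expected to apply to the same numbers.

```python
sequence = "CGCGTGGGCTAGGGGCTAGCCCC"

def slice_one_based_inclusive(seq, begin, end):
    """First base is numbered 1; both begin and end are included."""
    return seq[begin - 1:end]

def slice_zero_based_inclusive(seq, begin, end):
    """First base is numbered 0; both begin and end are included."""
    return seq[begin:end + 1]

def slice_bed_half_open(seq, begin, end):
    """Standard BED convention: 0-based start, end is excluded."""
    return seq[begin:end]

for begin, end in [(3, 7), (5, 9)]:
    print(
        begin,
        end,
        slice_one_based_inclusive(sequence, begin, end),
        slice_zero_based_inclusive(sequence, begin, end),
        slice_bed_half_open(sequence, begin, end),
    )
# prints:
# 3 7 CGTGG GTGGG GTGG
# 5 9 TGGGC GGGCT GGGC
```

Only the 1-based inclusive slice reproduces the desired output from the question, so it is worth confirming which convention the coordinates in a given file actually follow.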

Your BED file needs to be tab-delimited for bedtools to use it. Replace your colons, dashes, and spaces with a tab.

The BedTools doc page says "bedtools requires that all BED input files (and input received from stdin) are tab-delimited."
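
A small sketch of one way to produce a tab-delimited file in Python. It only converts runs of whitespace between columns into single tabs and leaves the name in the first column untouched, on the assumption that the FASTA header uses exactly that name. Whether the colons and dashes inside the name also need replacing depends on how the FASTA headers are written. The function name `to_tab_delimited` is made up for this example.

```python
def to_tab_delimited(lines):
    """Yield each non-empty line with its whitespace-separated fields joined by tabs."""
    for line in lines:
        fields = line.split()
        if fields:
            yield "\t".join(fields)

# Sample BED lines from the question, separated by spaces
bed_lines = [
    "chr1:117223140-117223856 3 7",
    "chr1:117223140-117223856   5 9",
]

converted = list(to_tab_delimited(bed_lines))
for row in converted:
    print(row)
```

To convert a file, iterate over an open input handle and write each yielded row plus a newline to the output handle.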
