Python: Extracting DNA sequences from a FASTA file using a BED file

Question

How can I extract DNA sequences from a FASTA file? I tried bedtools and samtools. bedtools getfasta worked well, but for some of my files it returns "warning: chromosome was not found in fasta file", even though the chromosome names in the BED file and the FASTA file are exactly the same. I'm looking for an alternative that does this task in Python.

BED file:
chr1:117223140-117223856 3 7
chr1:117223140-117223856 5 9

FASTA file:
>chr1:117223140-117223856
CGCGTGGGCTAGGGGCTAGCCCC

Desired output:
>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC

Answer 1

Biopython is what you want to use:

from collections import defaultdict

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# read names and positions from the BED file
positions = defaultdict(list)
with open('positions.bed') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        name, start, stop = line.split()
        positions[name].append((int(start), int(stop)))

# parse the FASTA file and turn it into a dictionary keyed by record ID
records = SeqIO.to_dict(SeqIO.parse('sequences.fasta', 'fasta'))

# extract the short sequences
short_seq_records = []
for name in positions:
    for (start, stop) in positions[name]:
        long_seq = records[name].seq
        # coordinates are treated as 1-based and inclusive;
        # slicing a Seq returns a Seq, so no alphabet argument is needed
        # (Seq.alphabet was removed in Biopython 1.78)
        short_seq = long_seq[start-1:stop]
        short_seq_record = SeqRecord(short_seq, id=name, description='')
        short_seq_records.append(short_seq_record)

# write to file
with open('output.fasta', 'w') as f:
    SeqIO.write(short_seq_records, f, 'fasta')
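
Note that the lookup records[name] raises a KeyError when a name from the BED file has no matching FASTA record. A minimal sketch of a pre-check that lists such names, using only the standard library; the helper name missing_names and the sample data (including the chr2 line) are hypothetical and only for illustration:

```python
def missing_names(bed_lines, fasta_names):
    """Return BED names (first column) that are not among the FASTA record IDs."""
    known = set(fasta_names)
    missing = []
    for line in bed_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[0]
        if name not in known and name not in missing:
            missing.append(name)
    return missing


# Sample data for illustration only
bed_lines = [
    "chr1:117223140-117223856 3 7",
    "chr2:100-200 1 5",
]
fasta_names = ["chr1:117223140-117223856"]

for name in missing_names(bed_lines, fasta_names):
    # repr() makes stray whitespace or invisible characters visible
    print("not found in FASTA:", repr(name))
```

With Biopython, the FASTA names could be taken from the keys of the records dictionary built in the answer's code.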

Answer 2

Try this:

from Bio import SeqIO

# load the whole FASTA file into memory as a dictionary keyed by record ID
parser = SeqIO.parse("input.fasta", "fasta")
dict_fasta = dict([(seq.id, seq) for seq in parser])

with open("output.fasta", "w") as output, open("input.bed") as bed:
    for line in bed:
        if not line.strip():
            continue  # skip blank lines
        seq_id, begin, end = line.split()
        if seq_id in dict_fasta:
            # [int(begin)-1:int(end)] if the first base in a chromosome is numbered 1
            # [int(begin):int(end)+1] if the first base in a chromosome is numbered 0
            output.write(dict_fasta[seq_id][int(begin)-1:int(end)].format("fasta"))
        else:
            print(seq_id + " not found")

If the first base in a chromosome is numbered 1, you get:

>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC

If the first base in a chromosome is numbered 0, you get:

>chr1:117223140-117223856
GTGGG
>chr1:117223140-117223856
GGGCT
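
The two results above differ only in how the BED coordinates are interpreted. For reference, the BED format itself uses 0-based, half-open intervals (start included, end excluded), so a standard BED interval corresponds to seq[start:end], which is a third reading. A minimal sketch comparing the three readings on the sample sequence from the question, standard library only; the function name slice_interval and its convention labels are made up for illustration:

```python
def slice_interval(seq, start, end, convention="bed"):
    """Return the subsequence for (start, end) under a given coordinate convention."""
    if convention == "bed":
        # standard BED: 0-based start, end not included
        return seq[start:end]
    if convention == "one_based":
        # first base numbered 1, end included
        return seq[start - 1:end]
    if convention == "zero_based_inclusive":
        # first base numbered 0, end included
        return seq[start:end + 1]
    raise ValueError("unknown convention: " + convention)


seq = "CGCGTGGGCTAGGGGCTAGCCCC"
for convention in ("bed", "one_based", "zero_based_inclusive"):
    print(convention, slice_interval(seq, 3, 7, convention))
```

For the interval 3 7, only the 1-based, inclusive reading yields the CGTGG asked for in the question.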

Answer 3

Your BED file needs to be tab-delimited for bedtools to use it. Replace your colons, dashes, and spaces with a tab.

The bedtools documentation says: "bedtools requires that all BED input files (and input received from stdin) are tab-delimited."
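
A minimal sketch of converting a whitespace-separated BED file to tab-delimited form in Python. It assumes the first column has to stay identical to the FASTA header (in the question, the header itself contains a colon and a dash), so only the separators between columns are replaced; the function name to_tab_delimited is hypothetical:

```python
def to_tab_delimited(lines):
    """Yield BED lines with each run of spaces or tabs between columns replaced by one tab."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield "\t".join(fields)


# Sample lines from the question, separated by spaces
bed_lines = [
    "chr1:117223140-117223856 3 7",
    "chr1:117223140-117223856 5 9",
]

converted = list(to_tab_delimited(bed_lines))
for line in converted:
    print(line)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list.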

Answer 1: score 3, accepted, 2015-05-28 10:56:49
Answer 2: score 1, 2015-05-28 10:58:32
Answer 3: score 1, 2016-02-05 01:20:21
