
Split a FASTA file and rename the pieces on the basis of the first line

I have a huge file with the following content:

filename: input.txt

>chr1
jdlfnhl
dh,ndh
dnh.

dhjl

>chr2
dhfl
dhl
dh;l

>chr3

shgl
sgl

>chr2_random
dgld

I need to split this file so that I get four separate files, as below:

file 1: chr1.fa

>chr1
jdlfnhl
dh,ndh
dnh.

dhjl

file 2: chr2.fa

>chr2
dhfl
dhl
dh;l

file 3: chr3.fa

>chr3

shgl
sgl

file 4: chr2_random.fa

>chr2_random
dgld

I tried csplit on Linux, but could not get it to name the output files after the text immediately following ">".

csplit -z input.txt '/>/' '{*}'
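
If you would rather keep the csplit approach, the pieces it writes (named xx00, xx01, ... by default) can be renamed afterwards from their own first line. The following is a minimal sketch of that idea, not part of the original question; the function name rename_pieces and its parameters are hypothetical, and it assumes every piece starts with a ">" header line.

```python
import glob
import os


def rename_pieces(directory=".", pattern="xx*"):
    """Rename each csplit piece to <header>.fa, using the text after '>' on its first line."""
    renamed = []
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path) as handle:
            first_line = handle.readline().strip()
        if not first_line.startswith(">"):
            continue  # not a FASTA record; leave this piece untouched
        target = os.path.join(directory, first_line[1:] + ".fa")
        os.rename(path, target)
        renamed.append(target)
    return renamed
```

Calling rename_pieces() in the directory where csplit was run should turn xx00 into chr1.fa and so on for the sample input. Note that os.rename may overwrite an existing file of the same name on Linux.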

Since you indicate you're on a Linux box, awk seems to be the right tool for the job.

Usage:
./foo.awk your_input_file

foo.awk:

#!/usr/bin/awk -f

/^>chr/ {
    OUT=substr($0,2) ".fa"
}

OUT {
    print >OUT
}

You can also do that in one line:

awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input

If you find yourself wanting to do anything more complicated with FASTA/FASTQ files, you should consider Biopython.

Here's a post about modifying and rewriting FASTQ files: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

And another about splitting up FASTA files: http://lists.open-bio.org/pipermail/biopython/2012-July/008102.html

This script is slightly messy, but it should work on a large file because it reads only one line at a time.

To run it, use python thescript.py input.txt (it also reads from stdin, as in cat input.txt | python thescript.py).

import sys
import fileinput

in_file = False

for line in fileinput.input():
    if line.startswith(">"):
        # Close current file
        if in_file:
            f.close()

        # Make new filename
        fname = line.rstrip().partition(">")[2]
        fname = "%s.fa" % fname

        # Open new file
        f = open(fname, "w")
        in_file = True

        # Write current line
        f.write(line)

    elif in_file:
        # Write line to currently open file
        f.write(line)

    else:
        # Something went wrong, no ">chr1" found yet
        sys.stderr.write("Line %r encountered, but no preceding > line found\n" % line)

# Close the last output file
if in_file:
    f.close()
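
The script above names each file after everything that follows ">". Real FASTA headers often carry a description after the identifier, which the sample input does not show, so the following is only a sketch under that assumption: it takes the first whitespace-delimited token as the file name and silently skips any lines that appear before the first header. The function name split_fasta, the out_dir parameter, and the "unnamed" fallback are hypothetical.

```python
import os


def split_fasta(lines, out_dir="."):
    """Write each FASTA record in `lines` to <id>.fa inside out_dir; return the paths written."""
    written = []
    out = None
    try:
        for line in lines:
            if line.startswith(">"):
                # A new record starts: close the previous output file first
                if out is not None:
                    out.close()
                fields = line[1:].split()
                # Assumption: the ID is the first whitespace-delimited token of the header
                name = fields[0] if fields else "unnamed"
                path = os.path.join(out_dir, name + ".fa")
                out = open(path, "w")
                written.append(path)
            if out is not None:
                # Header and sequence lines alike go to the currently open file
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Because only one output file is open at a time and the input is consumed line by line, passing an open file object (for example, with open("input.txt") as f: split_fasta(f)) keeps memory use small even for large inputs.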

Your best bet would be to use the fastaexplode program from the exonerate suite:

$ fastaexplode -h
fastaexplode from exonerate version 2.2.0
Using glib version 2.30.2
Built on Jan 12 2012
Branch: unnamed branch

fastaexplode: Split a fasta file up into individual sequences
Guy St.C. Slater. guy@ebi.ac.uk. 2000-2003.

Synopsis:
--------
fastaexplode <path>

General Options:
---------------
-h --shorthelp [FALSE] <TRUE>
   --help [FALSE] 
-v --version [FALSE] 

Sequence Input Options:
----------------------
-f --fasta [mandatory]  <*** not set ***>
-d --directory [.] 

--
with open('data.txt') as f:
    lines=f.read()
    lines=lines.split('>')
    lines=['>'+x for x in lines[1:]]
    for x in lines:
        file_name=x.split('\n')[0][1:]  #use this variable to create the new file
        fil=open(file_name+'.fa','w')
        fil.write(x)
        fil.close()
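
One caveat applies to any approach that opens the output with mode 'w': if two records share the same header, the later one overwrites the earlier file. The sample input has no duplicates, so this is only a precautionary sketch; the function name duplicate_headers is hypothetical.

```python
from collections import Counter


def duplicate_headers(lines):
    """Return header names (text after '>') that occur more than once, sorted."""
    counts = Counter(line[1:].strip() for line in lines if line.startswith(">"))
    return sorted(name for name, n in counts.items() if n > 1)
```

Running it on an open file before splitting, for example with open("input.txt") as f: print(duplicate_headers(f)), should print an empty list when every header is unique.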

If you specifically want to try this with Python, you can use this code:

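f2 = None
f = open("input.txt", "r")
for line in f:
    if line.startswith(">"):
        # Header line: close the previous output and open <name>.fa
        if f2:
            f2.close()
        f2 = open(line[1:].strip() + ".fa", "w")
    if f2:
        # Write the header and the sequence lines to the current file
        f2.write(line)

if f2:
    f2.close()
f.close()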

Alternatively, Biopython can be used. Installing it in a virtualenv is easy:

virtualenv biopython_env
source biopython_env/bin/activate
pip install numpy
pip install biopython

Once this is done, splitting the FASTA file is easy. Let's assume you have the path to the FASTA file in the fasta_file variable:

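from Bio import SeqIO

parser = SeqIO.parse(fasta_file, "fasta")
for entry in parser:
    # entry.id is the first word after ">" (e.g. "chr1"), so for the input in
    # the question it already carries the "chr" prefix. Use "chr{}.fa" instead
    # if your IDs are bare numbers such as "1".
    SeqIO.write(entry, "{}.fa".format(entry.id), "fasta")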

Note that the auto-numbered {} form of str.format works in Python 2.7 and later; older versions need an explicit index such as {0}.

As for performance, I used this to split the human genome reference from the 1000 Genomes project in negligible time, but I don't know how it would perform on larger files.

#!/usr/bin/perl -w
use strict;
use warnings;

# Collect the sequence lines of each record, keyed by its header line
my %hash = ();
my $key = '';
open F, "<", "input.txt" or die $!;
while(<F>){
    chomp;
    if($_ =~ /^(>.+)/){
        $key = $1;
    }else{
        push @{$hash{$key}}, $_ ;
    }
}
close F;

# Write each record to a file named after the text following ">"
foreach(keys %hash){
    my $key1 = $_;
    my $key2 = '';
    if($key1 =~ /^>(.+)/){
        $key2 = $1;
    }
    open MYOUTPUT, ">", "$key2.fa" or die $!;
    print MYOUTPUT join("\n",$_,@{$hash{$_}}),"\n";
    close MYOUTPUT;
}
