
Change the identifier line to a random shortened name in a FASTA file

I have a FASTA file with about 8,000 sequences in it. I need to change each identifier line to a random, unique, shortened name (max length 10). The file contains sequences like this:

>AX039539.1.1212 Bacteria;Chloroflexi;Dehalococcoidia;Dehalococcoidales;
GAUGAACGCUAGCGGCGUGCCUUAUGCAUGCAAGUCGAACGGUCUUAAGCAAUUAAGAUAGUGGCAAACGGGUGAGUAACGCGUAAGUAACCUACCUCUAAGUGGGGGAUAGCUUCGGGAAACUGAAGGUAAUACCGCAUGUGGUGGGCCGACAUAAGUUGGUUCACUAAAGCCGUAAGGUGCUUGGUGAGGGGCUUGCGUCCGAUUAGCUAGUUGGUGGGGUAACGGCCUACCAAGGCUUCGAUCGGUAGCUGGUCUGAGAGGAUGAUCAGCCACACUGGGACUGAGACACGGCCCAGACUCCUACGGGAG

Here is my script so far:

use strict; 
use warnings;

#change ID line name to random unique shorten (max 10 characters) string

open (my $fh,"$ARGV[0]") or die "Failed to open file: $!\n";
open (my $out_fh, ">$ARGV[0]_shorten_ID.fasta");

my $string;

while(<$fh>) {

  for (0..9) { $string .= chr( int(srand(rand(25) + 65) )); }

  if ($_ =~ s/^>*.+\n/>$string/){  # change header FASTA header    

    print $out_fh "$_";

  }
}

close $fh;
close $out_fh;

I have been trying this, but the name starts with 10 characters and then grows by 10 more on every line as it goes down, and I lose the sequence. I realize there are similar questions already, but mine is slightly different: I need to randomly generate unique shortened names.

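The growing name can be fixed simply by resetting $string to an empty string just inside the while loop. You lose the sequence because the print sits inside the if, so lines that are not headers are never written, and the substitution also swallows the header's newline. Note too that srand only seeds the generator and should not be called for every character; rand alone is what you want there. But the whole approach is needlessly complex (and also inefficient, since you generate and throw away a random identifier for every line that does not start with >); I would go with just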

perl -pe 'BEGIN { srand(time()); }
    s/^>.*/ ">" . join ("", map { chr(rand(26)+65) } 0..9) /e' file.fasta
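The one-liner above does not check for collisions. With ten random letters a duplicate among 8,000 headers is extremely unlikely, but it is not impossible. If uniqueness has to be guaranteed, a small script along these lines might do. This is only a sketch: the sub names `random_id` and `rename_headers`, the `%seen` hash and the file name `rename_ids.pl` are made up for illustration, and it reads the whole file into memory, which should be fine for a few thousand sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: build a random identifier of $len uppercase letters.
sub random_id {
    my ($len) = @_;
    return join '', map { chr(int(rand(26)) + 65) } 1 .. $len;
}

# Hypothetical helper: take FASTA lines and return them with every header
# replaced by a fresh random ID; sequence lines pass through untouched.
sub rename_headers {
    my ($len, @lines) = @_;
    my %seen;    # IDs handed out so far
    my @out;
    for my $line (@lines) {
        if ($line =~ /^>/) {
            my $id;
            # Draw again whenever the ID has already been used.
            do { $id = random_id($len) } while $seen{$id}++;
            $line = ">$id\n";
        }
        push @out, $line;
    }
    return @out;
}

# Usage: perl rename_ids.pl input.fasta > output.fasta
# Does nothing when no file name is given.
print rename_headers(10, <>) if @ARGV;
```

Because every header is looked up in `%seen` before it is written, a duplicate draw is simply repeated, so the output IDs are unique by construction rather than merely by probability.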

If you do not absolutely require properly pseudorandom identifiers, maybe go with just

perl -pe 'BEGIN { $id = "a" x 7 } s/>.*/">" . $id++/e' file.fasta

which produces identifiers like "aaaaaaa", "aaaaaab", etc. (I went for seven-character identifiers, but four characters would be more than enough for 8,000 unique IDs; you'd end at "alvr".)
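The counter variant relies on Perl's "magic" string increment, where `++` on a purely alphabetic string steps through it like an odometer ("az" becomes "ba"). A minimal sketch to check the four-character claim; `sequential_ids` is a hypothetical helper name, not something from the answer:

```perl
use strict;
use warnings;

# Hypothetical helper: return $count sequential IDs starting from $start,
# using Perl's magic string increment ("az" -> "ba", and so on).
sub sequential_ids {
    my ($start, $count) = @_;
    my $id = $start;
    return map { $id++ } 1 .. $count;
}

my @ids = sequential_ids('aaaa', 8000);
print "$ids[0] $ids[1] $ids[-1]\n";    # prints "aaaa aaab alvr"
```

The IDs are unique by construction, since the counter never revisits a value, which makes this the simpler choice when the names do not actually have to look random.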
