
How to remove whitespace from a FASTA file using Perl?

My FASTA file:

>1a17_A a.118.8 TPR-like
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS 

Alternatively, sample FASTA files are available at http://www.ncbi.nlm.nih.gov/nuccore/?term=keratin.

open(fas,'d:\a4.fas');
$s=<fas>;
@fasta = <fas>;
@r1 = grep{s/\s//g} @fasta; # This does not remove the whitespace
@r2 = grep{s/(\s)$//g} @fasta; # This does not work
@r3 = grep{s/.$//g} @fasta; # This removes the last character, but not the last space
print "@r1\n@r2\n@r3\n";

This code gives the following output:

PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRT
ECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAG
DEHKRSVVDSLDIES MTIEDEYS

I expect the whitespace to be removed from line two onward. How can I do it?

Using a Perl one-liner that edits the file in place and removes spaces and tabs:

perl -i -pe 's|[ \t]||g' a4.fas

To remove all whitespace, including newlines:

perl -i -pe 's|\s||g' a4.fas

use strict;
use warnings;

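while(my $line = <DATA>) {
    $line =~ s/\s+//g;   # removes all whitespace, including the trailing newline
    print $line;         # so the sequence is printed as one continuous line
}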


__DATA__
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS 

grep is the wrong choice for making changes to an array. It filters the elements of the input array, passing as output only those elements for which the expression in the braces { .. } is true.
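As a rough illustration of that distinction, the sketch below uses grep only to select lines and map to build modified copies. The sample lines and the names @with_blank and @cleaned are made up for this example and are not part of the question's script.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the trailing space on the last one mimics the question's file.
my @fasta = ("PADGALKRAE\n", "LAYLRTECYG\n", "MTIEDEYS \n");

# grep as a filter: keep only the lines that contain a space or tab.
# A plain match does not modify anything.
my @with_blank = grep { /[ \t]/ } @fasta;

# map as a transformation: edit a copy of each element so that
# @fasta itself is left untouched.
my @cleaned = map { my $copy = $_; $copy =~ tr/ \t//d; $copy } @fasta;

print "lines containing a blank: ", scalar(@with_blank), "\n";
print @cleaned;
```

Working on a copy inside map is a deliberate choice here: both grep and map alias $_ to the original element, so editing $_ directly would also change @fasta.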

A substitution s/// returns true unless it made no changes to the target string. Taking your grep statements in turn:

@r1 = grep { s/\s//g } @fasta

This removes all whitespace, including newlines, from the strings in @fasta. It puts in @r1 only those elements that originally contained whitespace, which is probably all of them as they all ended in newline.

@r2 = grep { s/(\s)$//g } @fasta

Because of the $ anchor, this removes the character before the newline at the end of the string if it is a whitespace character. It also removes the newline. Any whitespace earlier in the string is untouched. It puts in @r2 only those elements that end in whitespace, which is probably all of them as they all ended in newline.

@r3 = grep { s/.$//g } @fasta;

This removes the character before the newline, whether it is whitespace or not. It leaves the newline, as well as any whitespace before the end. It puts in @r3 only those elements that contain more than just a newline, which again is probably all of them.
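The three descriptions above can be checked with a small sketch that applies each substitution to its own copy of one line. The sample string is hypothetical. Copies are used because grep aliases $_ to the array elements, so in the original script each statement sees whatever the previous one left behind in @fasta.

```perl
use strict;
use warnings;

# Hypothetical sample line: an inner space, a trailing space, then a newline.
my $line = "AB CD \n";

# Apply each substitution from the question to its own copy of the line.
(my $r1 = $line) =~ s/\s//g;      # every whitespace character, newline included
(my $r2 = $line) =~ s/(\s)$//g;   # the trailing space and the newline
(my $r3 = $line) =~ s/.$//g;      # only the character before the newline

# Show the results with any remaining newline made visible as \n.
for my $result ($r1, $r2, $r3) {
    (my $visible = $result) =~ s/\n/\\n/g;
    print "[$visible]\n";
}
```

Run as written, this should print [ABCD], [AB CD] and [AB CD\n], one per line.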

I think you want to retain the newlines (which are normally considered as whitespace).

This example will read the whole file, apart from the header, into the variable $data, and then use tr/// to remove spaces and tabs.

use strict;
use warnings;
use 5.010;
use autodie;

my $data = do {
  open my $fas, '<', 'D:\a4.fas';
  <$fas>; # Drop the header
  local $/;
  <$fas>;
};

$data =~ tr/ \t//d;
print $data;
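If the header should be kept rather than dropped, a line-by-line variant is one possible alternative. The sketch below is an assumption about what might be wanted, not part of the answer above: strip_blanks is a hypothetical helper, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: strip spaces and tabs from sequence lines, keeping
# header lines (those starting with '>') and all newlines unchanged.
sub strip_blanks {
    my ($fh) = @_;
    my @out;
    while (my $line = <$fh>) {
        $line =~ tr/ \t//d unless $line =~ /^>/;
        push @out, $line;
    }
    return @out;
}

# An in-memory filehandle stands in for the real file in this sketch.
my $sample = ">1a17_A a.118.8 TPR-like\nETVVKVKPHD\nMTIEDEYS \n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my @lines = strip_blanks($in);
close $in;

print @lines;
```

Reading one line at a time also avoids holding the whole file in memory, which may matter for large sequence files.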

Per perlrecharclass:

\h matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. \H matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.

Therefore the following will display your file with horizontal spacing removed:

perl -pe "s|\h+||g" d:\a4.fas

If you don't want to display the header, just add a condition on the line-number variable $. so that the first line is skipped:

perl -ne "s|\h+||g; print if $. > 1" d:\a4.fas

Note: I used double quotes in the above commands since your D:\ volume implies you're likely on Windows.
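To see why \h is used here rather than \s, the following sketch compares the two on a single made-up line. The variable names are illustrative only; \h needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;   # \h is available from Perl 5.10 onward

# Hypothetical line containing a tab, a trailing space and a newline.
my $line = "MTIE\tDEYS \n";

(my $horizontal = $line) =~ s/\h+//g;   # drops the tab and the space, keeps the newline
(my $all        = $line) =~ s/\s+//g;   # drops the newline as well

print "newline kept by \\h: ", ($horizontal =~ /\n\z/ ? "yes" : "no"), "\n";
print "newline kept by \\s: ", ($all        =~ /\n\z/ ? "yes" : "no"), "\n";
```

With \h the line structure of the file survives, whereas \s joins every line into one.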

