I have a file with records spread over alternating rows, as below, and would like to convert it into a two-column format.
>00000_x1688514
TGCTTGGACTACATATGGTTGAGGGTTGTA
>00001_x238968
TGCTTGGACTACATATTGTTGAGGGTTGTA
...
The desired output is:
>00000_x1688514 TGCTTGGACTACATATGGTTGAGGGTTGTA
>00001_x238968 TGCTTGGACTACATATTGTTGAGGGTTGTA
...
I would appreciate any help. Thanks.
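You may not be aware of the BioPerl modules for reading and writing sequence files and for other bioinformatics tasks. With Bio::SeqIO, your problem can be solved like this. Note that $seq->id does not include the leading '>', so the output shown after __END__ lacks it.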
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $file = 'o33.txt';
my $in = Bio::SeqIO->new( -file   => $file,
                          -format => 'fasta' );
while ( my $seq = $in->next_seq() ) {
    print $seq->id, "\t", $seq->seq, "\n";
}
__END__
00000_x1688514 TGCTTGGACTACATATGGTTGAGGGTTGTA
00001_x238968 TGCTTGGACTACATATTGTTGAGGGTTGTA
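In Python:
from itertools import izip  # Python 2; in Python 3, drop this import and use the built-in zip
with open('filepath') as fd, open('output_filepath', 'w') as outfile:
    # izip(fd, fd) pairs each header line with the sequence line that follows it
    for col in izip(fd, fd):
        outfile.write('\t'.join(col).replace('\n', '') + '\n')
The desired output will be written to output_filepath.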
Another Perl option is to set the input record separator ($/) to '>', so that each read returns everything up to the next '>' (a header line plus its sequence), and then replace the first newline in each chunk with a tab:
use Modern::Perl;
local $/ = '>';
do { s/\n/\t/; print }
for <DATA>;
__DATA__
>00000_x1688514
TGCTTGGACTACATATGGTTGAGGGTTGTA
>00001_x238968
TGCTTGGACTACATATTGTTGAGGGTTGTA
Output:
>00000_x1688514 TGCTTGGACTACATATGGTTGAGGGTTGTA
>00001_x238968 TGCTTGGACTACATATTGTTGAGGGTTGTA
For a file:
use Modern::Perl;
use autodie;
open my $inFile, '<', 'inFile.txt';
open my $outFile, '>', 'outFile.txt';
local $/ = '>';
do { s/\n/\t/; print $outFile $_ }
for <$inFile>;
close $inFile;
close $outFile;
Hope this helps!
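For comparison, here is a minimal Python 3 sketch of the same record-separator idea: split the whole text on '>' and join each header with what follows it. The function name records_to_columns and the sep parameter are made up for this illustration, and the sketch assumes that '>' never appears inside a header and that the file fits in memory.

```python
def records_to_columns(text, sep="\t"):
    """Split FASTA text on '>' and join each header with its sequence."""
    rows = []
    for chunk in text.split(">"):
        # The chunk before the first '>' is empty, so skip blank chunks.
        if not chunk.strip():
            continue
        # The first line of the chunk is the header; the rest is sequence.
        header, _, body = chunk.partition("\n")
        # Drop the remaining newlines so the sequence becomes one string.
        rows.append(">" + header.strip() + sep + body.replace("\n", "").strip())
    return rows


sample = """>00000_x1688514
TGCTTGGACTACATATGGTTGAGGGTTGTA
>00001_x238968
TGCTTGGACTACATATTGTTGAGGGTTGTA
"""

for row in records_to_columns(sample):
    print(row)
```

Unlike the Perl version, which only replaces the first newline of each chunk, this sketch removes every newline in the sequence part, so a sequence wrapped over several lines would also end up on one row.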
One approach:
perl -i -pe 's/\n/ / unless m/^[ACGT]+$/' FILENAME
This edits the file FILENAME in place, replacing the newline with a space on every line that isn't a string of A's, C's, G's, and T's.
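The same line-classification idea as the Perl one-liner above can be sketched in Python 3. This is only an illustration: the names SEQ_LINE and join_headers are hypothetical, and the sketch returns a string rather than editing a file in place.

```python
import re

# A line counts as sequence only if it consists entirely of A, C, G and T.
SEQ_LINE = re.compile(r"^[ACGT]+$")


def join_headers(lines):
    """Keep the newline after sequence lines; turn every other one into a space."""
    out = []
    for line in lines:
        text = line.rstrip("\n")
        if SEQ_LINE.match(text):
            out.append(text + "\n")
        else:
            out.append(text + " ")
    return "".join(out)


sample = [
    ">00000_x1688514\n",
    "TGCTTGGACTACATATGGTTGAGGGTTGTA\n",
    ">00001_x238968\n",
    "TGCTTGGACTACATATTGTTGAGGGTTGTA\n",
]
print(join_headers(sample), end="")
```

One caveat that applies to both versions: a sequence line containing anything outside ACGT (for example N or lowercase letters) does not match the pattern, so it would be treated like a header and joined to the next line.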
Using awk:
awk '{ printf "%s", $0 (substr( $0, 1, 1 ) == ">" ? " " : ORS) }' infile
Output:
>00000_x1688514 TGCTTGGACTACATATGGTTGAGGGTTGTA
>00001_x238968 TGCTTGGACTACATATTGTTGAGGGTTGTA
In Ruby I'd use something like:
File.readlines('test.txt').map(&:strip).each_slice(2) do |row|
  puts row.join(' ')
end
Which outputs:
>00000_x1688514 TGCTTGGACTACATATGGTTGAGGGTTGTA
>00001_x238968 TGCTTGGACTACATATTGTTGAGGGTTGTA
A tidier Python solution:
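from itertools import izip  # Python 2; in Python 3, drop this import and use the built-in zip
with open('test.txt') as inf, open('newtest.txt', 'w') as outf:
    # Iterating the same file object twice yields (header, sequence) pairs
    for head, body in izip(inf, inf):
        outf.write(head.rstrip() + ' ' + body)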
Assuming the input is in FASTA format with each sequence on a single line, you can use awk and the getline function:
awk '/^>/ { printf "%s ", $0; getline; print }' file.txt
Output:
>00000_x1688514 TGCTTGGACTACATATGGTTGAGGGTTGTA
>00001_x238968 TGCTTGGACTACATATTGTTGAGGGTTGTA
HTH
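Most of the answers above rely on every header being followed by exactly one sequence line. If the sequences might be wrapped over several lines, a streaming approach along the following lines could be used instead. This is a Python 3 sketch under that assumption; fasta_to_two_columns and its sep parameter are names invented for the example.

```python
def fasta_to_two_columns(lines, sep=" "):
    """Yield 'header<sep>sequence' rows, joining wrapped sequence lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header + sep + "".join(parts)
            # Sequence lines seen before the first header are discarded here.
            header, parts = line, []
        else:
            parts.append(line)
    # Emit the last record once the input is exhausted.
    if header is not None:
        yield header + sep + "".join(parts)


wrapped = [">a\n", "ACGT\n", "TTGA\n", ">b\n", "GG\n"]
for row in fasta_to_two_columns(wrapped):
    print(row)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, and the file is processed one line at a time.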