简体   繁体   中英

Read FASTA file into hash

I'm trying to read a FASTA file into a hash.

Problem

my $primer_seq_14bp = q(); // --> works
my $primer_seq_14bp; // --> error
Use of uninitialized value $primer_seq_14bp in hash element at ./script.pl line 24, <FWDPRIMER> line 1.

Why does this variable need to be declared as an empty string?

This is part of my first "on my own" script, so feedback on formatting, efficiency, etc. is welcome!

Code

#!/usr/bin/perl
use strict; use warnings;

my %primer_name_for;
my $primer_name;
my $primer_seq_14bp;

# Store fwd primer sequences in a sequence{name} hash.
open FWDPRIMER, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
while (<FWDPRIMER>) {
    chomp;
    if ( /^(>.+)/ ) {
        $primer_name = $1;
    }
    else {
        $primer_seq = $_;
        $primer_seq_14bp = substr $primer_seq, 0, 14;
    }
    $primer_name_for{$primer_seq_14bp} = $primer_name;
}
close FWDPRIMER;

Input

Fasta file, which is formatted:

>Sequence_name
DNA_sequence
>Sequence_name
DNA_sequence

etc.

The entire file is in this format, so I thought I could use $_ because if the line isn't a name it has to be a sequence.

Goal

A hash where DNA_sequence is the key for Sequence_name . The ultimate action is to search the hash with read sequences to get the biological name of the read (reads come with a default name).

Example

Input primer

>snp_fwd_primer
AAGCTCCTGCAGGTCATCTC

Input read

>read_name
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG

Desired output

Note that the first 20 bp of the read matches the primer sequence. The script cuts the read sequences to 14 bp because that is the length of my shortest primer sequence.

>snp_fwd_primer
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG

Update

I have a solution, however, the answers below are much better. I'm sharing it for other lost biologists (my solution is based on bioinformatics forums).

I changed the record delimiter $/ to read the input primer fasta one entry at a time (I always convert multi-line fasta to single-line; for multi-fasta see @ikegami's answer). Then used regrex to separate the header from the sequence. The major downside is that I have to manually remove the very first > from the input primer fasta (learned that you can't close and reopen the same file in a perl script).

This did not apply to the input read fasta because I wanted only the sequence line, so I skipped any line that began with > .

#!/usr/bin/perl
# Usage: thisScript.pl fwdPrimers.fasta reads.fasta > reads_with_primer_names.fasta
# Fasta files need to be in single-line format.
# Before running, remove the first instance of '>' from the fwdPrimerFastaFile!
# Purpose: for reads that have marker forward primer sequences, rename the read with the marker name.

use strict; use warnings;

open PRIMERFILE, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
$/ = '>';

# Store fwd primer sequences in a sequence{name} hash.
my %primer_name_for;

while (<PRIMERFILE>) {
    /(.+)\n(.+)\n/ and my $primer_name = $1 and my $primer_seq = $2;
    my $primer_seq_14bp = substr $primer_seq, 0, 14;
    $primer_name_for{$primer_seq_14bp} = $primer_name;
}
close PRIMERFILE;

# Store reads in a full_length_read{trimmed_read} hash. 
# Use trimmed read to search the fwd primer hash. 
# Combine primer name and full length read.
open READFILE, '<', $ARGV[1] or die "error reading $ARGV[1]\n";
$/ = "\n";

my %read_fullSeq_for;
my $read_seq;

while (<READFILE>) {
    chomp;
    if($_ =~ /^>/){ 
        next;
    }
    else {$read_seq = $_};
    my $read_seq_14bp = substr $read_seq, 0, 14;
    $read_fullSeq_for{$read_seq_14bp} = $read_seq;
    if (exists $primer_name_for{$read_seq_14bp} and $read_fullSeq_for{$read_seq_14bp}) {
        print ">$primer_name_for{$read_seq_14bp}\n$read_fullSeq_for{$read_seq_14bp}\n";
    }
}
close READFILE;

__END__

Say the first line matches /^(>.+)/ . Then you end up doing

$primer_name_for{undef} = $primer_name;

because you never assigned anything to

$primer_seq_14bp

You should only be adding to the hash when you have a sequence.

my %primer_name_for;
while (<>) {
    chomp;
    if (/^>(.*)/s) {
        $primer_name = $1;
    } else {
        my $primer_seq = $_;
        my $primer_seq_14bp = substr($primer_seq, 0, 14);
        $primer_name_for{$primer_seq_14bp} = $primer_name;
    }
}

One problem. The sequences can span multiple lines. To fix this, we'll just add to the hash when we encounter a heading or EOF.

my (%primer_name_for, $name, $seq);
while (<>) {
    chomp;
    if (/^>(.*)/s) {
        $primer_name_for{substr($seq, 0, 14)} = $name if defined($name);
        $name = $_;
        $seq = '';
    } else {
        $seq .= $_;
    }
}

$primer_name_for{substr($seq, 0, 14)} = $name if defined($name);

Please investigate following sample code for compliance with your problem.

My understanding that OP tries to build reverse lookup table for sequence name as hash value, based on partial DNA sequence as a hash key.

NOTE: it would be helpful for clarification of the problem to have some sample input data and desired output

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $length = 14;
my %sequence = do { local $/; <DATA> =~ />(.*?)\n([^>]*)/gs};

$sequence{$_} =~ s/\n//g for keys %sequence;

my %lookup_table = map { substr($sequence{$_},0,$length) => $_ } keys %sequence;

say Dumper(\%sequence);
say Dumper(\%lookup_table);

exit 0;

__DATA__
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Output

$VAR1 = {
          'SEQUENCE_2' => 'SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH',
          'SEQUENCE_1' => 'MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL'
        };

$VAR1 = {
          'MTEITAAMVKELRE' => 'SEQUENCE_1',
          'SATVSEINSETDFV' => 'SEQUENCE_2'
        };

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM