Perl - Reading two files to compare contents

I am working with a text file that contains data in a format like so:

To Kill A Mocking Bird|Harper Lee|S1|4A
Life of Pi|Yann Martel|S3|5B
Hunger Games|Suzzanne Collins|S2|2C

The actual data file has many more entries, including more than three instances of S1.

I am writing a program in Perl to compare the data in this file with another file, mainly the filing information such as S1 and 4A.

I approached this by first storing the data from the file into a string. I then split the string by using pipe | as a delimiter and stored it into an array. I then used a foreach loop to iterate through the array to find matching information.

Note that all files are in the same directory.

#!/usr/bin/perl

open(INFO, "psychnet3.data");
my $dbinfo = <INFO>;
close(INFO);

@dbarray = split("|", $dbinfo);
$index_counter = 0;

foreach $element (@dbarray) {

  if ($element =~ "S1") {
    open(INFO, ">>logfile.txt");
    print INFO "found a S1";
    close(INFO);

    if ($dbarray[$index_counter + 1] =~ "4A") {
      $counter++;
      open(INFO, ">>logfile.txt");
      print INFO "found S1 4A";
      close(INFO);
    }
  }
  $index_counter++;
}

The output file does not show all instances of S1.

I also tried using eq as the conditional instead of =~, and still had no luck.

I am new to Perl, coming from C#. Is there a syntax mistake I'm making, or is it a logic error?

There are quite a few ways to do this, some of which use regular expressions and some of which don't. If the fields you seek are only the 3rd and 4th of each line and your files have a standard structure, then it can be done like this:

EDIT:

The file is not so consistent, so use a regex instead.

Also removed the @dbinfo array. It's not necessary and memory is not free :)

(remember to change the name of the filehandle, to avoid conflict with inner loop filehandles with same name)

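open(MINFO, "psychnet3.data") or die "Cannot open psychnet3.data: $!";
while (my $line = <MINFO>) {
    # Look for the section code anywhere after a pipe on this line
    if ( $line =~ m/\|S1/i ) {
        open(INFO, ">>logfile.txt");
        print INFO "found a S1\n";
        close(INFO);

        # Only count the line if the shelf code is present as well
        if ( $line =~ m/\|4A/i ) {
            $counter++;
            open(INFO, ">>logfile.txt");
            print INFO "found S1 4A\n";
            close(INFO);
        }
    }
}
close(MINFO);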
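For what it's worth, two details in the original script likely contribute to the missed matches. First, my $dbinfo = <INFO>; reads only the first line of the file, because the assignment is in scalar context. Second, split("|", $dbinfo) treats "|" as a regex, where an unescaped pipe is alternation between two empty patterns, so the string is split between every character. A minimal sketch of the split difference, using one sample record from the question (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $record = "To Kill A Mocking Bird|Harper Lee|S1|4A";

# An unescaped "|" is regex alternation between two empty patterns,
# so it matches the empty string and splits between every character.
my @chars = split("|", $record);

# Escaping the pipe splits on the literal character instead.
my @fields = split(/\|/, $record);

print scalar(@chars), " pieces without escaping\n";
print scalar(@fields), " fields with escaping: @fields[2,3]\n";
```

With the pipe escaped, the section and shelf codes end up in their own elements and can be compared with eq.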

You don't mention how you compare this data. Is this done by the book title? Or is this done by author? That makes it a bit difficult to know exactly how this information needs to be stored.

Your data is a bit more complex than individual pieces of data. This means that the default Perl data structures, the scalar ($foo), the array (@foo), and the hash (%foo), simply won't cut it on their own. It's time to learn about references.

Technically, a reference is the location in memory where some other item is stored. You create a reference by putting a backslash in front of the name:

$ref_to_foo_array = \@foo;

The $ref_to_foo_array is the memory location where my @foo array is stored. The big advantage is that instead of referring to a whole array of values, I am now referring to a single value: the location in memory where @foo is stored. That means I can put that information into an array or hash:

$bar[0] = $ref_to_foo_array;
$bar[1] = $ref_to_some_other_array;

Now, @bar isn't merely storing two values. Instead, @bar is storing the information in two arrays! I have an Array of Arrays!

To get back my original array, I simply dereference it by putting the correct sigil in front of my reference:

@foo = @{ $bar[0] };

To make things easier, I can use the -> as a means of dereferencing a single value:

$array_reference = $bar[0];
$array_reference->[0];   # First item in the array being referenced
$array_reference->[1];   # Second item

Of course, I could do this too:

$bar[0]->[0] # First item in the array being referenced
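Putting the snippets above together, here is a small self-contained sketch. The array contents are made-up sample values, and @other, @copy, $first, and $second are names introduced only for this illustration:

```perl
use strict;
use warnings;

# Sample values, invented purely for illustration.
my @foo   = ('S1', '4A');
my @other = ('S3', '5B');

# Store a reference to each array in @bar: an array of arrays.
my @bar = (\@foo, \@other);

# Dereference the whole array at once...
my @copy = @{ $bar[0] };

# ...or reach a single element with the arrow.
my $first  = $bar[0]->[0];
my $second = $bar[1]->[1];

print "copy: @copy, first: $first, second: $second\n";
```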

So what does all this do? Watch:

use strict;
use warnings;
use autodie;
use feature qw(say);

use constant {
    BOOK_FILE  => 'psychnet3.data',
};

open my $book_fh, "<", BOOK_FILE;

my %book_hash;
for my $book ( <$book_fh> ) {
    chomp $book;
    my ( $title, $author, $section, $shelf ) = split /\s*\|\s*/, $book;

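    # A fresh hash for each book; declared with % so strict accepts the keys below
    my %temp_book_hash;
    $temp_book_hash{AUTHOR} = $author;
    $temp_book_hash{SECTION} = $section;
    $temp_book_hash{SHELF} = $shelf;

    # Store a reference to the per-book hash under the book's title
    $book_hash{$title} = \%temp_book_hash;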
}

I have a %book_hash which is keyed by the title of the book. However, each entry in this single hash stores the author, section, and shelf where that book is kept. Each book has three different bits of information associated with it, but I am able to store all of that information in a single data structure. No need to keep parallel arrays or hashes.

How do I get this information? Simple:

my $title = "To Kill A Mocking Bird";   # must match the title in the data file exactly
my %temp_book_hash = %{ $book_hash{$title} };
say "The book $title was written by $temp_book_hash{AUTHOR}";

By dereferencing the hash I had stored in $book_hash{$title}, I can pull out the author's name and the filing information.

The syntax is a bit clunky. I am constantly making temporary variables to pass the information back and forth. Fortunately, Perl allows me to skip that step. Here's the same loop as before:

for my $book ( <$book_fh> ) {
    chomp $book;
    my ( $title, $author, $section, $shelf ) = split /\s*\|\s*/, $book;

    $book_hash{$title} = {};   # Line not necessary

    $book_hash{$title}->{AUTHOR}  = $author;
    $book_hash{$title}->{SHELF}   = $shelf;
    $book_hash{$title}->{SECTION} = $section;
}

Instead of having that temporary hash, I can store the data directly in my outermost hash. This syntax is a lot shorter and cleaner. And it's easier to understand.

The line $book_hash{$title} = {}; declares that $book_hash{$title} will hold a hash reference rather than a plain string or number. This line isn't necessary at all: Perl will create the hash reference on its own when it sees $book_hash{$title}->{AUTHOR} = $author;. However, I like to declare my intention that I am storing a reference in that variable. That way, if further down in my program I have $book_hash{$title} = $author;, another developer will recognize that I've made a mistake.
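Perl creating the inner hash on first use is usually called autovivification. A minimal sketch of it, using one record from the question as sample data ($before and $kind are names introduced only for this illustration):

```perl
use strict;
use warnings;

my %book_hash;
my $title = "Life of Pi";

# No inner hash exists yet for this title.
my $before = exists $book_hash{$title} ? "yes" : "no";

# Assigning through the arrow creates the inner hash automatically.
$book_hash{$title}->{AUTHOR} = "Yann Martel";

# ref() reports what kind of reference is now stored there.
my $kind = ref $book_hash{$title};

print "existed before: $before, now holds a $kind reference\n";
```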

I can use that same -> notation to pull out the information from my book without having to create temporary variables too:

my $title = "To Kill A Mocking Bird";   # must match the title in the data file exactly
say "The book $title was written by " . $book_hash{$title}->{AUTHOR};
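Coming back to the original question, the same structure makes it easy to pick out every book with a given filing code. A sketch, assuming %book_hash was filled by the loop shown earlier; here it is populated by hand with the three sample records so the snippet runs on its own, and @matches is a name introduced for illustration:

```perl
use strict;
use warnings;

# Hand-built stand-in for the hash the loading loop would produce.
my %book_hash = (
    "To Kill A Mocking Bird" => { AUTHOR => "Harper Lee",       SECTION => "S1", SHELF => "4A" },
    "Life of Pi"             => { AUTHOR => "Yann Martel",      SECTION => "S3", SHELF => "5B" },
    "Hunger Games"           => { AUTHOR => "Suzzanne Collins", SECTION => "S2", SHELF => "2C" },
);

# Keep the titles whose section and shelf both match exactly.
my @matches = grep {
    $book_hash{$_}->{SECTION} eq 'S1' && $book_hash{$_}->{SHELF} eq '4A'
} sort keys %book_hash;

print "found S1 4A: $_\n" for @matches;
```

Because the fields were already separated by split, eq is enough here and no regex is needed.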

You mentioned that you're comparing two files. Imagine I store the first one in %book_hash and the second one in %book_hash2. I can loop through my books and see which ones are incorrectly shelved.

for my $title ( keys %book_hash ) {
    if ( $book_hash{$title}->{SHELF} ne $book_hash2{$title}->{SHELF} ) {
       say "The book $title is stored on two different shelves!";
    }
    else {
       say "The book $title is on the correct shelf";
    }
}
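One caveat with the loop above: if a title is missing from the second file, $book_hash2{$title}->{SHELF} is undefined and the comparison warns. Below is a sketch of one way to guard against that, assuming both files use the same pipe-delimited layout. The helper load_books, the @report array, and the inline sample data are hypothetical; in-memory filehandles stand in for the real files so the snippet is self-contained:

```perl
use strict;
use warnings;

# Hypothetical helper: read pipe-delimited records into a hash of hashes.
sub load_books {
    my ($fh) = @_;
    my %books;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless $line =~ /\S/;    # skip blank lines
        my ( $title, $author, $section, $shelf ) = split /\s*\|\s*/, $line;
        $books{$title} = { AUTHOR => $author, SECTION => $section, SHELF => $shelf };
    }
    return %books;
}

# Invented sample data standing in for the two files.
my $first  = "Life of Pi|Yann Martel|S3|5B\nHunger Games|Suzzanne Collins|S2|2C\n";
my $second = "Life of Pi|Yann Martel|S3|1A\n";

open my $fh1, '<', \$first  or die "Cannot open first data set: $!";
open my $fh2, '<', \$second or die "Cannot open second data set: $!";
my %book_hash  = load_books($fh1);
my %book_hash2 = load_books($fh2);

my @report;
for my $title ( sort keys %book_hash ) {
    if ( !exists $book_hash2{$title} ) {
        push @report, "The book $title is missing from the second file";
    }
    elsif ( $book_hash{$title}->{SHELF} ne $book_hash2{$title}->{SHELF} ) {
        push @report, "The book $title is stored on two different shelves!";
    }
    else {
        push @report, "The book $title is on the correct shelf";
    }
}
print "$_\n" for @report;
```

Checking exists first also keeps the lookup from silently creating an empty entry in %book_hash2.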

References are a bit hard to understand, but I hope you can see the power of being able to store all of your information about your book in a single data structure.
