
Perl - Count occurrences of specific words for each line of a file

I did a lot of searching but found nothing quite like what I wanted. Perl noob here.

I have a text file already neatly organised into lines of data. Say the two strings I'm interested in are "hello" and "goodbye". I want to write a quick Perl script that looks at the first line and counts how many times "hello" and "goodbye" occur, then goes to the next line and does the same, adding to the earlier counts, so that by the end of the script I can print the total count for each string in the file. The line-by-line approach is important because I want to keep several counters: the number of lines that contain both words, the number of lines that contain just one of the words and not the other, the number of lines that contain "hello" once but "goodbye" multiple times, and so on. Really it's about the number of times each condition is found on a line, rather than how many times the words appear in the whole document.

So far I'm thinking:

#!/usr/bin/perl
use strict; use warnings;

# die etc. (omitted here to save time)

my $word_a = "hello";
my $word_b = "goodbye";
my $single_both = 0; # Number of lines where both words appear only once.
my $unique_hello = 0; # Number of lines where only hello appears, goodbye doesn't.
my $unique_goodbye = 0; # Number of lines where goodbye appears, hello doesn't.
my $one_hello_multiple_goodbye = 0; # Number of lines where hello appears once and goodbye appears multiple times.
my $one_goodbye_multiple_hello = 0; # Number of lines where goodbye appears once and hello appears multiple times.
my $multiple_both = 0; # Number of lines where goodbye and hello both appear multiple times.

while (my $line = <>) {

    # Magic happens here

}

# then the results for each of those variables can be printed at the end.

As I said, I'm a noob. I'm confused about how to even count the occurrences in each line. If I knew that, I'm sure I could figure out all the different conditions I've listed above. Should I be using arrays? Hashes? Or have I approached this from entirely the wrong direction, considering what I want? I need to count the number of lines that meet each of the conditions I've listed as comments after those variables. Any help at all is greatly appreciated!

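You can count the occurrences of a word with a regex. For example, $hello = () = $line =~ /hello/g; counts how many times hello occurs in $line. How does it work? With the /g flag in list context, the match returns every hit; assigning that list to the empty list () and then evaluating the assignment in scalar context gives the number of elements on the right-hand side, which is the number of matches. As a one-liner that prints per-line and total counts: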

perl -n -E '$hello = () = /hello/g; $goodbye = () = /goodbye/g; say "line $.: hello - $hello, goodbye - $goodbye"; $hello_total += $hello; $goodbye_total += $goodbye;}{say "total: hello - $hello_total, goodbye - $goodbye_total";' input.txt

Output for a sample file:

line 1: hello - 0, goodbye - 0
line 2: hello - 1, goodbye - 0
line 3: hello - 1, goodbye - 1
line 4: hello - 3, goodbye - 0
line 5: hello - 0, goodbye - 0
line 6: hello - 1, goodbye - 1
line 7: hello - 0, goodbye - 0
total: hello - 6, goodbye - 2
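
The same idiom can also be written as a small script. This is only a minimal sketch: the helper name count_occurrences and the sample lines are made up for illustration, and the \b word boundaries are an added assumption so that "hello" does not match inside a longer word such as "helloworld". Drop the \b if substring matches are what you want.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: number of whole-word occurrences of $word in $line.
sub count_occurrences {
    my ($line, $word) = @_;
    # The match in list context returns every hit; assigning that list to ()
    # and evaluating the assignment in scalar context yields the hit count.
    my $count = () = $line =~ /\b\Q$word\E\b/g;
    return $count;
}

# Made-up sample lines, used here instead of reading a file.
my @lines = ('hello world', 'hello hello goodbye', 'nothing here');
my %total = (hello => 0, goodbye => 0);

for my $line (@lines) {
    $total{$_} += count_occurrences($line, $_) for keys %total;
}

print "total: hello - $total{hello}, goodbye - $total{goodbye}\n";
```

For these three sample lines it prints "total: hello - 3, goodbye - 1". To process a real file, the for loop over @lines could be replaced by while (my $line = <>).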

Perl has a binding operator, =~, that tests whether a string matches a pattern. You can combine it with two if statements to count the lines that contain each word (a line with several hits is counted only once):

# counts the lines that contain each word, not the individual occurrences
my ($hello_cnt, $goodbye_cnt) = (0, 0);
while (my $line = <STDIN>) {
   $hello_cnt++   if $line =~ /hello/;
   $goodbye_cnt++ if $line =~ /goodbye/;
}

But it seems like you want to reason about your input line by line. You could maintain all of those variables ($unique_hello, $unique_goodbye, etc.), but that seems like a lot of extra work to me. Instead, you can use a hash to total the counts of every word:

my %seen;
while (my $line = <STDIN>) {
   chomp $line;                            # remove trailing \n

   # split ' ' splits on whitespace and skips any leading whitespace
   $seen{lc $_}++ for split ' ', $line;    # lowercase each word and count it
}

Now you have a hash of this structure:

{ 
  word1 => cnt1,
  word2 => cnt2,
  etc ...
}

Now you can just print the totals:

print "Hello seen " . ($seen{hello} // 0) . " times\n";
# etc ...
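
One caveat with splitting on whitespace is that punctuation stays attached, so "hello," and "hello" end up under different keys. A possible variation, sketched here under the assumption that a "word" is a run of \w characters, pulls the words out with a global match instead. The helper name tally_words and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: tally every word in the given lines, case-insensitively.
sub tally_words {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        # /\w+/g in list context returns each run of word characters,
        # so "Hello," and "hello" both land in the key "hello".
        $seen{$_}++ for lc($line) =~ /\w+/g;
    }
    return \%seen;
}

my $seen = tally_words('Hello, world!', 'goodbye... hello');
printf "hello seen %d times\n",   $seen->{hello}   // 0;
printf "goodbye seen %d times\n", $seen->{goodbye} // 0;
```

With the two sample strings this reports hello seen 2 times and goodbye seen 1 time. The // 0 fallback (Perl 5.10 or later) avoids an "uninitialized value" warning for a word that never appeared.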

I left the line-by-line analysis for you to do; hopefully this is a good starting point.
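
For what it's worth, here is one way the line-by-line analysis might look. It is a sketch, not the only way to do it: the classify helper is hypothetical, the category names mirror the variables in the question, 'neither' is an added catch-all, and the counts use plain substring matching as in the regex answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: map a pair of per-line counts to one category name.
# The names mirror the variables in the question; 'neither' is an added
# catch-all for lines that contain neither word.
sub classify {
    my ($hello, $goodbye) = @_;
    return 'neither'                    if !$hello && !$goodbye;
    return 'unique_hello'               if !$goodbye;
    return 'unique_goodbye'             if !$hello;
    return 'single_both'                if $hello == 1 && $goodbye == 1;
    return 'one_hello_multiple_goodbye' if $hello == 1;
    return 'one_goodbye_multiple_hello' if $goodbye == 1;
    return 'multiple_both';
}

my %lines_in;   # category name => number of lines in that category

# Made-up sample lines; a real script would loop with: while (my $line = <>)
for my $line ('hello', 'hello goodbye', 'goodbye hello goodbye', 'bye') {
    my $hello   = () = $line =~ /hello/g;
    my $goodbye = () = $line =~ /goodbye/g;
    $lines_in{ classify($hello, $goodbye) }++;
}

print "$_: $lines_in{$_}\n" for sort keys %lines_in;
```

Keeping the categories as hash keys means every line falls into exactly one bucket, and adding a new condition only takes one more return line in classify rather than another counter variable.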
