Use Perl to count occurrences of all words in a file or in all files in a directory

Question

So I am trying to write a Perl script which will take in 3 arguments.

First argument is the input file or directory.
- If it is a file, it will count number of occurrences of all words
- If it is a directory, it will recursively go through each directory and get all the number of occurrences for all words in the files within those directories
Second argument is a number that will be how many of the words to display with the highest number of occurrences.
- This will print to the console only the number for each word
Print them to an output file which is the third argument in the command line.

It seems to be working as far as recursively searching through directories and finding all occurrences of the words in a file and prints them to the console.

How can I print these to an output file and also, how would I take the second argument, which is the number, say 5, and have it print to the console the number of words with the most occurrences while printing the words to the output file?

The following is what I have so far:

#!/usr/bin/perl -w

use strict;

search(shift);

my $input  = $ARGV[0];
my $output = $ARGV[1];
my %count;

my $file = shift or die "ERROR: $0 FILE\n";
open my $filename, '<', $file or die "ERROR: Could not open file!";
if ( -f $filename ) {
    print("This is a file!\n");
    while ( my $line = <$filename> ) {
        chomp $line;
        foreach my $str ( $line =~ /\w+/g ) {
            $count{$str}++;
        }
    }
    foreach my $str ( sort keys %count ) {
        printf "%-20s %s\n", $str, $count{$str};
    }
}
close($filename);
if ( -d $input ) {

    sub search {
        my $path = shift;
        my @dirs = glob("$path/*");
        foreach my $filename (@dirs) {
            if ( -f $filename ) {
                open( FILE, $filename ) or die "ERROR: Can't open file";
                while ( my $line = <FILE> ) {
                    chomp $line;
                    foreach my $str ( $line =~ /\w+/g ) {
                        $count{$str}++;
                    }
                }
                foreach my $str ( sort keys %count ) {
                    printf "%-20s %s\n", $str, $count{$str};
                }
            }
            # Recursive search
            elsif ( -d $filename ) {
                search($filename);
            }
        }
    }
}

Answer 1

This will total up the occurrences of words in a directory or file given on the command line:

#!/usr/bin/env perl
# wordcounter.pl
use strict;
use warnings;
use IO::All -utf8; 
binmode STDOUT, 'encoding(utf8)'; # you may not need this

my @allwords;
my %count;  
die "Usage: wordcounter.pl <directory|filename> number  \n" unless ~~@ARGV == 2 ;

if (-d $ARGV[0] ) {
  push @allwords, $_->slurp for io($ARGV[0])->all_files(0); # 0 = descend into all subdirectories
}
elsif (-f $ARGV[0]) {
  @allwords = io($ARGV[0])->slurp ;
}

while (my $line = shift @allwords) { 
    foreach ( split /\s+/, $line) {
        $count{$_}++
    }
}

my $count_to_show;

for my $word (sort { $count{$b} <=> $count{$a} } keys %count) { 
 printf "%-30s %s\n", $word, $count{$word};
 last if ++$count_to_show == $ARGV[1];  
}

By modifying the sort and/or io calls you can sort by number of occurrences or alphabetically by word, for a single file or for all files in a directory. Those options would be fairly easy to add as parameters. You can also filter or change how words are defined for inclusion in the %count hash by changing foreach ( split /\s+/, $line) to include a match/filter, such as foreach ( grep { length($_) <= 5 } split /\s+/, $line), in order to count only words of five or fewer letters.
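As a rough sketch of the two variations described above (the sample lines are made up for illustration, and split ' ' is used here instead of /\s+/ so that leading whitespace does not produce an empty word):

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines slurped from files
my @lines = ( "the cat sat on the mat", "the elephant sat" );

my %count;
for my $line (@lines) {
    # Keep only words of five or fewer characters
    $count{$_}++ for grep { length($_) <= 5 } split ' ', $line;
}

# Most frequent first; ties are broken alphabetically so the order is stable
my @by_count = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

# Plain alphabetical order
my @by_word = sort keys %count;

printf "%-30s %s\n", $_, $count{$_} for @by_count;
```

Note the numeric <= in the filter: a string comparison such as le would treat a length of 10 as smaller than 5.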

Sample run in current directory:

   ./wordcounter ./ 10    
    the                            116
    SV                             87
    i                              66
    my_perl                        58
    of                             54
    use                            54
    int                            49
    PerlInterpreter                47
    sv                             47
    Inline                         47
    return                         46

Caveats

- You should probably add a test for file MIME types, readability, etc.
- Pay attention to Unicode.
- To write to a file, just add > filename.txt to the end of your command line ;-)
- IO::All is not a standard core IO module; I am only advertising and promoting it here ;-) (you could swap that bit out).
- If you wanted to add a sort_by option (-n --numeric, -a --alphabetic, etc.), Sort::Maker might be one way to make that manageable.
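For the first two caveats, one possible approach using only core Perl is sketched below. The helper names is_countable_file and open_utf8 are hypothetical, and assuming UTF-8 input is just that, an assumption; -T is only a heuristic guess at whether a file is text.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for readable regular files that look like text
sub is_countable_file {
    my ($path) = @_;
    my $ok = -f $path && -r $path && -T $path;
    return $ok ? 1 : 0;
}

# Hypothetical helper: open a file for reading, decoding it as UTF-8 (assumed)
sub open_utf8 {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "ERROR: Could not open $path: $!\n";
    return $fh;
}
```

A caller could then skip anything for which is_countable_file returns 0 before slurping or reading it.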

EDIT: I had neglected to add the options the OP requested.

Answer 2

I would suggest restructuring your script. What you have posted is difficult to follow, and a few comments would help the reader see what is happening. I'll go through how I would arrange things, with some code snippets to help explain, taking the three items you outlined in your question in turn.

Since the first argument can be a file or a directory, I would use -f and -d to determine which it is. I would use an array to hold the list of files to be processed. If the input is a single file, I would just push it onto the processing list. Otherwise, I would call a routine that returns the list of files to be processed (similar to your search subroutine). Something like:

# List of files to process
my @fileList = ();
# if input is only a file
if ( -f $ARGV[0] )
{
  push @fileList,$ARGV[0];
}
# If it is a directory
elsif ( -d $ARGV[0] ) 
{
   @fileList = search($ARGV[0]);
}

So in your search subroutine, you need a list/array onto which to push items which are files and then return the array from the subroutine (after you have processed the list of files from the glob call). When you have a directory, you call search with the path (just as you are currently doing), pushing the elements on your current array, such as

# If it is a file, save it to the list to be returned
if ( -f $filename ) 
{
  push @returnValue,$filename;
}
# else if a directory, get the files from the directory and 
# add them to the list to be returned
elsif ( -d $filename )
{
  push @returnValue, search($filename);
}
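Putting those pieces together, a complete version of the search subroutine might look like the sketch below. It keeps the glob approach from the question, so it shares its limits: dot-files are skipped, paths containing spaces or wildcard characters may not expand as expected, and a symbolic link that points back up the tree could cause endless recursion.

```perl
use strict;
use warnings;

# Return every regular file found under $path, descending into subdirectories
sub search {
    my ($path) = @_;
    my @returnValue;
    foreach my $filename ( glob("$path/*") ) {
        # If it is a file, save it to the list to be returned
        if ( -f $filename ) {
            push @returnValue, $filename;
        }
        # If it is a directory, add the files found inside it
        elsif ( -d $filename ) {
            push @returnValue, search($filename);
        }
    }
    return @returnValue;
}
```

It would be called as my @fileList = search($ARGV[0]); and the core File::Find module is an alternative if you would rather not write the recursion yourself.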

After you have the file list, loop through it and process each file: open it, read the lines, count the words in each line, and close it. The foreach loop you have for processing each line works correctly. However, if your words have periods, commas or other punctuation, you may want to remove those items before counting the word in a hash.
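A minimal sketch of that loop follows. The name count_words_in_files is hypothetical, and lower-casing each word (so that "The" and "the" are counted together) is an assumption, not something the question asks for. The \w+ pattern from the question matches only letters, digits and underscores, so periods and commas are never captured, although a word such as "don't" would be split in two.

```perl
use strict;
use warnings;

# Hypothetical helper: count the words in every file of the list
sub count_words_in_files {
    my (@files) = @_;
    my %count;
    for my $file (@files) {
        my $fh;
        unless ( open $fh, '<', $file ) {
            warn "Skipping $file: $!\n";
            next;
        }
        while ( my $line = <$fh> ) {
            # /g returns every match on the line; lc folds case (assumption)
            $count{ lc $_ }++ for $line =~ /\w+/g;
        }
        close $fh;
    }
    return %count;
}
```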

For the next part, you asked about determining the words with the highest counts. In that case, you want to make another hash whose keys are the counts and whose values are arrays of the words that occur that many times. Something like:

# Hash with key being a count and value a list of the words with that count
my %totals = ();
# Temporary variable to store occurrences (counts) of the word
my $wordTotal;
# $w is each word in the %count hash from the question
foreach my $w ( keys %count )
{
  # Get the count for the word
  $wordTotal = $count{$w};
  # The value in the %totals hash is an array reference, so de-reference it
  # with @{ } and push the word onto that array
  push @{ $totals{$wordTotal} }, $w;  # the key to %totals is the value of the %count hash
                                      # for which the words ($w) are the keys
}

To get the words with the highest counts you need to get the keys from the total and reverse a sorted list (numerically sorted) to get the N number of highest. Since we have an array of values, we will have to count each output to get the N number of highest counts.

# Number of items output so far
my $current = 0;
# Sort the keys of %totals numerically and reverse the list so the highest
# counts come first, then go through the list
foreach my $t ( reverse sort { $a <=> $b } keys %totals )
{
   # Since each value of the %totals hash is an array of words,
   # loop through that array and print out the count for each word
   foreach my $w ( sort @{ $totals{$t} } )
   {
     # Print the number for the count of words
     print "$t\n";
     # Increment the number output
     $current++;
     # If this is the number to be printed, we are done
     last if ( $current == $ARGV[1] );
   }
   # If this is the number to be printed, we are done
   last if ( $current == $ARGV[1] );
}
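For comparison, here is a different technique from the one above: sort the words directly by their counts and take a slice of the first N, which avoids building the second hash. The name top_words is hypothetical, and breaking ties alphabetically is an assumption made so the result is predictable.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n most frequent words, highest count first
sub top_words {
    my ( $n, %count ) = @_;
    my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    # Never ask for more words than exist
    $n = @sorted if $n > @sorted;
    return @sorted[ 0 .. $n - 1 ];
}
```

For example, top_words( 2, the => 3, sat => 2, cat => 1 ) returns the list ( 'the', 'sat' ).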

As for the third part, printing to a file, it is unclear from your question what "them" is (words, counts or both; limited to the top ones or all of the words). I will leave it to you to open a file, print the information to it and close it.
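One way that step could look, assuming "them" means every word together with its count, is sketched here. The name write_counts is hypothetical, and the column layout simply mirrors the printf format used elsewhere on this page.

```perl
use strict;
use warnings;

# Hypothetical helper: write every word and its count to $output,
# most frequent first (ties broken alphabetically)
sub write_counts {
    my ( $output, $count ) = @_;
    open my $out, '>', $output
        or die "ERROR: Could not open $output: $!\n";
    for my $word ( sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count ) {
        printf {$out} "%-30s %s\n", $word, $count->{$word};
    }
    close $out or die "ERROR: Could not close $output: $!\n";
}
```

It takes the output file name and a reference to the hash of counts, for example write_counts( $ARGV[2], \%count );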

Answer 3

I have figured it out. The following is my solution. I'm not sure if it's the best way to do it, but it works.

    use strict;
    use warnings;

    # Check if there are three arguments in the commandline
    if (@ARGV < 3) {
       die "ERROR: There must be three arguments!\n";
    }
    # Arguments: input file or directory, number of words to show, output file
    my ($input, $num, $output) = @ARGV;
    my %count;

    # Check if the INPUT is a file
    if (-f $input) {
       print("This is a file!\n");
       # Open the file
       open my $fh, '<', $input or die "ERROR: Could not open file!";
       # Go through each line
       while (my $line = <$fh>) {
          chomp $line;
          # Count the occurrences of each word (/g returns every match)
          foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
             $count{$str}++;
          }
       }
       # Close the file
       close($fh);
    }
    # Check if the INPUT is a directory
    elsif (-d $input) {
       # Call subroutine to search directory recursively
       search_dir($input);
    }
    my $high_count = 0;
    # Open the output file
    open my $fileh, '>', $output or die "ERROR: Could not open file!\n";
    # Sort the words by number of occurrences, highest first
    foreach my $str (sort {$count{$b} <=> $count{$a}} keys %count) {
       $high_count++;
       # Print only the top $num words to the console
       if ($high_count <= $num) {
          printf "%-31s %s\n", $str, $count{$str};
       }
       # Print every word to the output file
       printf $fileh "%-31s %s\n", $str, $count{$str};
    }
    close($fileh);
    exit;

    # Subroutine to search through each directory recursively
    sub search_dir {
       my $path = shift;
       my @dirs = glob("$path/*");
       # Loop through filenames
       foreach my $filename (@dirs) {
          # Check if it is a file
          if (-f $filename) {
             # Open the file
             open(my $fh, '<', $filename) or die "ERROR: Can't open file";
             # Go through each line
             while (my $line = <$fh>) {
                chomp $line;
                # Count the occurrences of each word (/g returns every match)
                foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
                   $count{$str}++;
                }
             }
             # Close the file
             close($fh);
          }
          elsif (-d $filename) {
             search_dir($filename);
          }
       }
    }

3 answers

solution1
0 2014-09-26 02:07:59

solution2
0 2014-09-26 02:49:48

solution3
0 ACCEPTED 2014-09-29 01:09:08
