Use Perl to count occurrences of all words in a file or in all files in a directory

I am trying to write a Perl script that takes three arguments.

  1. The first argument is the input file or directory.
    • If it is a file, the script counts the number of occurrences of every word in it.
    • If it is a directory, the script recursively goes through each subdirectory and counts the occurrences of every word in the files within those directories.
  2. The second argument is a number: how many of the most frequent words to display.
    • For each of those words, only the count is printed to the console.
  3. The third argument is an output file to which the words are printed.

It seems to work as far as recursively searching through directories, finding all occurrences of the words in each file, and printing them to the console.

How can I print these to an output file? Also, how would I take the second argument, the number (say 5), and print to the console the counts of that many of the most frequent words, while printing the words to the output file?

The following is what I have so far:

#!/usr/bin/perl -w

use strict;

search(shift);

my $input  = $ARGV[0];
my $output = $ARGV[1];
my %count;

my $file = shift or die "ERROR: $0 FILE\n";
open my $filename, '<', $file or die "ERROR: Could not open file!";
if ( -f $filename ) {
    print("This is a file!\n");
    while ( my $line = <$filename> ) {
        chomp $line;
        foreach my $str ( $line =~ /\w+/g ) {
            $count{$str}++;
        }
    }
    foreach my $str ( sort keys %count ) {
        printf "%-20s %s\n", $str, $count{$str};
    }
}
close($filename);
if ( -d $input ) {

    sub search {
        my $path = shift;
        my @dirs = glob("$path/*");
        foreach my $filename (@dirs) {
            if ( -f $filename ) {
                open( FILE, $filename ) or die "ERROR: Can't open file";
                while ( my $line = <FILE> ) {
                    chomp $line;
                    foreach my $str ( $line =~ /\w+/g ) {
                        $count{$str}++;
                    }
                }
                foreach my $str ( sort keys %count ) {
                    printf "%-20s %s\n", $str, $count{$str};
                }
            }
            # Recursive search
            elsif ( -d $filename ) {
                search($filename);
            }
        }
    }
}

This will total up the occurrences of words in a directory or file given on the command line:

#!/usr/bin/env perl
# wordcounter.pl
use strict;
use warnings;
use IO::All -utf8; 
binmode STDOUT, ':encoding(UTF-8)'; # you may not need this

my @allwords;
my %count;  
die "Usage: wordcounter.pl <directory|filename> number  \n" unless ~~@ARGV == 2 ;

if (-d $ARGV[0] ) {
  push @allwords, $_->slurp for io($ARGV[0])->all_files; 
}
elsif (-f $ARGV[0]) {
  @allwords = io($ARGV[0])->slurp ;
}

while (my $line = shift @allwords) { 
    foreach ( split /\s+/, $line) {
        $count{$_}++
    }
}

my $count_to_show;

for my $word (sort { $count{$b} <=> $count{$a} } keys %count) { 
 printf "%-30s %s\n", $word, $count{$word};
 last if ++$count_to_show == $ARGV[1];  
}

By modifying the sort and/or io calls you can sort by number of occurrences or alphabetically by word, either for a single file or for all files in a directory. Those options would be fairly easy to add as parameters. You can also filter or change how words are defined for inclusion in the %count hash by changing foreach ( split /\s+/, $line ) to include a match/filter, such as foreach ( grep { length($_) <= 5 } split /\s+/, $line ), in order to count only words of five or fewer letters.
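As a rough illustration of those variations, the sketch below sorts the same hash two ways and applies the length filter. The sample data in %count and the array names are made up for the example; they are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the %count hash built by the script.
my %count = ( the => 3, perl => 2, interpreter => 1, of => 2 );

# Most frequent first; ties are broken alphabetically so the order is stable.
my @by_count = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

# Plain alphabetical order.
my @by_word = sort keys %count;

# Keep only words of five or fewer characters (numeric <=, not the string operator le).
my @short = grep { length($_) <= 5 } @by_count;

print "by count: @by_count\n";
print "by word:  @by_word\n";
print "short:    @short\n";
```

The numeric comparison matters here: le compares strings, so a length of 10 would wrongly pass a "le 5" test.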

Sample run in the current directory:

   ./wordcounter ./ 10    
    the                            116
    SV                             87
    i                              66
    my_perl                        58
    of                             54
    use                            54
    int                            49
    PerlInterpreter                47
    sv                             47
    Inline                         47
    return                         46

Caveats

  • you should probably add a test for file MIME types, readability, etc.
  • pay attention to Unicode
  • to write to a file just add > filename.txt to the end of your command line ;-)
  • IO::All is not the standard CORE IO package; I am only advertising and promoting it here ;-) (you could swap that bit out)
  • If you wanted to add a sort_by option ( -n --numeric , -a --alphabetic etc. ), Sort::Maker might be one way to make that manageable.
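For the first caveat, one possible check uses Perl's built-in file test operators. This is only a sketch: is_countable is a hypothetical helper name, and -T is a heuristic guess at "looks like text", not a real MIME type test.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: true only for readable regular files that look like text.
sub is_countable {
    my ($path) = @_;
    return 0 unless -f $path;    # must be a regular file
    return 0 unless -r $path;    # must be readable by the current user
    return 0 unless -T $path;    # heuristic: contents look like text
    return 1;
}

# Usage: create a throwaway text file and test it alongside its directory.
my $dir  = tempdir( CLEANUP => 1 );
my $text = "$dir/sample.txt";
open my $fh, '>', $text or die "Can't create $text: $!";
print {$fh} "plain words here\n";
close $fh;

print is_countable($text) ? "count it\n" : "skip it\n";
print is_countable($dir)  ? "count it\n" : "skip it\n";
```

Files that fail the check could simply be skipped before their lines are read.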

EDIT: I had neglected to add the options the OP requested.

I would suggest restructuring your script. What you have posted is difficult to follow, and a few comments would help readers see what is happening. I'll go through how I would arrange things, with some code snippets to help explain, covering the three items you outlined in your question.

Since the first argument can be a file or a directory, I would use -f and -d to determine which it is. I would use an array to hold the list of files to be processed. If the input is a single file, I would just push it onto that list. Otherwise, I would call a routine that returns the list of files to process (similar to your search subroutine). Something like:

# List of files to process
my @fileList = ();
# if input is only a file
if ( -f $ARGV[0] )
{
  push @fileList,$ARGV[0];
}
# If it is a directory
elsif ( -d $ARGV[0] ) 
{
   @fileList = search($ARGV[0]);
}

So in your search subroutine, you need an array onto which to push the items that are files, and you return that array from the subroutine once you have processed the list from the glob call. When an item is a directory, you call search with its path (just as you are currently doing) and push the returned elements onto your current array, such as:

# If it is a file, save it to the list to be returned
if ( -f $filename ) 
{
  push @returnValue,$filename;
}
# else if a directory, get the files from the directory and 
# add them to the list to be returned
elsif ( -d $filename )
{
  push @returnValue, search($filename);
}
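Put together, the fragment above might sit in a subroutine like the following. This is a sketch of one way to complete it, not the only arrangement; it keeps the glob call from the question, so it assumes paths without whitespace, and the temporary directory tree exists only to demonstrate the call.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Returns every regular file found under $path, descending into subdirectories.
sub search {
    my ($path) = @_;
    my @returnValue;
    foreach my $filename ( glob("$path/*") ) {
        if ( -f $filename ) {
            # A file: save it to the list to be returned
            push @returnValue, $filename;
        }
        elsif ( -d $filename ) {
            # A directory: add the files found beneath it
            push @returnValue, search($filename);
        }
    }
    return @returnValue;
}

# Usage: build a small throwaway tree and list its files.
my $root = tempdir( CLEANUP => 1 );
mkdir "$root/sub" or die "mkdir failed: $!";
for my $name ( 'a.txt', 'sub/b.txt' ) {
    open my $fh, '>', "$root/$name" or die "Can't create $name: $!";
    print {$fh} "some words\n";
    close $fh;
}
my @fileList = search($root);
print "$_\n" for @fileList;
```

The core File::Find module is a common alternative when the tree may contain unusual file names or symbolic links.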

After you have the file list, loop through it and process each file (open it, read its lines, close it, and process the lines for words). The foreach loop you have for processing each line works correctly. However, if your words have periods, commas, or other punctuation attached, you may want to remove those characters before counting the word in a hash.
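One way to strip attached punctuation is to split each line on whitespace and trim non-word characters from both ends of every token. The helper name count_line and the sample sentence are made up for this sketch.

```perl
use strict;
use warnings;

my %counts;

# Hypothetical helper: add the words of one line to %counts.
sub count_line {
    my ($line) = @_;
    foreach my $word ( split ' ', $line ) {
        # Trim leading and trailing non-word characters such as . , " ( )
        $word =~ s/^\W+|\W+$//g;
        # Skip tokens that were punctuation only
        next if $word eq '';
        $counts{$word}++;
    }
}

count_line('The cat sat. The "cat", it seems, sat (again).');

printf "%-10s %d\n", $_, $counts{$_} for sort keys %counts;
```

Punctuation inside a word, such as the apostrophe in "don't", is left alone, and counting stays case-sensitive as in the original script.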

For the next part, you asked about determining the words with the highest counts. In that case, you want to make another hash whose keys are the counts and whose values are arrays of the words that have that count. Something like:

# Hash with key being a number and value a list of words for that number
my %totals= ();
# Temporary variable to store occurrences (counts) of the word
my $wordTotal;
# $w is the words in the counts hash
foreach my $w ( keys %counts ) 
{
  # Get the counts for the word
  $wordTotal = $counts{$w};
  # Each value of %totals is an array reference, so dereference it with @{ }
  # and push the word onto the list of words that share this count
  push @{ $totals{$wordTotal} }, $w;
}

To get the words with the highest counts, take the keys of the totals hash, sort them numerically, and reverse the list so that the highest counts come first. Since each value is an array of words, we have to count each item output in order to stop after the N highest.

# Number of items outputted
my $current = 0;
# sort the total (keys) and reverse the list so the highest values are first
# and go through the list
foreach my $t ( reverse sort { $a <=> $b} keys %totals) # Use the numeric 
                                                        # comparison in 
                                                        # the sort 
{
   # Since each value of total hash is an array of words,
   # loop through that array for the values and print out the number 
   foreach my $w ( sort @{ $totals{$t} } )
   {
     # Print the number for the count of words
     print "$t\n";
     # Increment the number output
     $current++;
     # if this is the number to be printed, we are done 
     last if ( $current == $ARGV[1] );
   }
   # if this is the number to be printed, we are done 
   last if ( $current == $ARGV[1] );
 }

As for the third part, printing to a file, it is unclear from your question what "them" is (words, counts, or both; limited to the top ones or all of the words). I will leave it to you to open a file, print the information to it, and close it.
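A minimal sketch of that last step, assuming "them" means every word with its count, most frequent first. The helper name write_counts, the sample hash, and the temporary output file are all illustrative; in the real script the file name would come from the third command-line argument.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write "word count" lines, most frequent first, to $outfile.
sub write_counts {
    my ( $outfile, $counts ) = @_;
    open my $out, '>', $outfile or die "ERROR: Can't write $outfile: $!\n";
    foreach my $w ( sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts ) {
        printf {$out} "%-20s %d\n", $w, $counts->{$w};
    }
    close $out or die "ERROR: Can't close $outfile: $!\n";
}

# Usage with made-up counts and a temporary file.
my %counts = ( the => 3, cat => 2, sat => 1 );
my ( $tmp_fh, $outfile ) = tempfile( UNLINK => 1 );
close $tmp_fh;
write_counts( $outfile, \%counts );

# Read the file back to show what was written.
open my $in, '<', $outfile or die "Can't read $outfile: $!";
my @lines = <$in>;
close $in;
print @lines;
```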

I have figured it out. The following is my solution. I'm not sure if it's the best way to do it, but it works.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Check that there are three arguments on the command line
    if (@ARGV < 3) {
       die "Usage: $0 <file|directory> <number> <output file>\n";
    }
    my ($input, $num, $output) = @ARGV;
    my %count;

    # Check if the input is a regular file
    if (-f $input) {
       count_words($input);
    }
    # Check if the input is a directory
    elsif (-d $input) {
       # Call subroutine to search the directory recursively
       search_dir($input);
    }
    else {
       die "ERROR: $input is neither a file nor a directory!\n";
    }

    my $high_count = 0;
    # Open the output file
    open my $out_fh, '>', $output or die "ERROR: Could not open $output: $!\n";
    # Sort the words by descending count; print the top $num to the console
    # and every word with its count to the output file
    foreach my $str (sort { $count{$b} <=> $count{$a} } keys %count) {
       $high_count++;
       if ($high_count <= $num) {
          printf "%-31s %s\n", $str, $count{$str};
       }
       printf $out_fh "%-31s %s\n", $str, $count{$str};
    }
    close($out_fh);
    exit;

    # Subroutine to count the occurrences of each word in one file
    sub count_words {
       my $file = shift;
       open my $fh, '<', $file or die "ERROR: Could not open $file: $!\n";
       # Go through each line
       while (my $line = <$fh>) {
          # The /g flag is needed so that every word on the line is returned
          foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
             $count{$str}++;
          }
       }
       close($fh);
    }

    # Subroutine to search through each directory recursively
    sub search_dir {
       my $path = shift;
       # Loop through the entries of this directory
       foreach my $filename (glob("$path/*")) {
          if (-f $filename) {
             count_words($filename);
          }
          elsif (-d $filename) {
             search_dir($filename);
          }
       }
    }
