Perl - Count occurrence of specific words for each line of file

Did a lot of searching, found nothing that was quite what I wanted. Perl noob here.

I have a text file already neatly organised into lines of data. Say the two strings I'm interested in are "hello" and "goodbye". I want to write a quick Perl script that will look at the first line and count how many times "hello" and "goodbye" occur. Then it will go to the next line and do the counts, adding to the earlier counts. So by the end of the script I can print the total number of counts for each string in the file.

The line-by-line approach is important because I want to keep several counters, so I can print the number of lines where both words appear, the number of lines that contain just one of the words and not the other, the number of lines that contain "hello" once but "goodbye" multiple times, etc. Really it's about the number of times each condition is found on a line, rather than how many times the words appear in the whole document.

So far I'm thinking:

#!/usr/bin/perl
use strict; use warnings;

# die etc (saving time by not including it here)

my $word_a = "hello";
my $word_b = "goodbye";
my $single_both = 0; # Number of lines where both words appear only once.
my $unique_hello = 0; # Number of lines where only hello appears, goodbye doesn't.
my $unique_goodbye = 0; # Number of lines where goodbye appears, hello doesn't.
my $one_hello_multiple_goodbye = 0; # Number of lines where hello appears once and goodbye appears multiple times.
my $one_goodbye_multiple_hello = 0; # Number of lines where goodbye appears once and hello appears multiple times.
my $multiple_both = 0; # Number of lines where goodbye and hello both appear multiple times.

while (my $line = <>) {

    # Magic happens here

}

# then the results for each of those variables can be printed at the end.

As I said, I'm a noob. I'm confused about how to even count the occurrences in each line. If I knew that, I'm sure I could figure out all the different conditions I've listed above. Should I be using arrays? Hashes? Or have I approached this from entirely the wrong direction, considering what I want? I need to count the number of lines that meet the different conditions I've listed as comments after those variables. Any help at all is greatly appreciated!

You can count the occurrences of a word with a regex, e.g. $hello = () = $line =~ /hello/g; counts the occurrences of hello in $line. How does it work? The /g match returns every match in list context, and assigning that list through () to a scalar yields the number of matches.
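The counting idiom can also be wrapped in a small helper. The following is only a sketch: the name count_matches is hypothetical, and the \Q...\E quoting is an addition not found in the answer, there so that a word containing regex metacharacters is matched literally.

```perl
use strict;
use warnings;

# count_matches is a hypothetical helper name, not part of the original answer.
# It returns how many times $word occurs in $line (as a substring).
sub count_matches {
    my ($line, $word) = @_;
    # \Q...\E quotes any regex metacharacters in $word.
    # With /g in list context the match returns every match; assigning that
    # list to () and then to a scalar yields the number of matches.
    my $count = () = $line =~ /\Q$word\E/g;
    return $count;
}

print count_matches("hello world, hello again", "hello"), "\n";   # prints 2
print count_matches("nothing to see here", "hello"), "\n";        # prints 0
```

Note that this counts substrings, so "othello" contains one "hello". If only whole words should count, wrapping the pattern in \b anchors (/\b\Q$word\E\b/g) is one option.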

perl -n -E '$hello = () = /hello/g; $goodbye = () = /goodbye/g; say "line $.: hello - $hello, goodbye - $goodbye"; $hello_total += $hello; $goodbye_total += $goodbye;}{say "total: hello - $hello_total, goodbye - $goodbye_total";' input.txt

Output for some file:

line 1: hello - 0, goodbye - 0
line 2: hello - 1, goodbye - 0
line 3: hello - 1, goodbye - 1
line 4: hello - 3, goodbye - 0
line 5: hello - 0, goodbye - 0
line 6: hello - 1, goodbye - 1
line 7: hello - 0, goodbye - 0
total: hello - 6, goodbye - 2
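With the two per-line counts in hand, the conditions listed in the question can be tallied by classifying each line. This is a sketch under assumptions: classify_line is a hypothetical helper, the category names are borrowed from the question's variable names, and the extra 'neither' category and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# classify_line is a hypothetical helper; the category names follow the
# counter variables in the question, plus 'neither' for lines with no match.
sub classify_line {
    my ($line) = @_;
    my $hello   = () = $line =~ /hello/g;
    my $goodbye = () = $line =~ /goodbye/g;

    return 'neither'                    if $hello == 0 && $goodbye == 0;
    return 'unique_hello'               if $goodbye == 0;
    return 'unique_goodbye'             if $hello == 0;
    return 'single_both'                if $hello == 1 && $goodbye == 1;
    return 'one_hello_multiple_goodbye' if $hello == 1;
    return 'one_goodbye_multiple_hello' if $goodbye == 1;
    return 'multiple_both';
}

# Made-up sample lines standing in for the input file.
my @sample = (
    "hello there",
    "goodbye now",
    "hello and goodbye",
    "hello goodbye goodbye",
    "hello hello goodbye",
    "hello hello goodbye goodbye",
    "nothing relevant",
);

my %tally;    # category name => number of lines in that category
$tally{ classify_line($_) }++ for @sample;
print "$_: $tally{$_}\n" for sort keys %tally;
```

To run it against a real file, the loop over @sample could be replaced with while (my $line = <>) { $tally{ classify_line($line) }++ } and the script called with the file name as its argument.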

Perl has a binding operator, =~, that tests whether a string matches a pattern. You can use it with two if statements to count the lines that contain each word:

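# counts the lines that contain each word (at most one per line, not every occurrence)
my ($hello_cnt, $goodbye_cnt) = (0, 0);
while (my $line = <STDIN>) {
   $hello_cnt++   if $line =~ /hello/;
   $goodbye_cnt++ if $line =~ /goodbye/;
}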

But it seems you want to reason about your input line by line. You could maintain all of those variables ($unique_hello, $unique_goodbye, etc.), but that seems like a lot of extra work to me. What you can do instead is use a hash to total the counts:

my %seen;
while (my $line = <STDIN>) {
   chomp $line;                   # remove trailing \n

   # split on whitespace and count each lower-cased word
   $seen{lc $_}++ for split /\s+/, $line;
}

Now you have a hash of this structure:

{ 
  word1 => cnt1,
  word2 => cnt2,
  etc ...
}

Now you can just print the totals:

print "Hello seen " . ($seen{hello} // 0) . " times\n";   # // 0 avoids a warning if hello never appeared
# etc ...

I left the line-by-line analysis for you to do; hopefully this is a good starting point.
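One possible way to approach that line-by-line analysis with the same hash technique is to build a fresh hash for each line and then tally each distinct pair of counts. This is a sketch, not the answerer's method: word_counts and %pairs are hypothetical names, the sample lines are made up, and the // operator needs Perl 5.10 or later. Because the line is split on whitespace, a token such as "hello," (with punctuation attached) is not counted as hello.

```perl
use strict;
use warnings;

# word_counts is a hypothetical helper: it returns a hash reference mapping
# each lower-cased, whitespace-separated word on one line to its count.
sub word_counts {
    my ($line) = @_;
    chomp $line;
    my %seen;
    # split ' ' splits on runs of whitespace and skips leading whitespace.
    $seen{lc $_}++ for split ' ', $line;
    return \%seen;
}

# Key "h,g" => number of lines with h occurrences of hello and g of goodbye.
my %pairs;
for my $line ("Hello hello goodbye\n", "goodbye\n", "hello hello goodbye\n") {
    my $seen = word_counts($line);
    # Words that never appeared on this line are undef in the hash, hence // 0.
    my $key = join ',', $seen->{hello} // 0, $seen->{goodbye} // 0;
    $pairs{$key}++;
}

print "$_ => $pairs{$_}\n" for sort keys %pairs;
```

Every condition from the question can then be read off %pairs afterwards, for example by summing the entries whose key has a hello count of 1 and a goodbye count above 1.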
