Count Unique Items in a Text File Using Perl

I have a text file with thousands of names listed as First Name | Last Name. Are there any examples of how to use Perl to count only the unique last names?

I am already doing a standard count using $count++ to get the grand total, but I still need a count of the unique values.

Thanks for any suggestions!

The standard approach is to use a hash (associative array) whose keys are the strings you want to count. Since a hash contains any given key at most once, the number of keys is the number of distinct strings. For example:

my @input_list = ('a', 'b', 'a', 'b', 'a');
my %result_hash;
foreach my $val (@input_list) {
    ++$result_hash{$val};
}
# %result_hash is now (a => 3, b => 2)
print scalar keys %result_hash; # prints '2' (the number of keys)

Because the keys of a hash are always unique, you can store the elements that must be unique as the keys of a hash. In your case, use a hash with last names as keys: this removes the duplicates and also counts the number of people who share each last name.

my $nameList = ['Eric|Johnson',
                'Herbert|Schildt',
                'Carl|Schildt',
                'Rose|Johnson',
                'Allen|Johnson',];
my $nameHash = {};
# capture the string after "|" and use it as the hash key;
# incrementing the value counts the people having this last name
# (entries that do not match the pattern are skipped)
for my $entry (@{$nameList}) {
    $nameHash->{$1}++ if $entry =~ /\|(\w+)/;
}
# sort the keys so the output order is predictable
print "$_: $nameHash->{$_} people\n" for sort keys %{$nameHash};

This produces output like:

Johnson: 3 people
Schildt: 2 people

All in all, reach for a hash any time you want a set. Cheers!
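When only the distinct count matters and the per-name tallies are not needed, a hash slice can fill the set in one statement. This is a minimal sketch, assuming the last names have already been collected into a list; count_unique, @last_names, and %seen are illustrative names, not part of the answer above.

```perl
use strict;
use warnings;

# Return the number of distinct strings in the argument list.
sub count_unique {
    my %seen;
    # Assigning to a hash slice creates one key per distinct string;
    # the values stay undefined because only the keys matter here.
    @seen{@_} = ();
    return scalar keys %seen;
}

# Hypothetical sample data; in practice these would come from the file.
my @last_names = qw(Johnson Schildt Schildt Johnson Johnson);
print "Unique last names: ", count_unique(@last_names), "\n";
```

The trade-off is that this form discards the per-name counts, so it suits the case where only the total number of distinct names is wanted.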

Another way of doing it, hopefully a bit more readable:

#!/usr/bin/perl
use strict;
use warnings;

my %names = ();
while (  my $name = <DATA>)
{
     chomp($name);
     my ($fname, $lname) = split(/\|/, $name);
     if (exists($names{$lname}))
     {
       $names{$lname} += 1;
     }
     else
     {
       $names{$lname} = 1;
     }

}

foreach my $name (sort { $names{$b} <=> $names{$a} } keys %names) {
  printf "%s: %s\n", $name, $names{$name};
}

print "Unique Names: " . scalar keys(%names) . "\n";

__DATA__
Rosetta|Drayer
Celinda|Blaylock
Twana|Riner
Mike|Riner
Bob|Riner
Linda|Riner
Liliana|Littlejohn
John|Littlejohn
Candance|Candanoza
Brian|Candanoza
George|Candanoza
Noreen|Frandsen
Nakisha|Feltmann
Vanetta|Feltmann
Lorretta|Feltmann
Domenic|Feltmann
Madalene|Feltmann
Rosalinda|Feltmann
Brandie|Feltmann
Nu|Feltmann
Tennille|Feltmann

Output, sorted by count in descending order:

Feltmann: 9
Riner: 4
Candanoza: 3
Littlejohn: 2
Frandsen: 1
Drayer: 1
Blaylock: 1

Unique Names: 7

This is another way, using the uniq function:

#!/usr/bin/perl
use strict;
use warnings;

use List::MoreUtils 'uniq';


my @names = ();
while (  my $name = <DATA>)
{
     chomp($name);
     my ($fname, $lname) = split(/\|/, $name);
     push(@names, $lname);
}

my @uniq = uniq @names;
print "Unique Names: " . scalar @uniq . "\n";

__DATA__
Rosetta|Drayer
Celinda|Blaylock
Twana|Riner
Mike|Riner
Bob|Riner
Linda|Riner
Liliana|Littlejohn
John|Littlejohn
Candance|Candanoza
Brian|Candanoza
George|Candanoza
Noreen|Frandsen
Nakisha|Feltmann
Vanetta|Feltmann
Lorretta|Feltmann
Domenic|Feltmann
Madalene|Feltmann
Rosalinda|Feltmann
Brandie|Feltmann
Nu|Feltmann
Tennille|Feltmann

Output:

Unique Names: 7
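If installing List::MoreUtils is not convenient, the core module List::Util also exports a uniq function, starting with version 1.45. The sketch below assumes a List::Util of at least that version and uses a short hypothetical list in place of the names read from the file.

```perl
use strict;
use warnings;
use List::Util 1.45 'uniq';   # uniq was added to List::Util in version 1.45

# Hypothetical sample data standing in for the last names read from the file.
my @names = qw(Riner Feltmann Riner Drayer Feltmann);

# uniq keeps the first occurrence of each value and preserves input order.
my @uniq = uniq @names;
print "Unique Names: " . scalar(@uniq) . "\n";
```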

Just use a hash to keep track of the values, and then count at the end:

perl -lne '
     my ($ln) = (split /\s*\|\s*/)[1];
     $h{$ln}++;
     END { print scalar keys %h }
  ' file.txt

What you want is a dictionary (a hash, in Perl). Read the lines one by one (probably in a while loop), remove the newline character, and split on the pipe character, so that you have a variable, say $lastname, which holds the field you want.

Then do the following: $count{$lastname}++.

Note that $count{$lastname} is completely unrelated to $count; it is an element of the hash %count, which is a separate variable from the scalar $count.

Once your loop is complete, you can go through each last name with foreach $lastname (keys(%count)) {... and print out $lastname and $count{$lastname}.
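The steps above can be put together roughly as follows. This is a sketch under stated assumptions: the helper name count_last_names and the file name names.txt are hypothetical, fields are taken to be separated by one pipe with optional surrounding whitespace, and blank or malformed lines are skipped. The sample input is held in memory so the script runs without a file.

```perl
use strict;
use warnings;

# Tally last names from a filehandle whose lines look like "First|Last".
# Returns a reference to a hash mapping last name => number of occurrences.
sub count_last_names {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $lastname) = split /\s*\|\s*/, $line;
        # Skip blank lines and lines without a second field.
        next unless defined $lastname && length $lastname;
        $count{$lastname}++;
    }
    return \%count;
}

# In-memory sample input (an assumption for illustration). For a real file:
#   open my $fh, '<', 'names.txt' or die "names.txt: $!";
my $sample = "Eric|Johnson\nHerbert | Schildt\n\nRose|Johnson\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $count = count_last_names($fh);
close $fh;

foreach my $lastname (sort keys %$count) {
    print "$lastname: $count->{$lastname}\n";
}
print "Unique last names: ", scalar(keys %$count), "\n";
```

Returning the whole hash rather than just a number keeps both answers available: the per-name totals and, through keys, the unique count.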
