简体   繁体   English

比较 perl 中的两个列表时找到额外的、缺失的、无效的字符串

[英]find extra, missing, invalid strings when comparing two lists in perl

List-1    List-2
one       one
two       three
three     three
four      four
five      six
six       seven
eight     eighttt
nine      nine

Looking to output期待输出

one       | one        PASS
two       | *               FAIL MISSING
three     | three      PASS
*         | three           FAIL EXTRA
four      | four       PASS
five      | *               FAIL MISSING
six       | six        PASS
*         | seven           FAIL EXTRA
eight     | eighttt         FAIL INVALID
nine      | nine       PASS

Actually the return from my current solution is a reference to the two modified lists and a reference to a "fail" list describing the failure for the index as either "no fail", "missing", "extra", or "invalid" which is also (obviously) fine output.实际上,我当前解决方案的返回是对两个修改后的列表的引用和对“失败”列表的引用,该列表将索引的失败描述为“无失败”、“缺失”、“额外”或“无效”,其中也是(显然)很好的输出。

My current solution is:我目前的解决方案是:

sub compare {
    local $thisfound = shift;
    local $thatfound = shift;
    local @thisorig = @{ $thisfound };
    local @thatorig = @{ $thatfound };
    local $best = 9999; 

    foreach $n (1..6) {
        local $diff = 0;
        local @thisfound = @thisorig;
        local @thatfound = @thatorig;
        local @fail = ();
        for (local $i=0;$i<scalar(@thisfound) || $i<scalar(@thatfound);$i++) {
            if($thisfound[$i] eq $thatfound[$i]) { 
                $fail[$i] = 'NO_FAIL';
                next;
            }
            if($n == 1) {      # 1 2 3
                next unless __compare_missing__();
                next unless __compare_extra__();
                next unless __compare_invalid__();
            } elsif($n == 2) { # 1 3 2
                next unless __compare_missing__();
                next unless __compare_invalid__();
                next unless __compare_extra__();
            } elsif($n == 3) { # 2 1 3
                next unless __compare_extra__();
                next unless __compare_missing__();
                next unless __compare_invalid__();
            } elsif($n == 4) { # 2 3 1
                next unless __compare_extra__();
                next unless __compare_invalid__();
                next unless __compare_missing__();
            } elsif($n == 5) { # 3 1 2
                next unless __compare_invalid__();
                next unless __compare_missing__();
                next unless __compare_extra__();
            } elsif($n == 6) { # 3 2 1
                next unless __compare_invalid__();
                next unless __compare_extra__();
                next unless __compare_missing__();
            }
            push @fail,'INVALID'; 
            $diff += 1;
        }
        if ($diff<$best) {
            $best = $diff;
            @thisbest = @thisfound;
            @thatbest = @thatfound;
            @failbest = @fail;
        }
    }
    return (\@thisbest,\@thatbest,\@failbest)
}

sub __compare_missing__ {
    my $j;
    ### Does that command match a later this command? ###
    ### If so most likely a MISSING command           ###
    for($j=$i+1;$j<scalar(@thisfound);$j++) {
        if($thisfound[$j] eq $thatfound[$i]) {
            $diff += $j-$i;
            for ($i..$j-1) { push(@fail,'MISSING'); }
            @end = @thatfound[$i..$#thatfound];
            @thatfound = @thatfound[0..$i-1];
            for ($i..$j-1) { push(@thatfound,'*'); }
            push(@thatfound,@end);
            $i=$j-1;
            last;
        }
    }
    $j == scalar(@thisfound);
}

sub __compare_extra__ {
    my $j;
    ### Does this command match a later that command? ###
    ### If so, most likely an EXTRA command           ###
    for($j=$i+1;$j<scalar(@thatfound);$j++) {
        if($thatfound[$j] eq $thisfound[$i]) { 
            $diff += $j-$i;
            for ($i..$j-1) { push(@fail,'EXTRA'); }
            @end = @thisfound[$i..$#thisfound];
            @thisfound = @thisfound[0..$i-1];
            for ($i..$j-1) { push (@thisfound,'*'); }
            push(@thisfound,@end);
            $i=$j-1;
            last; 
        }
    }
    $j == scalar(@thatfound);
}

sub __compare_invalid__ {
    my $j;
    ### Do later commands match?                      ###
    ### If so most likely an INVALID command          ###
    for($j=$i+1;$j<scalar(@thisfound);$j++) {
        if($thisfound[$j] eq $thatfound[$j]) { 
            $diff += $j-$i;
            for ($i..$j-1) { push(@fail,'INVALID'); }
            $i=$j-1;
            last;
        }
    }
    $j == scalar(@thisfound);
}

But this isn't perfect ... who wants to simplify and improve?但这并不完美……谁想简化和改进? Specifically ... within a single data set, one order of searching is better for a subset and another order is better for a different subset.具体来说......在单个数据集中,一个搜索顺序对一个子集更好,另一个顺序对另一个子集更好。

If the arrays contain duplicate values, the answer is quite a bit more complicated than that.如果数组包含重复值,则答案要复杂得多。

See eg Algorithm::Diff or read about Levenshtein distance .参见例如Algorithm::Diff或阅读Levenshtein distance

The trick in Perl (and similar languages) is the hash, which doesn't care about order. Perl(和类似语言)中的技巧是散列,它不关心顺序。

Suppose that the first array is the one that hold the valid elements.假设第一个数组是保存有效元素的数组。 Construct a hash with those values as keys:用这些值作为键构造一个散列:

  my @valid = qw( one two ... );
  my %valid = map { $_, 1 } @valid;

Now, to find the invalid elements, you just have to find the ones not in the %valid hash:现在,要找到无效元素,您只需找到不在%valid哈希中的元素:

  my @invalid = grep { ! exists $valid{$_} } @array;

If you want to know the array indices of the invalid elements:如果您想知道无效元素的数组索引:

  my @invalid_indices = grep { ! exists $valid{$_} } 0 .. $#array;

Now, you can expand that to find the repeated elements too.现在,您也可以展开它以查找重复的元素。 Not only do you check the %valid hash, but also keep track of what you have already seen:您不仅要检查%valid哈希,还要跟踪您已经看到的内容:

 my %Seen;
 my @invalid_indices = grep { ! exists $valid{$_} && ! $Seen{$_}++ } 0 .. $#array;

The repeated valid elements are the ones with a value in %Seen that is greater than 1:重复的有效元素是%Seen中值大于 1 的元素:

 my @repeated_valid = grep { $Seen{$_} > 1 } @valid;

To find the missing elements, you look in %Seen to check what isn't in there.要找到丢失的元素,您可以查看%Seen以检查其中没有的元素。

 my @missing = grep { ! $Seen{$_ } } @valid;

From perlfaq4 's answer to How can I tell whether a certain element is contained in a list or array?perlfaq4如何判断某个元素是否包含在列表或数组中的回答 :


(portions of this answer contributed by Anno Siegel and brian d foy) (此答案的部分内容由 Anno Siegel 和 brian d foy 提供)

Hearing the word "in" is an indication that you probably should have used a hash, not a list or array, to store your data.听到“in”一词表明您可能应该使用散列而不是列表或数组来存储数据。 Hashes are designed to answer this question quickly and efficiently.哈希旨在快速有效地回答这个问题。 Arrays aren't.数组不是。

That being said, there are several ways to approach this.话虽如此,有几种方法可以解决这个问题。 In Perl 5.10 and later, you can use the smart match operator to check that an item is contained in an array or a hash:在 Perl 5.10 及更高版本中,您可以使用智能匹配运算符来检查项目是否包含在数组或散列中:

use 5.010;

if( $item ~~ @array )
    {
    say "The array contains $item"
    }

if( $item ~~ %hash )
    {
    say "The hash contains $item"
    }

With earlier versions of Perl, you have to do a bit more work.对于早期版本的 Perl,您必须做更多的工作。 If you are going to make this query many times over arbitrary string values, the fastest way is probably to invert the original array and maintain a hash whose keys are the first array's values:如果要对任意字符串值多次进行此查询,最快的方法可能是反转原始数组并维护一个哈希,其键是第一个数组的值:

@blues = qw/azure cerulean teal turquoise lapis-lazuli/;
%is_blue = ();
for (@blues) { $is_blue{$_} = 1 }

Now you can check whether $is_blue{$some_color}.现在您可以检查是否 $is_blue{$some_color}。 It might have been a good idea to keep the blues all in a hash in the first place.首先将蓝调保持在散列中可能是个好主意。

If the values are all small integers, you could use a simple indexed array.如果值都是小整数,则可以使用简单的索引数组。 This kind of an array will take up less space:这种数组将占用更少的空间:

@primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);
@is_tiny_prime = ();
for (@primes) { $is_tiny_prime[$_] = 1 }
# or simply  @istiny_prime[@primes] = (1) x @primes;

Now you check whether $is_tiny_prime[$some_number].现在检查是否 $is_tiny_prime[$some_number]。

If the values in question are integers instead of strings, you can save quite a lot of space by using bit strings instead:如果有问题的值是整数而不是字符串,则可以通过使用位字符串来节省大量空间:

@articles = ( 1..10, 150..2000, 2017 );
undef $read;
for (@articles) { vec($read,$_,1) = 1 }

Now check whether vec($read,$n,1) is true for some $n.现在检查 vec($read,$n,1) 对于某些 $n 是否为真。

These methods guarantee fast individual tests but require a re-organization of the original list or array.这些方法保证了快速的单个测试,但需要重新组织原始列表或数组。 They only pay off if you have to test multiple values against the same array.只有当您必须针对同一个数组测试多个值时,它们才会得到回报。

If you are testing only once, the standard module List::Util exports the function first for this purpose.如果您只测试一次,则标准模块 List::Util 为此首先导出函数。 It works by stopping once it finds the element.它的工作原理是在找到元素后停止。 It's written in C for speed, and its Perl equivalent looks like this subroutine:它是用 C 编写的以提高速度,它的 Perl 等效项如下所示:

sub first (&@) {
    my $code = shift;
    foreach (@_) {
        return $_ if &{$code}();
    }
    undef;
}

If speed is of little concern, the common idiom uses grep in scalar context (which returns the number of items that passed its condition) to traverse the entire list.如果速度无关紧要,常见的习惯用法是在标量上下文中使用 grep(它返回通过其条件的项目数)来遍历整个列表。 This does have the benefit of telling you how many matches it found, though.不过,这确实有利于告诉您它找到了多少匹配项。

my $is_there = grep $_ eq $whatever, @array;

If you want to actually extract the matching elements, simply use grep in list context.如果您想实际提取匹配的元素,只需在列表上下文中使用 grep 即可。

my @matches = grep $_ eq $whatever, @array;
sub compare {
    local @d = ();

    my $this = shift;
    my $that = shift;
    my $distance = _levenshteindistance($this, $that);

    my @thisorig = @{ $this };
    my @thatorig = @{ $that };

    my $s = $#thisorig;
    my $t = $#thatorig;

    @this = ();
    @that = ();
    @fail = ();

    while($s>0 || $t>0) {
        #                  deletion,    insertion,   substitution
        my $min = _minimum($d[$s-1][$t],$d[$s][$t-1],$d[$s-1][$t-1]);
        if($min == $d[$s-1][$t-1]) {
            unshift(@this,$thisorig[$s]);
            unshift(@that,$thatorig[$t]);
            if($d[$s][$t] > $d[$s-1][$t-1]) {
                unshift(@fail,'INVALID');
            } else {
                unshift(@fail,'NO_FAIL');
            }
            $s -= 1;
            $t -= 1;
        } elsif($min == $d[$s][$t-1]) {
            unshift(@this,'*');
            unshift(@that,$thatorig[$t]);
            unshift(@fail,'EXTRA');
            $t -= 1;
        } elsif($min == $d[$s-1][$t]) {
            unshift(@this,$thisorig[$s]);
            unshift(@that,'*');
            unshift(@fail,'MISSING');
            $s -= 1;
        } else {
            die("Error! $!");
        }
    }

    return(\@this, \@that, \@fail);

}

sub _minimum {
    my $ret = 2**53;
    foreach $in (@_) {
        $ret = $ret < $in ? $ret : $in;
    }
    $ret;
}

sub _levenshteindistance {
    my $s = shift;
    my $t = shift;
    my @s = @{ $s };
    my @t = @{ $t };

    for(my $i=0;$i<scalar(@s);$i++) {
        $d[$i] = ();
    }

    for(my $i=0;$i<scalar(@s);$i++) {
        $d[$i][0] = $i # deletion
    }
    for(my $j=0;$j<scalar(@t);$j++) {
        $d[0][$j] = $j # insertion
    }

    for(my $j=1;$j<scalar(@t);$j++) {
        for(my $i=1;$i<scalar(@s);$i++) {
            if ($s[$i] eq $t[$j]) {
                $d[$i][$j] = $d[$i-1][$j-1];
            } else {
                #                    deletion,      insertion,     substitution
                $d[$i][$j] = _minimum($d[$i-1][$j]+1,$d[$i][$j-1]+1,$d[$i-1][$j-1]+1);
            }
        }
    }

    foreach $a (@d) {
        @a = @{ $a };
        foreach $b (@a) {
            printf STDERR "%2d ",$b;
        }
        print STDERR "\n";
    }

    return $d[$#s][$#t];
}

From perlfaq4 's answer to How do I compute the difference of two arrays?perlfaq4如何计算两个数组的差异的回答 How do I compute the intersection of two arrays?如何计算两个数组的交集? :


Use a hash.使用哈希。 Here's code to do both and more.这是可以同时执行以上操作的代码。 It assumes that each element is unique in a given array:它假设每个元素在给定数组中都是唯一的:

@union = @intersection = @difference = ();
%count = ();
foreach $element (@array1, @array2) { $count{$element}++ }
foreach $element (keys %count) {
    push @union, $element;
    push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
    }

Note that this is the symmetric difference, that is, all elements in either A or in B but not in both.请注意,这是对称差异,即 A 或 B 中的所有元素但不是两者中的所有元素。 Think of it as an xor operation.将其视为异或运算。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM