Matching Values in Array with Tolerance

I'm trying to weed out duplicate values in an array, which I'm successfully accomplishing with the "List::MoreUtils uniq/distinct" function.

However, I would also like to count values that fall within a given tolerance, say +-5, as duplicates (I think tolerance is also sometimes referred to as "delta").

For example, if 588 is a value in the array and so is 589, then 589 gets the boot, because the difference falls within the tolerance of 5.

Without some nasty/costly cross-checking of arrays, is there an elegant way to do this?

EDIT: ikegami brought to my attention some ambiguity in my question and I'm having a bit of a hard time wrapping my head around the problem. However, I think I have it worked out.

[500,505,510,515,525,900]

If you try to match the values throughout the entire array, you should get:

[500,510,525,900]

It hits 505, sees it as non-unique, removes it from the array, then sees 510 as newly unique due to the absence of 505, and so on. This, I imagine, is the way I outlined my original question, but on reflection, it seems to give a useless and fairly arbitrary data set.
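As a rough sketch of this first interpretation, a value can be kept only when it is more than the tolerance away from every value kept so far. The name drop_near_duplicates is made up for illustration and is not a List::MoreUtils function.

```perl
use strict;
use warnings;

# Hypothetical helper: keep a value only if it differs by more than $tol
# from every value that has already been kept.
sub drop_near_duplicates {
    my ($tol, @values) = @_;

    my @kept;
    VALUE:
    for my $value (@values) {
        for my $seen (@kept) {
            # Too close to something we already kept: skip this value
            next VALUE if abs($value - $seen) <= $tol;
        }
        push @kept, $value;
    }
    return @kept;
}

print join(',', drop_near_duplicates(5, 500, 505, 510, 515, 525, 900)), "\n";
# prints 500,510,525,900
```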

What I really want is the following match:

[500,900]

It represents a group of numbers that are within 5 of each other, while also spotting the vast variance in the 900 value. This seems to be more useful information than the former and it appears that perreal's answer gets me close. Sorry for the confusion, and many thanks to ikegami as well as perreal for forcing my clarification.

EDIT 2: An even better match would be:

[510,900]

510, being the median of all the sequential +-5 values.

However, I recognize that now we're deviating severely from my original question, so I would be more than happy with an answer to my EDIT 1 clarification.

This is a deceptively complex problem, as the data must not only be organized into groups, but also those groups must be combined if a new data point is seen that belongs to more than one of them.

This program seems to do what you need. It keeps a list of arrays @buckets, where each element contains all values seen so far that are within TOLERANCE of one another. This list is scanned to see if each value falls within range of the maximum and minimum values already present. The indices of the groups that the value belongs to are stored in @memberof, and there will always be zero, one or two entries in this array.

All the groups specified by @memberof are removed from @buckets, combined together with the new data value, sorted, and replaced as a new group in the list.

At the end, the @buckets array is converted to a list of median values, sorted and displayed. I have used Data::Dump to show the contents of the groups before they are aggregated to their median values.

To generate your desired output 510, 900 from the list 500, 510, 525, 900, the value for TOLERANCE must be increased so that values that differ by 15 or less are combined.

use strict;
use warnings;

use constant TOLERANCE => 5;

my @data = qw/ 500 505 510 515 525 900 /;

my @buckets;

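for my $item (@data) {

  # Find every bucket whose range, widened by TOLERANCE, contains $item
  my @memberof;
  for my $i (0 .. $#buckets) {
    if ($item >= $buckets[$i][0] - TOLERANCE and $item <= $buckets[$i][-1] + TOLERANCE) {
      push @memberof, $i;
    }
  }

  # Pull those buckets out (highest index first) and merge them with $item
  my @newbucket = ($item);
  for my $i (reverse @memberof) {
    push @newbucket, @{ splice @buckets, $i, 1 };
  }

  # Store the merged group sorted, so [0] is its minimum and [-1] its maximum
  push @buckets, [ sort { $a <=> $b } @newbucket ];
}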

use Data::Dump;
dd @buckets;

@buckets = sort { $a <=> $b } map median(@$_), @buckets;
print join(', ', @buckets), "\n";

# Median of a list of numbers that is already sorted
sub median {

  my $n = @_;
  my $i = int($n / 2);

  if ($n % 2) {
    return $_[$i];
  }
  else {
    return ($_[$i-1] + $_[$i]) / 2;
  }
}

Output

([500, 505, 510, 515], [525], [900])
507.5, 525, 900
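As a rough check of two points made above (groups are merged when a new value bridges them, and a tolerance of 15 turns 500, 510, 525, 900 into 510, 900), the same loop can be wrapped in a sub. This is only a sketch: the name bucket_medians and passing the tolerance as an argument are assumptions, not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the bucket loop; returns the sorted medians
# of the final groups.
sub bucket_medians {
    my ($tol, @data) = @_;

    my @buckets;
    for my $item (@data) {
        # Indices of the buckets whose [min - tol, max + tol] range holds $item
        my @memberof = grep {
            $item >= $buckets[$_][0] - $tol and $item <= $buckets[$_][-1] + $tol
        } 0 .. $#buckets;

        # Remove them back to front so the remaining indices stay valid
        my @newbucket = ($item);
        push @newbucket, @{ splice @buckets, $_, 1 } for reverse @memberof;

        push @buckets, [ sort { $a <=> $b } @newbucket ];
    }

    return sort { $a <=> $b } map { median(@$_) } @buckets;
}

# Median of an already-sorted list
sub median {
    my $i = int(@_ / 2);
    return @_ % 2 ? $_[$i] : ($_[$i - 1] + $_[$i]) / 2;
}

print join(', ', bucket_medians(5,  500, 505, 510, 515, 525, 900)), "\n";  # 507.5, 525, 900
print join(', ', bucket_medians(15, 500, 510, 525, 900)), "\n";            # 510, 900
print join(', ', bucket_medians(5,  500, 510, 505, 900)), "\n";            # 505, 900
```

In the last call, 500 and 510 start out in separate groups and 505 falls within range of both, so the two groups are merged into one.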

Isolate the samples that form a chain where each is within the tolerance of the next, then choose one from that group.

# Expects the tolerance, followed by the values sorted in ascending order.
sub collapse {
   my $tol = shift;

   my @collapsed;
   while (@_) {
      my @group = shift(@_);
      while (@_ && $group[-1] + $tol >= $_[0]) {
         push @group, shift(@_);
      }

      push @collapsed, choose_from(@group);
   }

   return @collapsed;
}

use feature 'say';   # say is not enabled by default

say join ',', collapse(5 => 500,505,510,515,525,900);

So how do you choose? Well, you could return the average.

use List::Util qw( sum );

sub choose_from {
   return sum(@_)/@_;
}

# Outputs: 507.5,525,900

Or you could return the median.

use List::Util qw( sum );

sub choose_from {
   my $mid = int(@_ / 2);

   # Odd count: the single middle element
   return $_[$mid] if @_ % 2;

   # Even count: whichever of the two middle elements is closer to the
   # average (the lower one on a tie)
   my $avg   = sum(@_)/@_;
   my $diff0 = abs( $_[$mid - 1] - $avg );
   my $diff1 = abs( $_[$mid]     - $avg );
   return $diff0 <= $diff1 ? $_[$mid - 1] : $_[$mid];
}

# Outputs: 505,525,900
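Note that collapse relies on its input already being in ascending order, since it only ever compares the next value with the last member of the current group. A minimal sketch that sorts first and takes the selection rule as a code reference might look like this. The name collapse_by and its argument order are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant of collapse(): sorts its input and takes the
# selection rule as a code reference.
sub collapse_by {
    my ($tol, $choose, @values) = @_;
    @values = sort { $a <=> $b } @values;

    my @collapsed;
    while (@values) {
        # Start a chain, then extend it while the next value is within
        # $tol of the chain's current last member
        my @group = shift @values;
        while (@values && $group[-1] + $tol >= $values[0]) {
            push @group, shift @values;
        }
        push @collapsed, $choose->(@group);
    }
    return @collapsed;
}

my @data = (900, 515, 500, 525, 510, 505);

# Keep the smallest member of each chain
print join(',', collapse_by(5,  sub { $_[0] }, @data)), "\n";   # 500,525,900
print join(',', collapse_by(10, sub { $_[0] }, @data)), "\n";   # 500,900
```

With a tolerance of 10, 525 joins the first chain, which gives the 500,900 result asked for in the question's first edit.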
