Return Unique Element with a Tolerance

In Matlab, there is the unique command, which returns the unique rows in an array. This is a very handy command.

But the problem is that I can't assign a tolerance to it. In double precision, we always have to compare two elements to within some precision. Is there a built-in command that returns unique elements, within a certain tolerance?

With R2015a, this question finally has a simple answer (see my other answer to this question for details). For releases prior to R2015a, there is such a built-in (undocumented) function: _mergesimpts. A safe guess at the composition of the name is "merge similar points".

The function is called with the following syntax:

xMerged = builtin('_mergesimpts',x,tol,[type])

The data array x is N-by-D, where N is the number of points, and D is the number of dimensions. The tolerances for each dimension are specified by a D-element row vector, tol. The optional input argument type is a string ('first' (default) or 'average') indicating how to merge similar elements.

The output xMerged will be M-by-D, where M<=N. It is sorted.

Examples, 1D data:

>> x = [1; 1.1; 1.05];             % elements need not be sorted
>> builtin('_mergesimpts',x,eps)   % but the output is sorted
ans =
    1.0000
    1.0500
    1.1000

Merge types:

>> builtin('_mergesimpts',x,0.1,'first')
ans =
    1.0000  % first of [1, 1.05] since abs(1 - 1.05) < 0.1
    1.1000
>> builtin('_mergesimpts',x,0.1,'average')
ans =
    1.0250  % average of [1, 1.05]
    1.1000
>> builtin('_mergesimpts',x,0.2,'average')
ans =
    1.0500  % average of [1, 1.1, 1.05]

Examples, 2D data:

>> x = [1 2; 1.06 2; 1.1 2; 1.1 2.03]
x =
    1.0000    2.0000
    1.0600    2.0000
    1.1000    2.0000
    1.1000    2.0300

All 2D points unique to machine precision:

>> xMerged = builtin('_mergesimpts',x,[eps eps],'first')
xMerged =
    1.0000    2.0000
    1.0600    2.0000
    1.1000    2.0000
    1.1000    2.0300

Merge based on second dimension tolerance:

>> xMerged = builtin('_mergesimpts',x,[eps 0.1],'first')
xMerged =
    1.0000    2.0000
    1.0600    2.0000
    1.1000    2.0000   % first of rows 3 and 4
>> xMerged = builtin('_mergesimpts',x,[eps 0.1],'average')
xMerged =
    1.0000    2.0000
    1.0600    2.0000
    1.1000    2.0150   % average of rows 3 and 4

Merge based on first dimension tolerance:

>> xMerged = builtin('_mergesimpts',x,[0.2 eps],'average')
xMerged =
    1.0533    2.0000   % average of rows 1 to 3
    1.1000    2.0300
>> xMerged = builtin('_mergesimpts',x,[0.05 eps],'average')
xMerged =
    1.0000    2.0000
    1.0800    2.0000   % average of rows 2 and 3
    1.1000    2.0300   % row 4 not merged because of second dimension

Merge based on both dimensions:

>> xMerged = builtin('_mergesimpts',x,[0.05 .1],'average')
xMerged =
    1.0000    2.0000
    1.0867    2.0100   % average of rows 2 to 4

This is a difficult problem. I'd even claim it to be impossible to solve in general, because of what I'd call the transitivity problem. Suppose that we have three elements in a set, {A,B,C}. I'll define a simple function isSimilarTo, such that isSimilarTo(A,B) will return a true result if the two inputs are within a specified tolerance of each other. (Note that everything I will say here is meaningful in one dimension as well as in multiple dimensions.) So if two numbers are known to be "similar" to each other, then we will choose to group them together.

So suppose we have values {A,B,C} such that isSimilarTo(A,B) is true, and that isSimilarTo(B,C) is also true. Should we decide to group all three together, even though isSimilarTo(A,C) is false?
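To make the chain concrete, here is a minimal 1D sketch. The tolerance, the three values, and the anonymous-function form of isSimilarTo are all assumptions chosen for illustration.

```matlab
% Hypothetical helper and values, chosen only to illustrate the chain.
tol = 1;
isSimilarTo = @(a, b) abs(a - b) <= tol;

A = 0; B = 0.9; C = 1.8;
abSimilar = isSimilarTo(A, B);   % true:  |A - B| = 0.9
bcSimilar = isSimilarTo(B, C);   % true:  |B - C| = 0.9
acSimilar = isSimilarTo(A, C);   % false: |A - C| = 1.8
```

B is similar to both A and C, yet A and C are not similar to each other, so any grouping has to break one of the pairwise relations.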

Worse, move to two dimensions. Start with k points equally spaced around the perimeter of a circle. Assume the tolerance is chosen such that any point is within the specified tolerance of its immediate neighbors, but not of any other point. How would you choose to resolve which points are "unique" in that setting?
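The circle case can be checked numerically. The following is only a sketch: k = 8 and tol = 1 are assumed values, picked so that the neighbor spacing (about 0.77) is below the tolerance and the next-nearest spacing (about 1.41) is above it.

```matlab
% Illustrative values only: k and tol are assumptions, not from the text above.
k     = 8;
theta = 2*pi*(0:k-1)'/k;
P     = [cos(theta), sin(theta)];        % k points on the unit circle

% Pairwise Euclidean distances (bsxfun keeps this runnable on older releases)
dx = bsxfun(@minus, P(:,1), P(:,1)');
dy = bsxfun(@minus, P(:,2), P(:,2)');
D  = sqrt(dx.^2 + dy.^2);

tol = 1;                  % neighbor spacing is 2*sin(pi/k); next-nearest is 2*sin(2*pi/k)
S   = D <= tol;           % S(i,j) is true when points i and j are "similar"
nSimilar = sum(S, 2);     % each point matches itself and its two neighbors only
```

Every row of S has exactly three true entries, so following the "similar" relation from any point eventually reaches every other point, even though most pairs are far apart.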

I'll claim that this problem of intransitivity makes the grouping problem impossible to resolve, at least not perfectly, and certainly not in any efficient manner. Perhaps one might try an approach based on a k-means style of aggregation. But this will be quite inefficient; moreover, such an approach generally needs to know in advance the number of groups to look for.

Having said that, I would still offer a compromise, something that can sometimes work within limits. The trick is found in Consolidator, available on the Matlab Central File Exchange. My approach was to effectively round the inputs to within the specified tolerance. Having done that, a combination of unique and accumarray allows the aggregation to be done efficiently, even for large sets of data in one or many dimensions.

This is a reasonable approach when the tolerance is large enough that multiple pieces of data that belong together will be rounded to the same value, with occasional errors made by the rounding step.
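The round-then-aggregate idea might look roughly like the sketch below. This is not Consolidator's actual code; the data, the bin width, and the choice of averaging each group are assumptions made for illustration.

```matlab
% Sketch only: data, bin width and the averaging rule are assumptions.
x   = [1; 1.06; 1.1; 2.03; 2.0];   % 1D data, column vector
tol = 0.1;

binIdx = round(x / tol);                   % values in the same bin share an integer
[~, ~, grp] = unique(binIdx);              % grp(i) is the group number of x(i)
xMerged = accumarray(grp, x, [], @mean);   % one averaged value per group
```

Here 1.06 and 1.1 fall into the same bin and are averaged, while 1 ends up in a bin of its own even though it is within 0.1 of 1.06. That is the kind of occasional error the rounding step makes. For N-by-D data, the same pattern should carry over by calling unique with the 'rows' option on the rounded array.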

As of R2015a, there is finally a function to do this, uniquetol (before R2015a, see my other answer):

uniquetol Set unique within a tolerance.

uniquetol is similar to unique. Whereas unique performs exact comparisons, uniquetol performs comparisons using a tolerance.

The syntax is straightforward:

C = uniquetol(A,TOL) returns the unique values in A using tolerance TOL.

As are the semantics:

Each value of C is within tolerance of one value of A, but no two elements in C are within tolerance of each other. C is sorted in ascending order. Two values u and v are within tolerance if:
abs(u-v) <= TOL*max(A(:),[],1)

It can also operate "ByRows", and the tolerance can be scaled by an input "DataScale" rather than by the maximum value in the input data.

But there is an important note about uniqueness of the solutions:

There can be multiple valid C outputs that satisfy the condition, "no two elements in C are within tolerance of each other." For example, swapping columns in A can result in a different solution being returned, because the input is sorted lexicographically by the columns. Another result is that uniquetol(-A,TOL) may not give the same results as -uniquetol(A,TOL).

There is also a new function, ismembertol, which is related to ismember in the same way as above.
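A short usage sketch of the options mentioned above follows. The data and tolerance values are made up for illustration, and the code needs R2015a or later.

```matlab
% Requires R2015a or later. Data and tolerances are illustrative assumptions.
x = [1; 1.1; 1.05; 2];

% Default scaling: the tolerance is relative to the largest magnitude in x.
C1 = uniquetol(x, 0.01);

% Absolute tolerance: with 'DataScale' set to 1, TOL is used as-is.
C2 = uniquetol(x, 0.06, 'DataScale', 1);

% Row-wise comparison for N-by-D point data.
P  = [1 2; 1.06 2; 1.1 2; 1.1 2.03];
Cr = uniquetol(P, 0.05, 'ByRows', true, 'DataScale', 1);

% ismembertol follows the same conventions.
tf = ismembertol(1.04, x, 0.06, 'DataScale', 1);
```

With the default scaling, a tolerance of 0.01 corresponds to an absolute 0.02 for this data, so all four values in x stay distinct in C1. Which representatives end up in C2 and Cr depends on the non-uniqueness noted above, so it is safer not to rely on a particular choice.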

There is no such function that I know of. One tricky aspect is that if your tolerance is, say, 1e-10, and you have a vector with values that are equally spaced at 9e-11, the first and the third entry are not the same, but the first is the same as the second, and the second is the same as the third. So how many "uniques" are there?

One way to solve the problem is to round your values to a desired precision, and then run unique on that. You can do that using round2 ( http://www.mathworks.com/matlabcentral/fileexchange/4261-round2 ), or using the following simple way:

r = rand(100,1); % some random data
roundedData = round(r*1e6)/1e6; % round to 1e-6
uniqueValues = unique(roundedData);

You could also do it using the hist command, as long as the precision is not too high:

r = rand(100,1); % create 100 random values between 0 and 1
grid = 0:0.001:1; % creates a vector of evenly spaced values
counts = hist(r,grid); % now you know for each element in 'grid' how many values there are
uniqueValues = grid(counts>0); % and these are the uniques

I've come across this problem before. The trick is to first sort the data and then use the diff function to find the difference between each item. Then check where that difference is less than your tolerance. This is the code that I use:

tol = 0.001;
[Y, I] = sort(items(:));
uni_mask = [true; diff(Y) > tol];  % always keep the first (smallest) item
% if you just want the unique items:
uni_items = Y(uni_mask);                  % in sorted order
% or, alternatively:
uni_items = items(sort(I(uni_mask)));     % in the original order

This doesn't take care of "drifting", so something like 0:0.00001:100 would actually return one unique value.
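The drifting effect is easy to reproduce. In this sketch the data and tolerance are assumed values, and the mask is a variant that always keeps the first sorted element.

```matlab
% Illustrative data: a slow ramp whose steps are all smaller than tol.
items = (0:0.0005:1)';
tol   = 0.001;

Y    = sort(items);
mask = [true; diff(Y) > tol];   % keep an element only if it jumps by more than tol
kept = Y(mask);                 % only the first value survives
```

Although the data span the whole interval [0, 1], every consecutive gap is below the tolerance, so the whole ramp collapses to a single "unique" value.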

If you want something that can handle "drifting" then I would use histc, but you need to make some sort of rough guess as to how many items you're willing to have.

NUM = round(numel(items) / 10); % a rough guess
bins = linspace(min(items), max(items), NUM);
counts = histc(items, bins);
uni_items = bins(counts > 0);

BTW: I wrote this in a text editor away from Matlab, so there may be some stupid typos or off-by-one errors.

Hope that helps

This is hard to define well. Assume you have a tolerance of 1. Then what would be the outcome of [1; 2; 3; 4]?

When you have multiple columns, a definition could become even more challenging.

However, if you are mostly worried about rounding issues, you can solve most of it by one of these two approaches:

  1. Round all numbers (considering your tolerance), and then use unique.
  2. Start with the top row as your unique set, use ismemberf to determine whether each new row is unique and, if so, add it to your unique set.

The first approach has the weakness that 0.499999999 and 0.500000000 may not be seen as duplicates, whilst the second approach has the weakness that the order of your input matters.
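Both weaknesses can be seen in a small sketch. The values and tolerance are assumptions, and the second part uses a plain abs() comparison in place of ismemberf so that it runs without File Exchange code.

```matlab
% Weakness 1: nearly equal values can land on opposite sides of a rounding boundary.
a = 0.499999999;
b = 0.500000000;
sameAfterRounding = (round(a) == round(b));   % false: 0 versus 1

% Weakness 2: a greedy pass gives different answers for different input orders.
tol    = 1;
orders = {[0; 0.9; 1.8], [0.9; 0; 1.8]};      % same values, different order
result = cell(size(orders));
for j = 1:numel(orders)
    x    = orders{j};
    kept = x(1);                               % start with the first element
    for k = 2:numel(x)
        if all(abs(kept - x(k)) > tol)         % not within tol of anything kept so far
            kept(end+1, 1) = x(k);             %#ok<AGROW>
        end
    end
    result{j} = kept;
end
```

The first ordering keeps 0 and 1.8, while the second keeps only 0.9, because 0.9 is within tolerance of both other values.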

I was stuck the other day with MATLAB 2010, so no round(X,n) and no _mergesimpts (at least I couldn't get it to work). Here is a simple solution that works (at least for my data):

Using rat default tolerance:

unique(cellstr(rat(x)))

Other tolerance:

unique(cellstr(rat(x,tol)))
