
Compare to what percentage two label vectors are the same

Background

I have two clustering methods that I want to compare. I cluster my data objects with one method, then with the other, and label the objects for both methods. Now I would like to know to what percentage the second method labels the data objects the same way as the first method.

Problem

I have data objects with two types of labels. The labels are integers without any intrinsic meaning, other than that data objects with the same label (per label type) belong to the same group. I want to know to what percentage the two labellings are the same.

For example (pseudo-code where the == is element-wise):

>>> label1 = [1,1,1,1,2,2,2,3,3,3,3,3,4,4]
>>> label2 = [1,1,2,2,2,2,2,2,2,3,3,4,4,4]
>>> correctness = sum_of_true(label1 == label2) / 14
correctness: 9 / 14 = 0.6428571
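
A runnable version of the pseudo-code above might look like the following. It is only a minimal sketch in plain Python, and the variable name `matches` is introduced here for illustration.

```python
label1 = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4]
label2 = [1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4]

# Count the positions where both labellings carry the same integer.
matches = sum(a == b for a, b in zip(label1, label2))

# Fraction of data objects that are labelled identically.
correctness = matches / len(label1)
print(matches, correctness)  # 9 matches out of 14 objects
```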

However, the labels might not be used the same way. For example,

>>> label1 = [1,1,1,1,2,2,2,3,3,3,3,3,4,4]
>>> label2 = [2,2,2,2,1,1,1,4,4,4,4,4,3,3]

are actually labelled the same, and the correctness should be 1.0.

For that I need to rename the labels in label2 in such a way that they are as similar to label1 as possible.

Inefficient solution

An inefficient solution is to simply try every possible renaming of label2, calculate the correctness for each renaming as in the example above, and take the renaming with the best correctness. However, the number of possible renamings is the number of permutations of the labels, which grows factorially with the number of labels. This can be a really huge number and makes this approach unusable.
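
To make the cost concrete, the brute-force idea could be sketched as below. This is an illustration only: the function names `agreement` and `best_renaming_brute_force` are hypothetical, and padding with `None` for surplus labels in label2 is an assumption about how unmatched labels should be treated.

```python
from itertools import permutations


def agreement(a, b):
    """Fraction of positions where the two label lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_renaming_brute_force(label1, label2):
    """Try every one-to-one renaming of label2's labels and keep the best.

    Factorial running time, so only usable for a handful of labels.
    """
    names1 = sorted(set(label1))
    names2 = sorted(set(label2))
    # If label2 uses more labels than label1, pad the targets with None
    # placeholders (assumption: a surplus label simply matches nothing).
    targets = names1 + [None] * max(0, len(names2) - len(names1))
    best_score, best_map = -1.0, None
    for perm in permutations(targets, len(names2)):
        mapping = dict(zip(names2, perm))
        score = agreement(label1, [mapping[x] for x in label2])
        if score > best_score:
            best_score, best_map = score, mapping
    return best_score, best_map


label1 = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4]
label2 = [2, 2, 2, 2, 1, 1, 1, 4, 4, 4, 4, 4, 3, 3]
score, mapping = best_renaming_brute_force(label1, label2)
print(score)  # 1.0
```

With 4 labels this loop visits 24 renamings; with 15 labels it would already be more than 10^12, which is why the approach does not scale.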

Other solutions

I know about normalized mutual information (NMI) as a means to compare labellings, but this is not what I am looking for. The reasons are that, firstly, NMI is not linear; secondly, it is difficult to understand and communicate; and thirdly, I simply want something else ;-) - in this case, to know the number (~ percentage) of identically labelled data objects. The reason I want this has to do with the later application of the labels.

So, for example, with

>>> label1 = [1,1,1,1]
>>> label2 = [1,2,3,4]

I still want this to have a correctness of 1/4. I do not want to discuss here whether that is smart or not. In my later application this is what I need.

Allowing merging

Additionally, there is the issue that the number of labels may differ between label1 and label2. For my application it might actually be useful to be lenient towards this, allowing several labels to be merged into one on either side. For example

>>> label1 = [1,1,1,1]
>>> label2 = [1,2,3,4]

would have a correctness of 1 if it is lenient towards merging of label2, while it would be 0.5 for

>>> label1 = [1,1,2,2]
>>> label2 = [1,2,3,4]

Question

How can I calculate the correctness efficiently for

  1. No merging allowed.
  2. Merging in the first label allowed.
  3. Merging in the second label allowed.

where, surely, the solution for 2. and 3. would be the same.
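
For case 1 (no merging), one possible approach that avoids enumerating permutations is to treat the renaming as an assignment problem on the contingency table of the two labellings. The sketch below is not taken from the original post: it assumes NumPy and SciPy are available and uses `scipy.optimize.linear_sum_assignment`; the function name `matched_correctness` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def matched_correctness(label1, label2):
    """Best achievable agreement under a one-to-one renaming of label2."""
    label1 = np.asarray(label1)
    label2 = np.asarray(label2)
    names1, idx1 = np.unique(label1, return_inverse=True)
    names2, idx2 = np.unique(label2, return_inverse=True)
    # contingency[i, j] = number of objects labelled names1[i] by the first
    # method and names2[j] by the second method.
    contingency = np.zeros((len(names1), len(names2)), dtype=int)
    np.add.at(contingency, (idx1, idx2), 1)
    # Pick one-to-one label pairs that maximise the total overlap.
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / len(label1)


label1 = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4]
label2 = [2, 2, 2, 2, 1, 1, 1, 4, 4, 4, 4, 4, 3, 3]
print(matched_correctness(label1, label2))        # 1.0
print(matched_correctness([1, 1, 1, 1], [1, 2, 3, 4]))  # 0.25
```

The contingency table may be rectangular when the two labellings use a different number of labels; surplus labels are then simply left unmatched, which is an assumption about the desired behaviour rather than something stated in the question.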

Notes

  • I am using Python for the implementation.
  • Please tell me which tags to use for this question if you know. I am not sure.

There are several well-established methods to evaluate the similarity of two clustering results. They have already solved the alignment problem, which gets worse if the number of clusters varies.

You should probably just use one of them, in particular:

  1. Rand index
  2. Adjusted Rand index
  3. Jaccard
  4. Fowlkes-Mallows index
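
As an illustration, these indices could be computed in Python roughly as follows. The sketch assumes scikit-learn (version 0.24 or later) is installed. The pair-counting Jaccard index is derived here from `pair_confusion_matrix`, because scikit-learn's `jaccard_score` is meant for classification labels rather than clusterings.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    fowlkes_mallows_score,
    rand_score,
)
from sklearn.metrics.cluster import pair_confusion_matrix

label1 = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4]
label2 = [2, 2, 2, 2, 1, 1, 1, 4, 4, 4, 4, 4, 3, 3]

# All of these compare pairs of objects, so label names do not matter.
ri = rand_score(label1, label2)
ari = adjusted_rand_score(label1, label2)
fmi = fowlkes_mallows_score(label1, label2)

# Pair-counting Jaccard: pairs grouped together in both labellings,
# divided by pairs grouped together in at least one of them.
c = pair_confusion_matrix(label1, label2)
jaccard = c[1, 1] / (c[1, 1] + c[0, 1] + c[1, 0])

print(ri, ari, fmi, jaccard)  # all equal 1.0 for identical partitions
```

Note that these indices count agreeing pairs of objects, not agreeing objects, so they do not reproduce the 1/4 value asked for in the `[1,1,1,1]` versus `[1,2,3,4]` example.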
