
Calculating the “distance” between two unordered sets

Assume two sets (unordered, no duplicate elements):

A = set(["z", "x", "c"])
B = set(["x", "z", "d", "e"])

These sets have two common elements, "z" and "x", and some set-specific elements: "c", "d" and "e".

How can you give each set a score, much like a string distance, while

  • disregarding the ordering of elements and
  • imposing the no-duplicate constraint on each individual set?

As you can see in the example, the sets can differ in size.

The non-critical requirements for this algorithm are:

  • Insertion > Deletion (a set lacking an element implies a higher cost than one that has one too many) if possible, or just INS = DEL
  • Swap: 0 (no cost, since ordering has no effect on distance)

For now I have been calculating a set distance score:

from math import sqrt

common = a & b                     # set intersection
score_A = len(common) / len(a)     # fraction of a that is shared with b
score_B = len(common) / len(b)     # fraction of b that is shared with a

quadratic_score = sqrt(score_A * score_B)

How would you recommend approaching this problem or improving my solution?

Are there any algorithms that allow specification of costs?


Right now I am about to define a simple algebra for set modification:

def calculate_distance( a, b, insertion_cost=1, deletion_cost=1 ):
    """
    Virtually, a programmer-friendly set-minus.

    @return     the distance from A to B, mind that this is not
                a commutative operation.
    """
    score = 0
    for e in a:
        if e not in b: # implies deletion from A
            score += deletion_cost

    for e in b:
        if e not in a: # implies insertion into A
            score += insertion_cost

    return score

How can I normalize this value, and against what?

How about the size of the set intersection over the size of the larger set? So:

float(len(A.intersection(B)))/max(len(A),len(B))

It'll give you a number in the range 0.0 to 1.0, which is often desirable: 1.0 represents full equality and 0.0 represents nothing in common.
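As a minimal sketch of that ratio, assuming Python 3 (where `/` is true division, so the `float(...)` cast is not needed): the function name `overlap_similarity` and the treatment of two empty sets are illustrative assumptions, not part of the answer above.

```python
def overlap_similarity(a, b):
    """Intersection size divided by the size of the larger set."""
    if not a and not b:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / max(len(a), len(b))


A = {"z", "x", "c"}
B = {"x", "z", "d", "e"}
print(overlap_similarity(A, B))  # 2 shared elements / 4 in the larger set = 0.5
```

Note that this is a similarity (1.0 means equal); `1 - overlap_similarity(a, b)` turns it into a distance-like value where 0.0 means equal.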

This answer of course arrives late relative to the question, but hopefully it will be picked up by future visitors.

Use the Jaccard distance: the cardinality (size) of the symmetric difference between the two sets divided by the cardinality of their union. In other words, union minus intersection, all divided by union.

This assumes that the elements can be compared in a discrete fashion, i.e. they are either equal or not. A desirable property is that the Jaccard distance is a metric.
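The definition above can be sketched with Python's built-in set operators (`^` for symmetric difference, `|` for union). The function name `jaccard_distance` and the choice to return 0.0 for two empty sets are assumptions made for this illustration.

```python
def jaccard_distance(a, b):
    """Size of the symmetric difference divided by the size of the union."""
    union = a | b
    if not union:
        # Assumption: two empty sets are identical, so their distance is 0.
        return 0.0
    return len(a ^ b) / len(union)


A = {"z", "x", "c"}
B = {"x", "z", "d", "e"}
print(jaccard_distance(A, B))  # 3 differing elements / 5 in the union = 0.6
```

Identical sets give 0.0 and disjoint sets give 1.0, and swapping the arguments does not change the result.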

Similar question to this one

Assuming OP is asking for something that acts as a "distance", I think it's better to make it 0 when the two sets are identical, in line with the general requirements of a distance function.

It would also be good to have symmetry and the triangle inequality.

Symmetry is intuitive, and the triangle inequality means d(A,C) ≤ d(A,B) + d(B,C).

I suggest something like:

C = A.intersection(B)
Distance = sqrt(len(A-C)*2 + len(B-C)*2)

However, I don't know how to prove the triangle inequality yet.
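Short of a proof, the inequality can at least be checked by brute force on a small universe. The sketch below takes the formula exactly as written (with `* 2`) and tests every triple of subsets of a four-element universe. The helper names `proposed_distance`, `all_subsets` and `triangle_violations` are made up for this illustration, and an empty result is only evidence for that universe, not a proof.

```python
from itertools import combinations, product
from math import sqrt


def proposed_distance(a, b):
    """The suggested formula, taken literally as written above."""
    c = a & b
    return sqrt(len(a - c) * 2 + len(b - c) * 2)


def all_subsets(universe):
    """Every subset of `universe`, as frozensets."""
    items = list(universe)
    return [frozenset(combo)
            for r in range(len(items) + 1)
            for combo in combinations(items, r)]


def triangle_violations(distance, universe, tol=1e-12):
    """All triples (a, b, c) with d(a, c) > d(a, b) + d(b, c)."""
    subsets = all_subsets(universe)
    return [(a, b, c)
            for a, b, c in product(subsets, repeat=3)
            if distance(a, c) > distance(a, b) + distance(b, c) + tol]


violations = triangle_violations(proposed_distance, "abcd")
print(len(violations))  # 0 for this universe
```

Read literally, the expression under the root is twice the size of the symmetric difference of A and B. The size of the symmetric difference satisfies the triangle inequality, and the square root is subadditive, which suggests one route to an actual proof.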


To normalize OP's updated function result, just do score = score / (len(a) + len(b))

which, with the default costs of 1, will give you 1 when a doesn't intersect b, and 0 when a == b.
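A sketch of that normalization follows. The function name `normalized_distance` is hypothetical. The score is computed with set differences, which gives the same total as the two loops in OP's function. Dividing by the cost-weighted worst case is an assumption added here so that the result stays between 0 and 1 when the costs differ; with both costs equal to 1 it reduces to len(a) + len(b), as suggested above.

```python
def normalized_distance(a, b, insertion_cost=1, deletion_cost=1):
    """Cost of turning a into b, scaled by the cost for disjoint sets."""
    # Elements only in a must be deleted, elements only in b must be inserted.
    score = deletion_cost * len(a - b) + insertion_cost * len(b - a)
    # Assumption: normalize by the worst case (a and b share nothing).
    worst = deletion_cost * len(a) + insertion_cost * len(b)
    return score / worst if worst else 0.0


A = {"z", "x", "c"}
B = {"x", "z", "d", "e"}
print(normalized_distance(A, B))                    # (1 + 2) / (3 + 4) = 3/7
print(normalized_distance(A, B, insertion_cost=2))  # (1 + 4) / (3 + 8) = 5/11
```

With unequal costs the result is no longer symmetric, which matches the note in OP's docstring that the operation is not commutative.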
