

Heuristic for clustering elements present across lists if they appear together often

I have a few lists. I want to cluster elements that frequently appear together across the lists.

Details about the lists:

  1. All the elements in the lists are sorted.
  2. No list contains duplicates (so each list can be treated as a set).
  3. The total number of distinct elements across all the lists is large (>5000).
  4. The total number of lists is also large (around ~10,000).
  5. Each list contains around ~1000 elements.

Problem:
For instance, let's say I have the following lists:
L1 = ["Apple", "Banana", "Car", "Carpet", "Cat", "Dog", "Donkey"]
L2 = ["Apple", "Car", "Carpet"]
L3 = ["Ant", "Apple", "Author", "Banana", "Car", "Carpet", "Dog"]
L4 = ["Banana", "Dog", "Donkey"]

Possible Solution:
Some possible clusters for the above lists are:
["Apple", "Car", "Carpet"] (since they appear together in L1, L2, L3)
["Banana", "Dog", "Donkey"] (they appear together in L1, L4)

Objective:

  • To make each cluster as long as possible.
    NOTE: If cluster C1 is a subset of cluster C2, C1 appears together 'x' times, and C2 appears together 'x - delta' times, where delta is very small, then we create only cluster C2. In these cases, the size of the cluster takes priority.
    If the delta is significantly large, we create both clusters.

    Example: In the lists above, C1 = ["Banana", "Dog"] appears together in L1, L3, L4, and C2 = ["Banana", "Dog", "Donkey"] appears together in L1, L4. Here cluster C2 is preferred since it has more elements, and C1 and C2 appear together in almost the same number of lists (C1 appears just once more than C2 - in such cases, maximum length takes priority).
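For concreteness, here's a small Python sketch of the "appears together" count I have in mind (the name `support` is just something I'm using here, not established terminology):

```python
def support(cluster, lists):
    """Number of lists that contain every element of `cluster`."""
    c = set(cluster)
    return sum(1 for lst in lists if c <= set(lst))

L1 = ["Apple", "Banana", "Car", "Carpet", "Cat", "Dog", "Donkey"]
L2 = ["Apple", "Car", "Carpet"]
L3 = ["Ant", "Apple", "Author", "Banana", "Car", "Carpet", "Dog"]
L4 = ["Banana", "Dog", "Donkey"]
lists = [L1, L2, L3, L4]

print(support(["Banana", "Dog"], lists))            # 3 (L1, L3, L4)
print(support(["Banana", "Dog", "Donkey"], lists))  # 2 (L1, L4)
```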

Can someone provide a heuristic or share some views on how to do this?

My thoughts revolve around using intersections between lists or using an inverted index.

Thanks in advance!

I'm going to recommend that you use cosine similarity to solve this. The idea is that each element starts as a cluster of one element. Then you repeatedly use cosines to find the two most similar clusters and merge them. Continue until you're happy with your groups.


Here is the basic math behind that.

Suppose that C is a cluster of elements. We can turn it into a ~10,000-dimension vector V = [v0, v1, ..., vn] by making each list a dimension and putting down, for each list, the number of the cluster's elements present in that list.

The "dot product" C o D of two clusters is just the sum of vi * wi. The "length" of a cluster ||C|| is sqrt(C o C). The "cosine" cos(C, D) between two clusters is (C o D) / (||C|| * ||D||). See, for example, this explanation of why. When the cosine is close to 1, two clusters are very similar. When it is close to 0, they are very different.

Now suppose that we have clusters C, D, and E, and decide to merge C and D into a bigger cluster. Then ||CuD||^2 = (CuD o CuD) = (C o C) + (D o D) + 2 * ||C|| * ||D|| * cos(C, D), from the law of cosines. And, finally, CuD o E = (C o E) + (D o E). This allows us to combine clusters without having to recalculate everything from scratch.
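To make those identities concrete, here's a quick sketch that builds the vectors for the toy lists from the question and checks the merge formulas (all the names here are illustrative):

```python
# Each cluster becomes a vector with one dimension per list: the count of
# the cluster's elements present in that list.
lists = [
    ["Apple", "Banana", "Car", "Carpet", "Cat", "Dog", "Donkey"],
    ["Apple", "Car", "Carpet"],
    ["Ant", "Apple", "Author", "Banana", "Car", "Carpet", "Dog"],
    ["Banana", "Dog", "Donkey"],
]

def vectorize(cluster):
    c = set(cluster)
    return [len(c & set(lst)) for lst in lists]

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

C = vectorize(["Banana", "Dog"])
D = vectorize(["Donkey"])
E = vectorize(["Apple"])
merged = vectorize(["Banana", "Dog", "Donkey"])

# The merged cluster's vector is the component-wise sum of C and D,
# so both identities follow:
assert merged == [c + d for c, d in zip(C, D)]
# ||CuD||^2 = (C o C) + (D o D) + 2 * (C o D), where 2*(C o D) is the
# 2 * ||C|| * ||D|| * cos(C, D) term from the text.
assert dot(merged, merged) == dot(C, C) + dot(D, D) + 2 * dot(C, D)
# CuD o E = (C o E) + (D o E)
assert dot(merged, E) == dot(C, E) + dot(D, E)
```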


OK, enough math; now let's talk about programming.

You have lists of ~1000 elements, with approximately ~500,000 pairs each. You have 10,000 of these, for approximately 5,000,000,000 data points of interest. That isn't a huge data set, but it is enough that we need to not be naive about how we do it.

The technique you need is called MapReduce. A map-reduce has the following phases:

  1. For each input record, emit some (key, value) pairs through a map function.
  2. The framework groups by key.
  3. Then your reduce function gets (key, [value1, value2, value3, ...]) and does something with it.

This works well for distributed programming because you can run as many mappers as you want in parallel, the framework can do the heavy lifting for step 2, and then you can run your reducers in parallel. This paradigm can handle a wide variety of problems.

But for a problem of your size, you can ALSO just fake it. You "emit" pairs by writing key value lines to a file. The heavy lifting for step 2 is done by a Unix command:

LC_ALL=C sort file_in > file_out

And now your keys have been grouped together, letting you easily gather them together and reduce them.

So given a list L we simply:

def mapper(L):
    # j starts at i so we also emit (x, x) self-pairs, whose counts
    # are the squared lengths ||x||^2.
    for i in range(len(L)):
        for j in range(i, len(L)):
            emit((L[i], L[j]), 1)

And then the reducer will get:

def reducer(key, values):
    elt1, elt2 = key
    # Every value is a 1, so the count of values is the dot product.
    dot_product = len(values)
    ... record this somewhere ...

And now we have calculated all non-zero dot products between elements. Any missing pairs are zeros. If you have ~5000 elements, you'll get up to ~12,500,000 dot products, from which we can calculate ~5000 lengths and ~12,500,000 cosines.
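At toy scale you can sketch the whole map/group/reduce pipeline in memory; here a `Counter` stands in for the file-plus-Unix-sort grouping (illustrative only):

```python
from collections import Counter
from itertools import combinations_with_replacement

lists = [
    ["Apple", "Banana", "Car", "Carpet", "Cat", "Dog", "Donkey"],
    ["Apple", "Car", "Carpet"],
    ["Ant", "Apple", "Author", "Banana", "Car", "Carpet", "Dog"],
    ["Banana", "Dog", "Donkey"],
]

# Map: emit ((a, b), 1) for every pair in each list, including self-pairs
# (which give the squared lengths). Group + reduce: the Counter tallies
# each key, which is exactly the dot product.
dot_products = Counter()
for lst in lists:
    for a, b in combinations_with_replacement(sorted(lst), 2):
        dot_products[(a, b)] += 1

print(dot_products[("Car", "Carpet")])  # 3 -- they co-occur in L1, L2, L3
print(dot_products[("Dog", "Dog")])     # 3 -- squared length of "Dog"
```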

Next programming concept: a priority queue, which can be implemented with a heap. (We'll need to use some properties of a heap.) So the basic idea is as follows:

put all pairs of distinct elements in a priority queue, ordered by max cosine.
put all elements in a lookup saying they are valid clusters
for some number of times:
    take the highest cosine pair
    if both are still valid:
        create a new cluster that is the union of the two
        calculate its lengths and cosines with all other valid clusters
            (The necessary math is above.)
        stick the cosine pairs into the queue
        take the original clusters out of the valid lookup
        add the new cluster as valid
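Here is a hedged, brute-force sketch of that loop on the toy data, using Python's heapq with lazy invalidation. It recomputes vectors and cosines from scratch rather than using the incremental formulas, and the stopping threshold is an assumption for this sketch:

```python
import heapq
import itertools
import math

lists = [
    ["Apple", "Banana", "Car", "Carpet", "Cat", "Dog", "Donkey"],
    ["Apple", "Car", "Carpet"],
    ["Ant", "Apple", "Author", "Banana", "Car", "Carpet", "Dog"],
    ["Banana", "Dog", "Donkey"],
]

def vec(members):
    # One dimension per list: how many of the cluster's elements it contains.
    return [len(members & set(lst)) for lst in lists]

def cosine(v, w):
    d = sum(a * b for a, b in zip(v, w))
    return d / math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))

# Singleton clusters; integer ids let us invalidate merged clusters lazily.
elements = sorted(set().union(*(set(l) for l in lists)))
clusters = {i: (frozenset([e]), vec(frozenset([e]))) for i, e in enumerate(elements)}
next_id = len(clusters)
valid = set(clusters)

# Max-heap of candidate merges via negated cosine.
heap = [(-cosine(clusters[i][1], clusters[j][1]), i, j)
        for i, j in itertools.combinations(sorted(clusters), 2)]
heapq.heapify(heap)

MIN_COSINE = 0.9  # stopping threshold -- an assumption for this sketch

while heap:
    negc, i, j = heapq.heappop(heap)
    if -negc < MIN_COSINE:
        break
    if i not in valid or j not in valid:
        continue  # stale pair referencing an already-merged cluster
    members = clusters[i][0] | clusters[j][0]
    clusters[next_id] = (members, vec(members))
    valid -= {i, j}
    for k in valid:
        heapq.heappush(heap, (-cosine(clusters[next_id][1], clusters[k][1]),
                              next_id, k))
    valid.add(next_id)
    next_id += 1

for i in sorted(valid):
    print(sorted(clusters[i][0]))
```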

Now this has a problem. Every time we merge two groups, we correctly record what is valid and what is not. But thousands of pairs are now garbage, and thousands more get added to the priority queue. The queue grows rapidly.

We solve this by doing garbage collection every so often, such as every time the queue doubles in size. For that we simply go through the array underlying the heap and make a new array of just the valid entries. That new array is not a heap, but we can call the heapify operation on it to turn it back into a heap, and therefore a priority queue. And then go back to work.

And now you simply run this until you hit some sort of stopping condition, such as having the desired number of groups, or the cosine getting too small.

And then your final list of valid groups is your answer.
