Heuristic for clustering elements present across lists if they appear together often
I have a few lists. I want to cluster the elements if they come together in the lists often.
Details about lists:
Problem:
For instance, let's say I have the following lists:
L1 = ["Apple", "Banana", "Car", "Carpet", "Cat", "Dog", "Donkey"]
L2 = ["Apple", "Car", "Carpet"]
L3 = ["Ant", "Apple", "Author", "Banana", "Car", "Carpet", "Dog"]
L4 = ["Banana", "Dog", "Donkey"]
Possible Solution:
Some possible clusters for the above lists are:
["Apple", "Car", "Carpet"] (since they appear together in L1, L2, L3)
["Banana", "Dog", "Donkey"] (they appear together in L1, L4)
Objective:
Can someone provide a heuristic or share some views on how to do this?
My thoughts are revolving around using intersections between lists or using inverted indices.
Thanks in advance!
I'm going to recommend that you use cosine similarity to solve this. The idea is that each element starts as a cluster of one element. Then you use cosines to keep finding the two most similar clusters and merging them. Continue until you're happy with your groups.
Here is the basic math behind that.
Suppose that C is a cluster of elements. We can turn it into a ~10,000 dimension vector V = [v0, v1, ..., vn] by making each list a dimension, and just putting down the number of elements of the cluster in that list.
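As a concrete sketch of that representation, using the four example lists from the question (the `vector` helper name is mine, not from the answer):

```python
# Each list is a dimension; the entry is how many of the cluster's
# elements that list contains. These are the question's example lists.
lists = [
    ["Apple", "Banana", "Car", "Carpet", "Cat", "Dog", "Donkey"],
    ["Apple", "Car", "Carpet"],
    ["Ant", "Apple", "Author", "Banana", "Car", "Carpet", "Dog"],
    ["Banana", "Dog", "Donkey"],
]

def vector(cluster, lists):
    """Turn a cluster (a set of elements) into one count per list."""
    return [sum(1 for e in cluster if e in lst) for lst in lists]

print(vector({"Apple"}, lists))                   # [1, 1, 1, 0]
print(vector({"Apple", "Car", "Carpet"}, lists))  # [3, 3, 3, 0]
```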
The "dot product" C o D of two clusters, with vectors V and W, is just the sum of vi * wi. The "length" of a cluster ||C|| is sqrt(C o C). The "cosine" cos(C, D) between two clusters is (C o D) / ||C|| / ||D||. See, for example, this explanation of why. When cosine is close to 1, two clusters are very similar. When it is close to 0, they are very different.
Now suppose that we have clusters C, D and E, and decide to merge C and D into a bigger cluster. Then ||CuD||^2 = (CuD o CuD) = (C o C) + (D o D) + 2 * ||C|| * ||D|| * cos(C, D) from the law of cosines. And, finally, CuD o E = (C o E) + (D o E). This allows us to combine clusters without having to recalculate everything from scratch.
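Those two identities can be checked numerically. A sketch with made-up count vectors for hypothetical clusters C, D and E over four lists:

```python
import math

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

# Hypothetical count vectors for clusters C, D, E (four list-dimensions).
C = [3, 3, 3, 0]
D = [1, 0, 1, 1]
E = [2, 0, 1, 1]

len_C, len_D = math.sqrt(dot(C, C)), math.sqrt(dot(D, D))
cos_CD = dot(C, D) / len_C / len_D

# Law-of-cosines update: ||CuD||^2 from quantities we already have.
merged_sq = dot(C, C) + dot(D, D) + 2 * len_C * len_D * cos_CD

CuD = [c + d for c, d in zip(C, D)]  # merging just adds the vectors
assert abs(merged_sq - dot(CuD, CuD)) < 1e-9

# Dot products with other clusters also combine additively.
assert dot(CuD, E) == dot(C, E) + dot(D, E)
```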
OK, enough math, now let's talk about programming.
You have lists of ~1000 elements, with approximately ~500,000 pairs each. You have 10,000 of these, for approximately 5,000,000,000 data points of interest. That isn't a huge data set, but it is enough that we need to not be naive about how we do it.
The technique you need is called MapReduce. A map-reduce has the following phases:
1. For each input record, emit some (key, value) pairs through a map function.
2. The framework groups by key.
3. Your reduce function gets (key, [value1, value2, value3, ...]) and does something with it.
This works well for distributed programming because you can run as many mappers as you want in parallel, the framework can do the heavy lifting for step 2, and then you can run your reducers in parallel. And this paradigm can handle a wide variety of problems.
But for a problem of your size, you can ALSO just fake it. You "emit" pairs by writing key value lines to a file. The heavy lifting for step 2 is done by a Unix command:
LC_ALL=C sort file_in > file_out
And now your keys have been grouped together, letting you easily gather them together and reduce them.
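A sketch of how that grouping looks in-process, assuming tab-separated key/value lines (the sample pairs are made up):

```python
from itertools import groupby

# Faking the shuffle step: sorting "key<TAB>value" lines puts equal
# keys next to each other, so one pass with groupby can play reducer.
emitted = [
    "Apple\tCar\t1",
    "Banana\tDog\t1",
    "Apple\tCar\t1",
    "Apple\tCarpet\t1",
]
emitted.sort()  # stands in for: LC_ALL=C sort file_in > file_out

counts = {
    key: sum(1 for _ in group)
    for key, group in groupby(emitted, key=lambda ln: ln.rsplit("\t", 1)[0])
}
print(counts)  # {'Apple\tCar': 2, 'Apple\tCarpet': 1, 'Banana\tDog': 1}
```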
So given a list L we simply:
def mapper(L):
    # Emit every within-list pair once; the i == j pairs supply
    # the C o C terms needed for the lengths.
    for i in range(len(L)):
        for j in range(i, len(L)):
            emit((L[i], L[j]), 1)
And then the reducer will get:
def reducer(key, values):
    elt1, elt2 = key
    # Every value is a 1 from the mapper, so the count is the dot product.
    dot_product = len(values)
    # ... record this somewhere ...
And now we have calculated all non-zero dot products between elements. Any missing pairs are zeros. If you have ~5000 elements, you'll get ~12,500,000 dot products, from which we can calculate ~5000 lengths and ~12,500,000 cosines.
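For data as small as the question's example, you can also compute these in memory. A sketch with `collections.Counter`, where the self-pairs give the squared lengths and missing pairs default to zero:

```python
from collections import Counter
from itertools import combinations_with_replacement

lists = [
    ["Apple", "Banana", "Car", "Carpet", "Cat", "Dog", "Donkey"],
    ["Apple", "Car", "Carpet"],
    ["Ant", "Apple", "Author", "Banana", "Car", "Carpet", "Dog"],
    ["Banana", "Dog", "Donkey"],
]

# Count every within-list pair; (e, e) pairs are the squared lengths.
dots = Counter()
for lst in lists:
    for a, b in combinations_with_replacement(sorted(lst), 2):
        dots[(a, b)] += 1

print(dots[("Apple", "Apple")])    # 3 -- Apple appears in three lists
print(dots[("Apple", "Car")])      # 3 -- they co-occur in three lists
print(dots[("Banana", "Donkey")])  # 2 -- together in L1 and L4
```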
Next programming concept: a priority queue, which can be implemented with a heap. (We'll need to use some properties of a heap.) So the basic idea is as follows:
put all pairs of distinct elements in a priority queue, ordered by max cosine.
put all elements in a lookup saying they are valid clusters
for some number of times:
take the highest cosine pair
if both are still valid:
create a new cluster that is the union of the two
calculate its lengths and cosines with all other valid clusters
(The necessary math is above.)
stick the cosine pairs into the queue
take the original clusters out of the valid lookup
add the new cluster as valid
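The steps above can be sketched as a runnable loop on the question's example lists. The 0.99 stopping threshold and the bookkeeping details (integer cluster ids, a `Counter` of dot products) are my own choices, not part of the answer:

```python
import heapq
import math
from collections import Counter
from itertools import combinations

lists = [
    ["Apple", "Banana", "Car", "Carpet", "Cat", "Dog", "Donkey"],
    ["Apple", "Car", "Carpet"],
    ["Ant", "Apple", "Author", "Banana", "Car", "Carpet", "Dog"],
    ["Banana", "Dog", "Donkey"],
]

elements = sorted({e for lst in lists for e in lst})
index = {e: i for i, e in enumerate(elements)}
clusters = {i: {e} for i, e in enumerate(elements)}  # id -> element set

# Dot products between cluster ids; (i, i) holds C o C.
dot = Counter()
for lst in lists:
    ids = sorted(index[e] for e in lst)
    for i in ids:
        dot[(i, i)] += 1
    for i, j in combinations(ids, 2):
        dot[(i, j)] += 1

def cos(i, j):
    lo, hi = min(i, j), max(i, j)
    return dot[(lo, hi)] / math.sqrt(dot[(i, i)] * dot[(j, j)])

# Priority queue of candidate merges, best cosine first (negated for heapq).
heap = [(-cos(i, j), i, j) for i, j in combinations(clusters, 2) if cos(i, j) > 0]
heapq.heapify(heap)

next_id = len(elements)
while heap:
    neg, i, j = heapq.heappop(heap)
    if i not in clusters or j not in clusters:
        continue         # stale pair: one side was already merged away
    if -neg < 0.99:      # stopping condition (arbitrary threshold)
        break
    new, next_id = next_id, next_id + 1
    # Incremental updates from the math above -- nothing is recomputed.
    dot[(new, new)] = dot[(i, i)] + dot[(j, j)] + 2 * dot[(min(i, j), max(i, j))]
    for k in clusters:
        if k not in (i, j):
            dot[(k, new)] = dot[(min(i, k), max(i, k))] + dot[(min(j, k), max(j, k))]
    merged = clusters.pop(i) | clusters.pop(j)
    for k in clusters:
        c = cos(k, new)
        if c > 0:
            heapq.heappush(heap, (-c, k, new))
    clusters[new] = merged

print(sorted(sorted(c) for c in clusters.values()))
```

With this threshold the lists above yield {Apple, Car, Carpet}, {Banana, Dog}, {Ant, Author}, and singletons; Donkey only reaches cosine ~0.82 with the Banana/Dog cluster, so a lower threshold would be needed to pull it in.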
Now this has a problem. Every time we merge two groups we correctly record what is valid and what is not. But we have thousands of pairs that are now garbage, and thousands more that are added to the priority queue. The queue grows rapidly.
We solve this by doing garbage collection every so often, such as every time the queue doubles in size. For that we simply go through the array underlying the heap, and make a new array of only the valid entries. That new array is not a heap, but we can call the heapify operation on it to turn it back into a heap, and therefore a priority queue. And then go back to work.
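A tiny sketch of that garbage-collection step (the `valid` id set and the heap entries here are made up):

```python
import heapq

# Drop entries that reference merged-away clusters, then heapify the
# survivors: heapify is O(n), cheaper than re-pushing one at a time.
valid = {0, 2, 5}
heap = [(-0.9, 0, 2), (-0.8, 1, 3), (-0.7, 0, 5), (-0.6, 2, 4)]

heap = [(c, i, j) for c, i, j in heap if i in valid and j in valid]
heapq.heapify(heap)
print(heap[0])  # (-0.9, 0, 2) -- still the best remaining pair
```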
And now you simply run this until you hit some sort of stopping condition, such as having the desired number of groups, or the cosine getting too small. And then your final list of valid groups is your answer.