
How to efficiently find similar documents

I have lots of documents that I have clustered using a clustering algorithm. In the clustering algorithm, each document may belong to more than one cluster. I've created one table storing the document-cluster assignments and another storing the cluster-document information. When I look for the list of documents similar to a given document (let's say d_i), I first retrieve the list of clusters to which it belongs (from the document-cluster table), and then for each cluster c_j in that list I retrieve the documents belonging to c_j from the cluster-document table. There is more than one c_j, so there will be multiple lists. Each list has many documents, and there may be overlaps among these lists.

In the next phase, in order to find the documents most similar to d_i, I rank the similar documents by the number of clusters they have in common with d_i.

My question is about this last phase. A naive solution is to build a sorted HashMap-like structure with the document as the key and the number of common clusters as the value. However, as each list might contain many documents, this may not be the best solution. Is there any other way to rank the similar items? Any preprocessing, or something else?
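For reference, the naive counting approach can be sketched in Python with a `Counter` standing in for the sorted HashMap (the table contents below are made up for illustration):

```python
from collections import Counter

# Hypothetical tables: document -> clusters, and cluster -> documents.
doc_clusters = {"d1": ["c1", "c2"], "d2": ["c1"], "d3": ["c2", "c3"], "d4": ["c3"]}
cluster_docs = {"c1": ["d1", "d2"], "c2": ["d1", "d3"], "c3": ["d3", "d4"]}

def rank_similar(doc_id):
    """Count shared clusters per candidate document and rank by that count."""
    counts = Counter()
    for c in doc_clusters[doc_id]:      # clusters of the query document
        for d in cluster_docs[c]:       # documents in each of those clusters
            if d != doc_id:
                counts[d] += 1
    return counts.most_common()         # (document, #common clusters), highest first
```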

Assuming that the number of arrays is relatively small compared to the number of elements (in particular, that the number of arrays is in o(log n)), you can do it with a modification of a bucket sort:

Let m be the number of arrays. Create a list buckets[] containing m buckets, where each bucket[i] is a hashset.

for each array arr:
   for each element x in arr:
      find if x is in any bucket, if so - let that bucket id be i:
          remove x from bucket i  
          i <- i + 1
      If no such bucket exists, set i=1
      add x to bucket i

for each bucket i=m,m-1,...,1 in descending order:
   for each element x in bucket[i]:
      yield x
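A direct Python translation of the pseudocode above might look like the following (a sketch that assumes each array holds distinct elements, and uses a linear scan over the buckets to find an element, as in the pseudocode):

```python
def rank_by_occurrence(arrays):
    """Modified bucket sort: buckets[i] holds elements seen in exactly i+1 arrays."""
    m = len(arrays)
    buckets = [set() for _ in range(m)]
    for arr in arrays:
        for x in arr:
            # Find the bucket currently holding x (O(m) linear scan).
            i = next((j for j, b in enumerate(buckets) if x in b), -1)
            if i >= 0:
                buckets[i].remove(x)
            buckets[i + 1].add(x)   # move x up one bucket (or into the first one)
    # Yield elements from the highest-occurrence bucket down.
    for i in range(m - 1, -1, -1):
        for x in buckets[i]:
            yield x
```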

The above runs in O(m^2 * n):

  • Iterating over each array
  • Iterating over all elements in each array
  • Finding the relevant bucket

Note that the last step can be done in O(1) using a hash table by maintaining a map element -> bucket_id, so we can improve the total to O(m*n).
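A sketch of that improvement, where a dict tracks each element's current bucket so the lookup is O(1) instead of a scan over all buckets:

```python
def rank_by_occurrence_fast(arrays):
    """Same bucket scheme, but a dict gives O(1) lookup of an element's bucket."""
    m = len(arrays)
    buckets = [set() for _ in range(m)]
    bucket_of = {}  # element -> index of the bucket it currently sits in
    for arr in arrays:
        for x in arr:
            i = bucket_of.get(x, -1)     # O(1) instead of scanning m buckets
            if i >= 0:
                buckets[i].remove(x)
            buckets[i + 1].add(x)
            bucket_of[x] = i + 1
    for i in range(m - 1, -1, -1):       # highest-occurrence bucket first
        for x in buckets[i]:
            yield x
```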


An alternative is to use a hashmap as a histogram that maps each element to its number of occurrences, and then sort the array of all elements based on the histogram. The benefit of this approach is that it can be distributed very nicely with map-reduce:

map(partial list of elements l):
    for each element x:
       emit(x,'1')
reduce(x, list<number>):
   s = sum{list}
   emit(x,s)
combine(x,list<number>):
   s = sum{list} //or size{list} for a combiner
   emit(x,s)
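On a single machine the same histogram can be built with a `Counter`; the map/reduce/combine split above only matters when the lists are spread across workers. A minimal local sketch:

```python
from collections import Counter

def histogram_rank(arrays):
    """Build an element -> occurrence-count histogram, then rank by count."""
    hist = Counter()
    for arr in arrays:       # "map" phase: one occurrence per element per array
        hist.update(arr)     # "reduce" phase: sum occurrences per element
    return [x for x, _ in hist.most_common()]  # most frequent element first
```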
