
How to efficiently find similar documents

I have lots of documents that I have clustered using a clustering algorithm. In this algorithm, each document may belong to more than one cluster. I've created one table storing the document-cluster assignments and another storing the cluster-document information. When I look for the documents similar to a given document (let's say d_i), I first retrieve the list of clusters to which it belongs (from the document-cluster table), and then for each cluster c_j in that list I retrieve the list of documents belonging to c_j from the cluster-document table. Since there is more than one c_j, I end up with multiple lists. Each list can contain many documents, and there may be overlaps among the lists.

In the next phase, in order to find the documents most similar to d_i, I rank the candidate documents by the number of clusters they have in common with d_i.

My question is about this last phase. A naive solution is to build a sorted kind of HashMap with the document as the key and the number of common clusters as the value. However, since each list may contain a great many documents, this may not be the best solution. Is there any other way to rank the similar items? Any preprocessing, or something else?
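For reference, here is a minimal Java sketch of that naive approach; the input `lists` stands in for the per-cluster document lists retrieved for d_i, and all identifiers are hypothetical:

    import java.util.*;

    public class NaiveRanking {
        // Rank candidate documents by how many clusters they share with d_i.
        static List<Map.Entry<String, Integer>> rank(List<List<String>> lists) {
            Map<String, Integer> commonClusters = new HashMap<>();
            for (List<String> docsInCluster : lists) {
                for (String doc : docsInCluster) {
                    commonClusters.merge(doc, 1, Integer::sum); // count shared clusters
                }
            }
            // Sort documents by descending number of shared clusters.
            List<Map.Entry<String, Integer>> ranked = new ArrayList<>(commonClusters.entrySet());
            ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
            return ranked;
        }

        public static void main(String[] args) {
            List<List<String>> lists = List.of(
                List.of("d1", "d2", "d3"),
                List.of("d2", "d3"),
                List.of("d3"));
            System.out.println(rank(lists)); // [d3=3, d2=2, d1=1]
        }
    }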

Assuming that the number of arrays is relatively small compared to the number of elements (in particular, that the number of arrays is in o(log n)), you can do it with a modification of bucket sort:

Let m be the number of arrays. Create an array of m buckets, buckets[], where each buckets[i] is a hash set.

for each array arr:
   for each element x in arr:
      find if x is already in some bucket; if so, let that bucket's id be i:
          remove x from bucket i
          i <- i + 1
      if no such bucket exists, set i = 1
      add x to bucket i

for each bucket i = m, m-1, ..., 1 in descending order:
   for each element x in bucket[i]:
      yield x

The above runs in O(m^2 * n), where n is the number of elements per array:

  • Iterating over each array: O(m)
  • Iterating over all elements in each array: O(n)
  • Finding the relevant bucket: a linear scan over up to m buckets, O(m)

Note that the last step can be done in O(1) by additionally maintaining a map element -> bucket_id in a hash table, which improves the total to O(m*n).
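Here is a minimal Java sketch of this bucket approach, including the element -> bucket_id map just mentioned; the identifiers are illustrative, and it assumes each document appears at most once per cluster list:

    import java.util.*;

    public class BucketRank {
        // Rank elements by occurrence count across m lists using m buckets.
        static List<String> rank(List<List<String>> arrays) {
            int m = arrays.size();
            List<Set<String>> buckets = new ArrayList<>();
            for (int i = 0; i <= m; i++) buckets.add(new HashSet<>()); // index 0 unused
            // bucketOf maps each seen element to its current bucket index (1-based),
            // making the "find its bucket" step O(1).
            Map<String, Integer> bucketOf = new HashMap<>();

            for (List<String> arr : arrays) {
                for (String x : arr) {
                    int i = bucketOf.getOrDefault(x, 0); // 0 means "not seen yet"
                    if (i > 0) buckets.get(i).remove(x); // move x up one bucket
                    i++;
                    buckets.get(i).add(x);
                    bucketOf.put(x, i);
                }
            }

            // Emit elements from the highest bucket (most shared clusters) down.
            List<String> result = new ArrayList<>();
            for (int i = m; i >= 1; i--) result.addAll(buckets.get(i));
            return result;
        }
    }

Bucket count is bounded by m because an element can be promoted at most once per array.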


An alternative is to use a hash map as a histogram that maps each element to its number of occurrences, and then sort all the elements by their histogram counts. The benefit of this approach: it can be distributed very nicely with map-reduce:

map(partial list of elements l):
    for each element x in l:
        emit(x, 1)

combine(x, list<number>):   // optional, runs map-side on partial results
    emit(x, sum{list})      // at this stage all values are 1, so size{list} also works

reduce(x, list<number>):
    emit(x, sum{list})      // values may be partial sums from combiners, so sum
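On a single machine, the same histogram-then-sort idea can be expressed with Java streams, which parallelize in the same map/combine/reduce spirit. This is a sketch, not tied to any particular MapReduce framework, and the names are hypothetical:

    import java.util.*;
    import java.util.stream.*;
    import static java.util.stream.Collectors.*;

    public class HistogramRank {
        static List<String> rank(List<List<String>> arrays) {
            // "map": flatten all lists into one stream of elements;
            // "reduce": group by element and count occurrences.
            Map<String, Long> histogram = arrays.parallelStream()
                    .flatMap(List::stream)
                    .collect(groupingByConcurrent(x -> x, counting()));

            // Sort elements by descending count, i.e. by shared clusters.
            return histogram.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .map(Map.Entry::getKey)
                    .collect(toList());
        }
    }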
