
Lucene: Multithreaded document duplication

I have multiple threads which perform searches in the Lucene index. Before each search, there is a check whether the content is already indexed, and if not, it is added to the index. If two parallel searches on unindexed content occur at the same time, there will be duplicated documents, and I guess the results of the search will be messed up.

I have found the following method: IndexWriter.updateDocument

but I think it does not solve the multithreading problem I am facing.
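For reference, updateDocument replaces any existing documents matching a term before adding the new one. A minimal sketch of the usual call, assuming each document carries a unique id field (the helper name is mine):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class Indexer {
    // updateDocument atomically deletes the document(s) whose "id" term
    // matches and then adds the new document, so the index will not end
    // up with duplicates for that id. It does not, however, stop two
    // threads from both deciding the content is unindexed and doing the
    // indexing work twice.
    static void addOrReplace(IndexWriter writer, String id, Document doc)
            throws IOException {
        writer.updateDocument(new Term("id", id), doc);
    }
}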

Any suggestions how to resolve this are appreciated.

First, make sure there is only one call to IndexWriter#updateDocument() at a time. You can achieve this with a lock object shared between your threads, like this:

class Search implements Runnable {
    // The lock and the flag must be shared by ALL threads: as plain
    // instance fields, each Search object would get its own copies,
    // which would defeat the purpose.
    private static final Object lock = new Object();
    private static volatile boolean found = false;

    public void run() {
        // business
        if (/* <<found something!>> && */ !found) {
            synchronized (lock) {
                if (!found) { // re-check under the lock so only one thread wins
                    // call the related method, e.g. IndexWriter#updateDocument()
                    found = true;
                }
            }
        }
        // business
    }
}

Second, you need to track every key found during the search to avoid duplication, for example by checking the key against a shared set or with a simple boolean flag.
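A minimal sketch of such tracking with a concurrent set (the class and method names are illustrative):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SeenKeys {
    // Thread-safe set of keys that have already been handled.
    private static final Set<String> seen = ConcurrentHashMap.newKeySet();

    // add() returns false if the key was already present, so exactly
    // one thread "wins" for each key and gets to index it.
    static boolean firstToSee(String key) {
        return seen.add(key);
    }
}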

Also, beware of wasted work: consider signalling the other threads to abort their searches if you only need the very first keys found; whether that is worthwhile depends on your business logic.

If you're not able to modify the source of your updates/additions to be smarter about avoiding duplicates, then you'll have to create a choke point somewhere. The goal is simply to do it with the least amount of contention possible.

One way to do it would be to have a request queue, a work queue and a ConcurrentHashMap for lookups. All new requests are added to the request queue which is processed by a single "gatekeeper" thread. The gatekeeper can take one request at a time or drain the queue and process all pending requests in a loop to reduce contention on that end.

In order to process a request, the gatekeeper does putIfAbsent on the ConcurrentHashMap. If the return value is null, the update/insert request can be added to the actual work queue. If the value was already in the map, then.... see #2 below. Realistically you could use more than 1 gatekeeper since putIfAbsent is atomic, but it'd just increase contention on the HashMap. The gatekeeper's actual processing time is so low that you don't really gain anything by throwing more of them at the request queue.

The work queue threads will be able to process multiple updates/insertions concurrently as long as they don't modify the same record. When the work queue threads finish processing a request, they remove the value from the ConcurrentHashMap so that the gatekeeper knows it's safe to modify that record again.
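Here is a rough sketch of that flow; Request, its key() accessor, and process() are stand-ins for your actual request type and Lucene update logic, not a definitive implementation:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

record Request(String key /* plus payload fields */) {}

class Gatekeeper implements Runnable {
    private final BlockingQueue<Request> requests = new LinkedBlockingQueue<>();
    private final BlockingQueue<Request> work;
    private final ConcurrentHashMap<String, Boolean> inFlight;

    Gatekeeper(BlockingQueue<Request> work,
               ConcurrentHashMap<String, Boolean> inFlight) {
        this.work = work;
        this.inFlight = inFlight;
    }

    // All new requests go through here, onto the single request queue.
    void submit(Request r) throws InterruptedException {
        requests.put(r);
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Request r = requests.take();
                // putIfAbsent is atomic: null means the key was not in
                // flight, so this request may proceed to the work queue.
                if (inFlight.putIfAbsent(r.key(), Boolean.TRUE) == null) {
                    work.put(r);
                }
                // else: a duplicate is already in flight -- drop it,
                // requeue it later, or notify the requester (see #2 below).
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class Worker implements Runnable {
    private final BlockingQueue<Request> work;
    private final ConcurrentHashMap<String, Boolean> inFlight;

    Worker(BlockingQueue<Request> work,
           ConcurrentHashMap<String, Boolean> inFlight) {
        this.work = work;
        this.inFlight = inFlight;
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Request r = work.take();
                try {
                    process(r); // the actual Lucene update/insert
                } finally {
                    // Removing the key tells the gatekeeper it is safe
                    // to let another request for this record through.
                    inFlight.remove(r.key());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(Request r) {
        // stand-in for IndexWriter.updateDocument(...) etc.
    }
}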

--

Some things to think about:

1) How do you want to define what can be done simultaneously? The deduplication key probably shouldn't be a hash of the full request, because you wouldn't want two different requests to modify the same document at the same time, would you?

2) What do you do with requests that cannot currently be processed because they have duplicates in the queue already (or requests that modify the same doc, as in point #1)? Throw them out? Put them in a secondary updating queue that tries again periodically? How do you respond to the original requester if its request is in an indefinite holding pattern?

3) Does the order in which requests are processed matter?
