
SolrJ - Asynchronously indexing documents with ContentStreamUpdateRequest

I am using the SolrJ API 4.8 to index rich documents into Solr, but I want to index these documents asynchronously. The function I wrote sends documents synchronously, and I don't know how to change it so that it works asynchronously. Any ideas?

Function:

public Boolean indexDocument(HttpSolrServer server, String PathFile, InputReader external)
{
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");

    try {
        up.addFile(new File(PathFile), "text");
    } catch (IOException e) {
        Logger.getLogger(ANOIndexer.class.getName()).log(Level.SEVERE, null, e);
        return false;
    }

    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    try {
        server.request(up);
    } catch (SolrServerException e) {
        Logger.getLogger(ANOIndexer.class.getName()).log(Level.SEVERE, null, e);
        return false;
    } catch (IOException e) {
        Logger.getLogger(ANOIndexer.class.getName()).log(Level.SEVERE, null, e);
        return false;
    }
    return true;
}

Solr server: version 4.8

It sounds like you might want to look at using ExecutorService and FutureTask to do this:

private static HttpSolrServer server;   // initialize this with your Solr base URL before running
private static int threadPoolSize = 4;  // set this to something appropriate for your environment

public static void main(String[] args) {
    ExecutorService executor = Executors.newFixedThreadPool(threadPoolSize);
    ArrayList<FutureTask<Boolean>> taskList = new ArrayList<FutureTask<Boolean>>();
    ArrayList<String> paths = new ArrayList<String>();
    //Initialize your list of paths here

    for (String path : paths) {
        FutureTask<Boolean> futureTask = new FutureTask<Boolean>(new IndexDocumentTask(path));
        taskList.add(futureTask);
        executor.execute(futureTask);
    }

    for (int i = 0; i < taskList.size(); i++) {
        FutureTask<Boolean> futureTask = taskList.get(i);

        try {
            System.out.println("Index Task " + i + (futureTask.get() ? " finished successfully." : " encountered an error."));
        } catch (ExecutionException e) {
            System.out.println("An Execution Exception occurred with Index Task " + i);
        } catch (InterruptedException e) {
            System.out.println("An Interrupted Exception occurred with Index Task " + i);
        }
    }

    executor.shutdown();
}

static class IndexDocumentTask implements Callable<Boolean> {

    private String pathFile;

    public IndexDocumentTask(String pathFile) {
        this.pathFile = pathFile;
    }

    @Override
    public Boolean call() {
        return indexDocument(pathFile);
    }

    public Boolean indexDocument(String pathFile) {
        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");

        try {
            up.addFile(new File(pathFile), "text");
        } catch (IOException e) {
            Logger.getLogger(ANOIndexer.class.getName()).log(Level.SEVERE, null, e);
            return false;
        }

        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        try {
            server.request(up);
        } catch (SolrServerException e) {
            Logger.getLogger(ANOIndexer.class.getName()).log(Level.SEVERE, null, e);
            return false;

        } catch (IOException e) {
            Logger.getLogger(ANOIndexer.class.getName()).log(Level.SEVERE, null, e);
            return false;
        }
        return true;
    }
}

This is untested code, so I'm not sure if calling server.request(up) like that is thread-safe. I figured it was cleaner to just use one instance of HttpSolrServer, but you could also create new HttpSolrServer instances in each task.
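For example, a per-task instance could look like the following minimal sketch. The base URL is a placeholder, and it assumes indexDocument is changed to accept the server as a parameter, as in the question's original method:

// Sketch only: give each task its own HttpSolrServer instead of the shared static one.
// "http://localhost:8983/solr" is a placeholder base URL.
@Override
public Boolean call() {
    HttpSolrServer taskServer = new HttpSolrServer("http://localhost:8983/solr");
    try {
        return indexDocument(taskServer, pathFile); // indexDocument modified to take the server as a parameter
    } finally {
        taskServer.shutdown(); // release the underlying HttpClient resources
    }
}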

If you wanted to, you could augment IndexDocumentTask to implement Callable<Tuple<String, Boolean>>, so that you could retrieve both the filename of the document that was indexed and whether or not the indexing was successful.
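Since the JDK has no built-in Tuple class, here is a minimal sketch of that idea using java.util.AbstractMap.SimpleEntry in its place (the class name IndexDocumentResultTask is just an example, and it reuses the IndexDocumentTask above):

// Sketch: return the file path together with the success flag.
// AbstractMap.SimpleEntry stands in for the Tuple type mentioned above.
static class IndexDocumentResultTask implements Callable<Map.Entry<String, Boolean>> {

    private final String pathFile;

    public IndexDocumentResultTask(String pathFile) {
        this.pathFile = pathFile;
    }

    @Override
    public Map.Entry<String, Boolean> call() {
        Boolean success = new IndexDocumentTask(pathFile).call(); // reuse the indexing logic above
        return new AbstractMap.SimpleEntry<String, Boolean>(pathFile, success);
    }
}

In main you would then read futureTask.get().getKey() for the filename and futureTask.get().getValue() for the success flag.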

Even though sending multiple requests to the Solr server at a time shouldn't be a problem, you may want to throttle your requests so as not to overload the Solr server.
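Note that with the fixed thread pool above, threadPoolSize already bounds how many requests run at once, so lowering it is often enough. If the pool is shared with other work, one simple way to throttle just the Solr calls is a java.util.concurrent.Semaphore, as in this sketch (the permit count of 2 is an arbitrary example):

// Sketch: cap the number of concurrent requests to Solr with a Semaphore.
// The permit count (2) is an arbitrary example value.
private static final Semaphore solrThrottle = new Semaphore(2);

public Boolean indexDocumentThrottled(String pathFile) {
    try {
        solrThrottle.acquire();            // block until a permit is free
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
    }
    try {
        return indexDocument(pathFile);    // the same indexDocument as above
    } finally {
        solrThrottle.release();            // always return the permit
    }
}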
