How can I quickly read data from a huge collection in ArangoDB using the Java driver?

I am evaluating ArangoDB (version 3.2.4) as a replacement for MongoDB. We have a huge collection containing 2.700.000 documents. Next year this collection will grow to nearly 4.000.000 documents.

If I read data from that collection using the Java driver (version 4.2), it takes a long time for the cursor to return the data. The time depends on how many documents are fetched: if I fetch all documents, it takes about 10 minutes before the cursor delivers any data:

AQL:

for doc in myHugeCollection
    RETURN { "name": doc.name }

Java code:

    AqlQueryOptions aqlQueryOptions = new AqlQueryOptions();
    aqlQueryOptions.batchSize(500);
    aqlQueryOptions.count(false);
    aqlQueryOptions.cache(true);

    ArangoCursor<MyHugeCollection> arangoCursor = arangoDatabase.query(
            aqlQuery,
            new HashMap<>(),
            aqlQueryOptions,
            MyHugeCollection.class);

It takes about 10 minutes until I can access the data via the cursor. Because I set the batch size to 500, I expected a quick response, since fetching the first 500 results on their own is extremely fast.

Modified AQL fetching only the first 500 documents:

for doc in myHugeCollection
    limit 500
    RETURN { "name": doc.name }

This query will take about 20 ms.

So, my question is what am I doing wrong? How can I access data in a huge collection without waiting minutes for the cursor?

It depends how you access your cursor.

When you convert it to a List, every document of the result is fetched.

List<MyHugeCollection> asList = arangoCursor.asListRemaining();

When you iterate over it with next() or forEachRemaining() (requires Java 8), you can process the first 500 documents before the next batch is fetched from the database.

while (arangoCursor.hasNext()) {
  MyHugeCollection doc = arangoCursor.next();
  // TODO
}

or

arangoCursor.forEachRemaining(doc -> {
  // TODO
});
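
To illustrate the difference between the two access patterns, here is a minimal, self-contained sketch. It does not use the ArangoDB driver: `BatchedCursor` is a hypothetical stand-in that pretends to fetch documents from a server in batches and counts the simulated round trips. The numbers (10.000 documents, batch size 500) are assumptions chosen only for the demonstration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class BatchedCursorDemo {

    /** Hypothetical stand-in for a server-side cursor; NOT the ArangoDB driver API. */
    static final class BatchedCursor implements Iterator<String> {
        private final int total;        // documents available on the simulated server
        private final int batchSize;    // documents returned per simulated round trip
        private int fetchedSoFar = 0;   // documents already pulled from the "server"
        private int batchRequests = 0;  // number of simulated round trips
        private List<String> batch = new ArrayList<>();
        private int posInBatch = 0;

        BatchedCursor(int total, int batchSize) {
            this.total = total;
            this.batchSize = batchSize;
        }

        // Simulates one round trip that loads the next batch into memory.
        private void fetchNextBatch() {
            batch = new ArrayList<>();
            int end = Math.min(total, fetchedSoFar + batchSize);
            for (int i = fetchedSoFar; i < end; i++) {
                batch.add("doc-" + i);
            }
            fetchedSoFar = end;
            posInBatch = 0;
            batchRequests++;
        }

        @Override
        public boolean hasNext() {
            return posInBatch < batch.size() || fetchedSoFar < total;
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            if (posInBatch == batch.size()) {
                fetchNextBatch(); // only hit the "server" when the current batch is used up
            }
            return batch.get(posInBatch++);
        }

        int batchRequests() {
            return batchRequests;
        }
    }

    /** Reads only the first n documents and reports how many batches were requested. */
    public static int requestsWhenReadingFirst(int n, int total, int batchSize) {
        BatchedCursor cursor = new BatchedCursor(total, batchSize);
        for (int i = 0; i < n && cursor.hasNext(); i++) {
            cursor.next();
        }
        return cursor.batchRequests();
    }

    /** Drains the cursor into a list (like converting it to a List) and reports the batches requested. */
    public static int requestsWhenCollectingAll(int total, int batchSize) {
        BatchedCursor cursor = new BatchedCursor(total, batchSize);
        List<String> all = new ArrayList<>();
        cursor.forEachRemaining(all::add);
        return cursor.batchRequests();
    }

    public static void main(String[] args) {
        System.out.println("first 500 only: " + requestsWhenReadingFirst(500, 10_000, 500) + " batch request(s)");
        System.out.println("collect all:    " + requestsWhenCollectingAll(10_000, 500) + " batch request(s)");
    }
}
```

In this simulation, iterating and stopping after the first 500 documents costs a single batch request, while draining everything into a list costs one request per batch. Whether the real driver and server behave the same way for your query is something you would need to measure.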

It seems you need some kind of asynchronous invocation, so that your code doesn't wait for the whole data set to be returned but can start working as soon as some initial data arrives. Have you tried the Java async driver ( https://github.com/arangodb/arangodb-java-driver-async )? I think you should be able to start doing some work as soon as Arango returns the first result set. Look for this part in the async driver manual:

    db.query(query, bindVars, null, MyObject.class).thenAccept(cursor -> {
      cursor.forEachRemaining(obj -> {
        System.out.println(obj.getName());
      });
    });
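
The `thenAccept` pattern above can be tried without a database using only `java.util.concurrent.CompletableFuture`. The sketch below is an illustration under assumptions: `queryAsync()` is a hypothetical stand-in for an asynchronous query call and simply completes with an iterator over three hard-coded names on another thread. It is not the async driver's API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class AsyncQueryDemo {

    // Hypothetical stand-in for an asynchronous query: the result is produced on another thread.
    static CompletableFuture<Iterator<String>> queryAsync() {
        return CompletableFuture.supplyAsync(
                () -> Arrays.asList("alice", "bob", "carol").iterator());
    }

    /** Registers a callback that consumes the cursor, then waits for it and returns what was seen. */
    public static List<String> collectNames() {
        List<String> seen = new CopyOnWriteArrayList<>();

        // The callback runs once the simulated query has produced its cursor.
        CompletableFuture<Void> done = queryAsync()
                .thenAccept(cursor -> cursor.forEachRemaining(seen::add));

        // The calling thread is free to do other work here.

        // Block only at the point where the results are actually needed.
        done.join();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collectNames());
    }
}
```

The point of the design is that the calling thread is not blocked while the query runs; it only waits at `join()`, and a real application could instead chain further stages onto the future.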

Another hint would be to try the VelocyPack objects provided by the Java driver. But I am not sure whether they are async, as your use case probably requires.
