简体   繁体   中英

MarkLogic Optic API: order-by performance

We have two types of documents: books and book sections. We use TDE to define views for the two types.

Schema (relevant part):

  • view books: id, title
  • view booksections: id, bookid

The use case is to list the books with more than 5000 sections. For each book, title and number of sections should be returned. Using the Optic API, the query with group-by is like this:

op:from-view("myschema", "books") => 
    op:join-inner(op:from-view("myschema", "booksections"), op:on(
        op:view-col("books", "id"),
        op:view-col("booksections",   "bookid"))) => 
    op:group-by(
        (op:view-col("books", "title")), 
        (op:count("count", op:view-col("booksections", "id")))) => 
    op:where(op:ge(op:col("count"), 5000)) => 
    op:select((op:view-col("books", "title"), "count")) => 
    op:order-by(op:desc("count")) => 
    op:result()

The query returns a small result set: 4 books. Now the interesting thing is that this query needs 5 seconds to complete, and if I remove the op:order-by statement, only 3 seconds. Somehow 2 seconds are spent to order the 4 books in the result.

Is there anything I can do to speed up the ordering (except for doing the ordering as a post-processing step)?

The times were measured with warm triple caches. op:explain shows the order-by operation as the outer-most operation, indicating that the ordering is applied to the small set of 4 books. Using SQL has resulted in the same run times and the same acceleration without order-by.

Definitely upgrading to new MarkLogic version might help solve the problem. Despite that, spending two seconds to sort 4 rows of result is definitely not convincing. There is much more happening during the query execution when you add the op:order-by() clause and that should explain the increase in time.

To understand better on what actually happens during the execution of two queries, we should take a look at the query plans(using op:explain()) returned by the server. Based on the statistics on the underlying data and order-by() clause added to the query, the query optimizer might choose a different query plan. Sharing the query plans for the two queries will help us lead to the right direction to help the optimizer choose the better query plan.

You will probably be better off contacting MarkLogic Support with your test case. I'd say that Ramesh is right, and that the query optimizer is picking a sub-optimal query plan for your query.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM