
ArangoDB AQL: can I traverse a graph from multiple start vertices, but ensure uniqueVertices across all traversals?

I have a graph dataset with a large number of relatively small disjoint graphs. I need to find all vertices reachable from a set of vertices matching certain search criteria. I use the following query:

FOR startnode IN nodes
    FILTER startnode._key IN [...set of values...]
    FOR node IN 0..100000 OUTBOUND startnode edges
        COLLECT k = node._key
        RETURN k

The query is very slow, even though it returns the correct result. This is because Arango actually ends up traversing the same subgraphs many times. For example, say there is the following subgraph:

a -> b -> c -> d -> e

When vertices a and c are selected by the filter condition, Arango does two independent traversals, one starting from a and one from c. It visits vertices d and e during both of them, which wastes time. Adding the uniqueVertices option doesn't help, because vertex uniqueness is not checked across different traversals.
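The redundant work can be illustrated outside the database. The following is a minimal in-memory sketch in Python, not ArangoDB code: the EDGES dictionary and both function names are hypothetical, and the visit counter only approximates what the server does. It compares one traversal per start vertex against a single traversal that shares its visited set across all start vertices.

```python
from collections import deque

# Hypothetical in-memory copy of the example subgraph a -> b -> c -> d -> e.
EDGES = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": []}


def traverse_independently(edges, starts):
    """One BFS per start vertex; uniqueness is tracked per traversal only."""
    reached, visits = set(), 0
    for start in starts:
        seen = {start}  # reset for every start vertex
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            visits += 1
            reached.add(vertex)
            for neighbour in edges.get(vertex, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return reached, visits


def traverse_shared(edges, starts):
    """A single BFS seeded with all start vertices and one shared visited set."""
    seen = set(starts)
    queue = deque(seen)
    visits = 0
    while queue:
        vertex = queue.popleft()
        visits += 1
        for neighbour in edges.get(vertex, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen, visits
```

With start vertices a and c, both functions reach the same five vertices, but the independent version counts 8 visits (5 from a plus 3 from c) while the shared version counts 5.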

To confirm the impact on performance, I created an extra root document and added links from it to all the documents found by my filter:

FOR startnode IN nodes
    FILTER startnode._key IN [...set of values...]
    INSERT { _from: 'fakeVertices/0', _to: startnode._id } IN fakeEdges

Now the following query runs 4x faster than my original query, while producing the same result:

FOR node IN 1..1000000 OUTBOUND 'fakeVertices/0' edges, fakeEdges
    OPTIONS { uniqueVertices: 'global', bfs: true }
    COLLECT k = node._key
    RETURN k

Unfortunately, I cannot create a fake vertex and edges for all of my queries, because creating them takes even more time.

My question is: does Arango provide a way to ensure uniqueness of the vertices visited across all traversals in a given query? If not, is there a better way to solve the problem described above?

From what I understand, this is what the uniqueVertices option is for, but uniqueness is tracked per traversal: on each iteration of the outer FOR statement, it only covers the traversal from that start node and knows nothing about the traversals already run from other start nodes. As a result, each new start node traverses a lot of vertices again.

Just throwing this at the wall to see if it sticks, but what about a combination of the two queries, adding OPTIONS to the original?

FOR startnode IN nodes
    FILTER startnode._key IN [...set of values...]
    FOR node IN 0..100000 OUTBOUND startnode edges
        OPTIONS { uniqueVertices: 'global', bfs: true }
        COLLECT k = node._key
        RETURN k

Also, I would highly recommend a named graph instead of specifying edge collections. Not only is it far more flexible, it allows you to use shortest-path calculations as well, which might help here.
