ArangoDB AQL Filtering Using Edges and Vertices with Unknown Positions in Graph Traversal Path

I have a generic graph structure where I need to find non-leaf nodes in the graph based on their connections to other nodes in the graph. The position of the node I want to return is not defined, and it is possible there are multiple paths to the node I want to return. I want to run a single query to return a bunch of items I am displaying in a sorted list to a client. I do not want to have to run multiple asynchronous queries and sort on the client side.

This list is filtered based on the edges that connect the vertices, or on whether a node is connected to another node. The filter conditions are updated on the client side, which results in the query being reconstructed and the database re-queried. The position in the graph of the nodes that need to be returned is not guaranteed to be the same for all results; they may be leaf nodes, or anywhere in the path. The vertices I want to return can be identified via attributes on the edges leading to them or away from them. Each edge also has a date attribute that is used for sorting and a type attribute that is used for filtering.

Imagine a graph 'myGraph' such as the one I have attempted to illustrate below.

------- 
| v:1 |\
------- \
   | \   \ -------
   |  |   \| v:4 |\
   |  \    ------- \
   |   |  /   ^     \ -------
   |    \/    |      \| v:7 |
   |    /|  return    -------   
   |   /  \             
   |  /   |              
-------   \
| v:2 |\   |
------- \   \
   |     \ -------
   |      \| v:5 |\
   |       ------- \
   |                \ -------
   |                 \| v:8 |\
   |                  ------- \ 
   |                     ^     \ -------
   |                     |      \| v:10|
-------                return    -------   
| v:3 |\   
------- \   
         \ -------
          \| v:6 |\
           ------- \
                    \ -------
                     \| v:9 |
                      -------
                         ^
                         | 
                       return

The above diagram illustrates what I want to return given one set of filtering conditions, but the returned results can vary if I change the filtering conditions. The nodes I want to return are easily identified based on the attributes on the edges leading to them or away from them.

I have a query that looks something like the following, but I am having trouble finding a way to index the nodes in the path whose incoming or outgoing edges meet specific filtering criteria.

FOR item in vertexCollection1
   FILTER .... // FILTER the vertices
   FOR v, e, p IN 1..4 OUTBOUND item._id GRAPH 'myGraph'
      // ?? Not sure how to efficiently return from here
      // ?? FILTER p.vertices[??].v == 7 OR p.vertices[??].v == 10
      // ?? FILTER p.edges[??].type == "type1" OR p.edges[??].type == "type2"... etc based on user selections
      // ?? LET date = p.edges[vertexPosition - 1].date 
      // ?? LET data = p.vertices[??]
      // SORT DATE_TIMESTAMP(date) DESC
      // RETURN {date: date, data: data}

I am currently using a [ ** ] operation to get the specific node based on what collection it resides in using something like the following:

LET data = p.vertices[ ** FILTER CONTAINS(CURRENT._id, "collectionName") OR ...]

but this is awkward and requires the vertices to be placed in specific collections to facilitate query construction. This also does not solve the problem of how to index the associated edges connecting to the node I want to return.

I apologize if this question is answered elsewhere; if it is, a pointer to the answer is appreciated. I am not sure of the correct terminology to concisely describe the problem and search accordingly.

Thanks!

I was able to get the behavior I needed using a query structured similarly to the following:

LET events = (
FOR v, e, p IN 1..3 OUTBOUND 'collection/document_id' GRAPH 'myGraph' OPTIONS {"uniqueEdges": "global"}
    FILTER .... // Filter the vertices
    LET children = (
        FOR v1, e1, p1 IN 1..1 OUTBOUND v._id GRAPH 'myGraph'
            FILTER e1.type == "myEventType" OR ... // Filter immediate neighbors I care about
            SORT(e1.date)  // I have date timestamps on everything
            RETURN { child: v1._id, ... /* other child attributes as needed */ }
    )

    // FILTER .... conditions on children if necessary in context of v

    RETURN DISTINCT { data: v, children: children, ... /* other attributes as needed */ }
)

FOR event IN events
    SORT(event.date) // I need chronological sorting and have date attribute on every node
    RETURN event

The DISTINCT modifier on the RETURN clause appeared to remove duplicates that resulted from multiple paths to the same node and I was able to add the custom filters I needed based on the attributes on the various children nodes and the parent node.
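
For vertices that are identified by the edge leading to them, the traversal variables can also be used directly: at every depth, `v` is the vertex just reached and `e` is the edge that led to it, so no position in `p.edges` has to be known. The following is only a minimal sketch under that assumption, not a tested replacement for the query above. `@start` and `@types` are hypothetical bind parameters (the start vertex and the edge types selected by the user), and `COLLECT` is swapped in for `RETURN DISTINCT` so that each vertex appears once with its most recent incoming edge date.

```aql
// Sketch only: @start and @types are hypothetical bind parameters.
FOR v, e IN 1..4 OUTBOUND @start GRAPH 'myGraph'
    // e is the edge that leads into v at the current depth,
    // so the incoming edge can be tested without indexing p.edges.
    FILTER e.type IN @types
    // One group per vertex, even if several paths reach it;
    // keep the latest date among the matching incoming edges.
    COLLECT vertex = v AGGREGATE latest = MAX(DATE_TIMESTAMP(e.date))
    SORT latest DESC
    RETURN { date: latest, data: vertex }
```

Vertices that are identified by edges leading away from them are not covered by this sketch; they would still need something like the nested 1..1 traversal in the query above.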

I am not sure if this is the best or proper approach, but it works for my use case. If there are corrections or optimizations to be made please let me know.

Thanks!

--- Update on Performance

I am currently testing in a graph with approximately 700,000 documents and 2,000,000 edges. The filter conditions are added to the query dynamically based on user selections in a web app, and the performance of the query depends greatly on which conditions are added. With no filters, or only very broad ones, the query can take over a second to execute (on our test hardware). With very restrictive filters, it can execute in milliseconds. However, the default and most common use case is the slower version of the query.

I am only working with a small selection of data; we expect the number of documents and edges to grow into the tens of millions, so performance as we scale up is very much a concern. I have segmented the database into multiple graphs to reduce the number of nodes and edges any individual query can scan, but have not yet identified other optimizations that would let the query scale with the dataset. We are working on improving our data-import infrastructure to grow the dataset, but that effort is not complete, so I don't yet have performance numbers for a database more representative of our expected configuration.
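
One option that may be worth testing for the broad-filter case is to stop the traversal early instead of filtering after the fact. The sketch below assumes an ArangoDB version that supports the `PRUNE` keyword, and it assumes that nothing beyond a matching vertex needs to be returned, which may not hold for every filter combination. `@start` and `@types` are hypothetical bind parameters, and the query has not been measured against the dataset described above.

```aql
// Sketch only: @start and @types are hypothetical bind parameters.
FOR v, e IN 1..4 OUTBOUND @start GRAPH 'myGraph'
    // Stop following edges beyond a vertex whose incoming edge matches.
    // The matching vertex itself is still emitted.
    PRUNE e.type IN @types
    // PRUNE does not remove non-matching vertices from the result,
    // so the same condition is repeated as a FILTER.
    FILTER e.type IN @types
    SORT DATE_TIMESTAMP(e.date) DESC
    RETURN { date: e.date, data: v }
```

Whether this helps depends on how deep the matching vertices sit. If matches are mostly at the maximum depth, the traversal still visits nearly the same number of edges.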
