
How to find outlier nodes with more than N edges in Gremlin?

I'm trying to figure out how to find outliers in our graph, in particular nodes with more than N edges, where N could be some high number. Our graph has over 2 billion nodes. Is there an efficient way to do this?

At that scale you are probably going to want to multithread the queries and send requests to the server in batches. A good approximation for the number of client threads is two times the number of vCPUs on the server. If you are able to send lists of IDs, that will be most efficient; otherwise you will need to do a lot of range() steps. Each thread would then run something like the query below for multiple batches of IDs:

g.V(<list of IDs>).filter(out().count().is(gt(x)))

You would then collect all the outliers in the application. I think you should approach this as a bit of a batch task that may take a while to complete.
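
Here is a minimal sketch of that batched approach in Python using gremlinpython, assuming you can enumerate the vertex IDs in batches. The endpoint, threshold, and thread count are placeholders, and the sketch uses where() with an anonymous traversal, which filters the same way as the filter() step does for this predicate.

from concurrent.futures import ThreadPoolExecutor
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P

ENDPOINT = 'wss://your-neptune-endpoint:8182/gremlin'  # placeholder
THRESHOLD = 100000                                      # the "N" from the question
NUM_THREADS = 16                                        # roughly 2x the server vCPUs

def check_batch(id_batch, endpoint=ENDPOINT):
    """Return the IDs in this batch whose out-degree exceeds THRESHOLD."""
    conn = DriverRemoteConnection(endpoint, 'g')
    try:
        g = traversal().withRemote(conn)
        heavy = (g.V(*id_batch)
                  .where(__.out().count().is_(P.gt(THRESHOLD)))
                  .toList())
        return [v.id for v in heavy]
    finally:
        conn.close()

def find_outliers(id_batches):
    """Fan the batches out across a thread pool and collect the outliers."""
    outliers = []
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        for batch_result in pool.map(check_batch, id_batches):
            outliers.extend(batch_result)
    return outliers

A production version would reuse one connection per worker thread rather than opening one per batch, but the overall shape of the loop is the same.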

The alternative would be to use Neptune Export to export the graph, load it into Spark, and run a degree query using something like GraphFrames.
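
If you go the export route, a minimal PySpark sketch of the degree query might look like the following, assuming the export is in the Gremlin CSV style with ~id, ~from and ~to headers; the S3 paths and threshold are placeholders, and GraphFrames has to be available to the Spark job (for example via --packages).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from graphframes import GraphFrame

spark = SparkSession.builder.appName("degree-outliers").getOrCreate()

# The paths and header names below are assumptions about the export layout.
vertices = (spark.read.option("header", "true").csv("s3://bucket/export/nodes/")
                 .withColumnRenamed("~id", "id"))
edges = (spark.read.option("header", "true").csv("s3://bucket/export/edges/")
              .withColumnRenamed("~from", "src")
              .withColumnRenamed("~to", "dst"))

gf = GraphFrame(vertices, edges)

# degrees counts edges in both directions; use outDegrees or inDegrees
# if only one direction matters.
threshold = 100000
outliers = gf.degrees.filter(F.col("degree") > threshold)
outliers.write.csv("s3://bucket/outliers/", header=True)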

With a reasonably large instance I think the technique of using multiple threads will work, especially if you can easily generate the list of vertex IDs to target in each query. Spreading the queries across multiple read replicas will also speed things up.
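
If you do have several read replicas, one hypothetical way to spread the load is to assign batches to the replica endpoints round-robin, reusing check_batch(id_batch, endpoint) and the thread-pool settings from the sketch above; the endpoint names here are placeholders.

from concurrent.futures import ThreadPoolExecutor

REPLICA_ENDPOINTS = [
    'wss://replica-1.cluster-ro.example:8182/gremlin',  # placeholder endpoints
    'wss://replica-2.cluster-ro.example:8182/gremlin',
]

def find_outliers_across_replicas(id_batches):
    """Round-robin the batches across the replica endpoints."""
    outliers = []
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        futures = [pool.submit(check_batch, batch,
                               REPLICA_ENDPOINTS[i % len(REPLICA_ENDPOINTS)])
                   for i, batch in enumerate(id_batches)]
        for f in futures:
            outliers.extend(f.result())
    return outliers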
