
Neo4j Cypher: Finding the largest disconnected subgraph fast

I have a graph with one million nodes that contains many disconnected subgraphs. I would like to find the size of the largest disconnected subgraph.

For instance, the example graph contains three disconnected subgraphs, so in this case the output would be 7.

I tried this, but it takes a long time:

match p = ()-[*]-() return MAX(length(p)) as l order by l desc limit 1

Your query will only ever return the longest path between two separate nodes, not the size of the largest connected subgraph.
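To make the difference concrete, here is a minimal plain-Python sketch that runs outside Neo4j. The toy edge list and the function names are made up for illustration and are not taken from the question's graph. It computes both numbers for a small graph with three disconnected subgraphs: the longest path that never reuses a relationship (roughly what the variable-length pattern measures) and the node count of each subgraph.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Undirected adjacency sets built from (a, b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def component_sizes(adj):
    """Node count of every connected component, found by breadth-first search."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adj[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        sizes.append(size)
    return sizes


def longest_trail_length(adj):
    """Length (in relationships) of the longest path that reuses no relationship."""
    best = 0

    def walk(node, used, length):
        nonlocal best
        best = max(best, length)
        for neighbour in adj[node]:
            edge = frozenset((node, neighbour))
            if edge not in used:
                walk(neighbour, used | {edge}, length + 1)

    for start in adj:
        walk(start, frozenset(), 0)
    return best


# Hypothetical toy graph: a 7-node star, a 4-node chain and a 2-node pair.
star = [(1, leaf) for leaf in range(2, 8)]
chain = [(8, 9), (9, 10), (10, 11)]
pair = [(12, 13)]
adj = build_adjacency(star + chain + pair)

print(sorted(component_sizes(adj), reverse=True))  # → [7, 4, 2]
print(longest_trail_length(adj))                   # → 3
```

In this toy graph the longest path (3 relationships) lies in the 4-node chain, while the largest subgraph is the 7-node star, whose longest path is only 2 relationships long. The two measures need not even point at the same subgraph.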

Unfortunately, Neo4j does not currently have any native support for subgraph operations, and I don't think APOC Procedures has anything for this either.

There are ways to find subgraphs in Cypher, but the queries I can think of are not performant and are likely to time out on large graphs. Here's one. Again, this is not recommended, as it is likely to time out for you, but if it works, awesome:

MATCH (n)-[*0..]-(subgraphNode)
WITH n, COUNT(DISTINCT subgraphNode) as subSize
RETURN MAX(subSize)

If this query will be run regularly rather than only once, then I'd recommend a means of tracking your subgraphs.

While I can give an approach to creating subgraph tracking, keeping it updated across graph operations (those that merge subgraphs, divide them into smaller subgraphs, or create new subgraphs) is bound to be trickier, and you'll likely need some kind of Java extension that performs post-transaction processing to maintain it.

Also, this approach is best carried out during a maintenance window when no write operations are occurring.

The end goal is to attach a single :Subgraph node to every disconnected subgraph, which will make future operations on subgraphs much easier, including your case of finding the largest disconnected subgraph.

The overall approach is to first label all nodes in your graph (with a label like :Unprocessed). Then, in batched queries over the :Unprocessed nodes, find the entire disconnected subgraph each node is part of, attach a single :Subgraph node to it, and remove the :Unprocessed label from every node in that subgraph.

So, first, label all nodes in your db:

MATCH (n)
SET n:Unprocessed

Next, the batch operation. You'll want to use APOC Procedures to allow batch processing. Batching also takes advantage of entire subgraphs losing the :Unprocessed label as we process them, so we don't redundantly perform operations on subgraphs.

CALL apoc.periodic.commit("
// only process a batch of :Unprocessed nodes at a time
MATCH (n:Unprocessed)
WITH n LIMIT $limit
// subgraphNode will be all nodes in the subgraph including n
MATCH (n)-[*0..]-(subgraphNode)
WITH DISTINCT n, subgraphNode
REMOVE subgraphNode:Unprocessed
// find attach point node in each subgraph with smallest id
WITH n, min(id(subgraphNode)) as attachId
WITH DISTINCT attachId
MATCH (attachNode)
WHERE id(attachNode) = attachId
CREATE (attachNode)<-[:SUBGRAPH]-(:Subgraph)
RETURN count(*)
",{limit:100})

You can adjust the limit as necessary. A lower limit might actually work better, as this may reduce redundant operations on nodes of the same subgraph.
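To see why a lower limit can reduce redundant work, here is a rough plain-Python model of the batched labelling. It is a sketch under simplifying assumptions, not a description of how Neo4j executes the query: the function names and the toy graph are hypothetical, a set stands in for the :Unprocessed label, and nodes are picked in sorted order purely to keep the run deterministic.

```python
from collections import deque


def reachable(adj, start):
    """All nodes in the same connected component as start (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen


def attach_subgraph_markers(adj, limit):
    """Toy model of the batched labelling; returns (attach ids, expansions)."""
    unprocessed = set(adj)   # stands in for the :Unprocessed label
    attach_ids = set()       # stands in for the created :Subgraph nodes
    expansions = 0           # how many times a whole subgraph was traversed
    while unprocessed:
        batch = sorted(unprocessed)[:limit]   # MATCH (n:Unprocessed) ... LIMIT
        batch_attach_ids = set()              # WITH DISTINCT attachId
        for n in batch:
            component = reachable(adj, n)     # every batch node is expanded
            expansions += 1
            unprocessed -= component          # REMOVE subgraphNode:Unprocessed
            batch_attach_ids.add(min(component))
        attach_ids |= batch_attach_ids
    return attach_ids, expansions


# Hypothetical graph: subgraphs {1, 2, 3} and {4, 5}, plus the isolated node 6.
adj = {1: {2, 3}, 2: {1}, 3: {1}, 4: {5}, 5: {4}, 6: set()}

for limit in (1, 100):
    ids, expansions = attach_subgraph_markers(adj, limit)
    print(limit, sorted(ids), expansions)
```

In this model both runs attach one marker per subgraph (at nodes 1, 4 and 6). With a limit of 1 each subgraph is traversed once, for 3 expansions. With a limit of 100 all six nodes land in the same batch and each is expanded before any label removal can take effect, for 6 expansions.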

Now that all disconnected subgraphs have a :Subgraph node attached, you can make faster and easier queries for each subgraph. So, to find the largest disconnected subgraph, you might use:

MATCH (sub:Subgraph)-[*]-(subgraphNode)
WITH sub, COUNT(DISTINCT subgraphNode) as subSize
RETURN MAX(subSize)

EDIT

I found a faster means of gathering subgraph nodes than a variable-length relationship match. APOC's Path Expander functionality, using NODE_GLOBAL uniqueness, should perform faster. Here are the relevant queries, modified to use this approach.

CALL apoc.periodic.commit("
// only process a batch of :Unprocessed nodes at a time
MATCH (n:Unprocessed)
WITH n LIMIT $limit
// subgraphNode will be all nodes in the subgraph including n
CALL apoc.path.expandConfig(n,{bfs:true, uniqueness:'NODE_GLOBAL'})
  YIELD path
WITH n, LAST(NODES(path)) as subgraphNode
REMOVE subgraphNode:Unprocessed
// find attach point node in each subgraph with smallest id
WITH n, min(id(subgraphNode)) as attachId
WITH DISTINCT attachId
MATCH (attachNode)
WHERE id(attachNode) = attachId
CREATE (attachNode)<-[:SUBGRAPH]-(:Subgraph)
RETURN count(*)
",{limit:100})

And the processing for each subgraph:

MATCH (sub:Subgraph)
CALL apoc.path.expandConfig(sub,{minLevel:1, bfs:true, uniqueness:"NODE_GLOBAL"}) 
  YIELD path
WITH sub, LAST(NODES(path)) as subgraphNode
WITH sub, COUNT(DISTINCT subgraphNode) as subSize
RETURN MAX(subSize)
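As for why NODE_GLOBAL uniqueness should help: a variable-length match enumerates every path that does not reuse a relationship, while an expansion with NODE_GLOBAL uniqueness visits each node at most once. The following plain-Python sketch models both behaviours on a small, densely connected graph. The function names and the graph are hypothetical, and this is only a model of the two traversal strategies, not APOC's implementation.

```python
from collections import deque


def count_trails(adj, start):
    """Number of paths from start (length >= 0) that reuse no relationship."""
    def walk(node, used):
        total = 1  # the path that ends at this node
        for neighbour in adj[node]:
            edge = frozenset((node, neighbour))
            if edge not in used:
                total += walk(neighbour, used | {edge})
        return total

    return walk(start, frozenset())


def count_global_visits(adj, start):
    """Number of nodes reached when every node is visited at most once (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return len(seen)


# Hypothetical densely connected graph: 5 nodes with every pair linked.
nodes = range(5)
adj = {n: {m for m in nodes if m != n} for n in nodes}

print(count_trails(adj, 0))         # paths a variable-length match would walk
print(count_global_visits(adj, 0))  # nodes a NODE_GLOBAL expansion would visit
```

Both functions reach the same 5 nodes, but the first walks far more paths to get there, and the gap grows quickly as subgraphs become larger and more densely connected.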
