Slow merging of subgraph with neo4j and py2neo

I am working on a project where I need to perform many merge operations of subgraphs onto a remote graph. Some elements of the subgraph may already exist in the remote graph. I am using py2neo v3 and neo4j.

I tried using both the create and the merge functions of py2neo, and with both I get surprisingly bad performance. Even more surprising, the time taken to merge the subgraph seems to grow quadratically with both the number of nodes and the number of relationships! When the subgraph is too big, the transaction hangs. One thing I should say is that I checked, and it is not py2neo that generates a number of cypher statements that grows quadratically with the size of the subgraph. So if something is wrong, it is either with how I am using those technologies, or with neo4j's implementation. I also tried looking at the query plans for the queries generated by py2neo, and did not find any answer in them as to why the query times grow so dramatically, but don't take my word for it since I am relatively uninitiated.

I could hardly find any relevant information online, so I tried conducting a proper benchmark where I compared the performance as a function of the number of nodes and the topology of the subgraph, depending on whether I use the merge or create operation and whether I use unique constraints or not. I include below some of the results I got for graphs with a "linear" topology, meaning that the number of relationships is roughly the same as the number of nodes (it doesn't grow quadratically). In my benchmark, I use 5 different types of labels for nodes and relationships that I assign randomly, and reuse 30% of nodes that already exist in the remote graph. The nodes I create have only one property that acts as an identifier, and I report the performance depending on whether I add a unique constraint on this property or not. All the merging operations are run within a single transaction.
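For reference, the benchmark's test data can be generated roughly like this. This is a stdlib-only sketch of my own: the label/relationship-type names, the pool of pre-existing identifiers, and the helper name are assumptions for illustration; the actual benchmark code is in the gist linked below.

```python
import random

LABELS = ["L0", "L1", "L2", "L3", "L4"]        # 5 node label types, assigned randomly
REL_TYPES = ["R0", "R1", "R2", "R3", "R4"]     # 5 relationship types
REUSE_RATIO = 0.3                              # fraction of nodes assumed to already exist remotely

def make_linear_subgraph(n_nodes, n_existing=1000, seed=0):
    """Build a 'linear'-topology subgraph: roughly one relationship per node.

    Nodes are plain dicts carrying a single identifying property; about 30%
    of the identifiers are drawn from a pool of ids assumed to already be
    present in the remote graph.
    """
    rng = random.Random(seed)
    nodes = []
    for i in range(n_nodes):
        if rng.random() < REUSE_RATIO:
            ident = rng.randrange(n_existing)   # reuse an id from the existing pool
        else:
            ident = n_existing + i              # fresh id, guaranteed outside the pool
        nodes.append({"label": rng.choice(LABELS), "id": ident})
    # Linear topology: node i is linked to node i+1, so |rels| = |nodes| - 1.
    rels = [(i, rng.choice(REL_TYPES), i + 1) for i in range(n_nodes - 1)]
    return nodes, rels

nodes, rels = make_linear_subgraph(100)
```

Each `(i, rel_type, j)` triple refers to node positions, so the same structure can be fed to py2neo or to raw Cypher afterwards.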

[Figure] Query times for graphs with a linear topology as a function of the number of nodes, using the py2neo create function

[Figure] Query times for graphs with a linear topology as a function of the number of nodes, using the py2neo merge function

As you can see, the time taken seems to grow quadratically with the number of nodes (and relationships).

The question I am having a hard time answering is whether I am doing something wrong, or not doing something that I should, or if this is the kind of performance we should expect from neo4j for these kinds of operations. Regardless, it seems that what I could do to alleviate this performance issue is to never try merging big subgraphs all at once, but rather start by merging the nodes batch by batch, then the relationships. This could and would work, but I want to get to the bottom of this, if someone has any recommendation or insight to share.
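The batch-by-batch workaround mentioned above maps naturally onto parameterized Cypher with UNWIND: one query per node label, then one per relationship type, each driven by a list of rows. Below is a hedged sketch of the query-building side only; the row format, the batch size, and the helper names are my own assumptions, and with py2neo v3 each batch would be executed with something like graph.run(query, rows=batch) inside a transaction.

```python
def node_merge_query(label):
    # One MERGE per node label, driven by a list parameter. With a unique
    # constraint on :<label>(id), each MERGE becomes an index lookup
    # instead of a scan.
    return ("UNWIND $rows AS row "
            f"MERGE (n:{label} {{id: row.id}})")

def rel_merge_query(label_a, rel_type, label_b):
    # Match both endpoints by their identifier, then MERGE the relationship.
    return ("UNWIND $rows AS row "
            f"MATCH (a:{label_a} {{id: row.a}}) "
            f"MATCH (b:{label_b} {{id: row.b}}) "
            f"MERGE (a)-[:{rel_type}]->(b)")

def batches(rows, size=1000):
    # Split parameter rows into fixed-size chunks, one query execution each,
    # so no single statement grows with the size of the whole subgraph.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]
```

The point of this shape is that the number of Cypher statements stays small and constant (one per label/type per batch) while the data moves into parameters, rather than emitting one MERGE clause per element of the subgraph.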

Edit

Here is a link to a gist to reproduce the results above, and others: https://gist.github.com/alreadytaikeune/6be006f0a338502524552a9765e79af6

Edit 2

Following Michael Hunger's questions:

In the code I shared, I tried to write a formatter for the neo4j.bolt logs in order to capture the queries that are sent to the server. However, I don't have a systematic way to generate query plans for them.

I did not try without Docker, and I don't have an SSD. However, considering the size I allocate for the JVM and the size of the graph I am handling, everything should fit in RAM.

I use the latest Docker image for neo4j, so the corresponding version seems to be 3.3.5.

Unfortunately, the merge routine (and a few others) in v3 is a little naive and doesn't scale well. I have alternatives planned for py2neo v4 that build much more efficient queries instead of (in the case of merge) arbitrarily long sequences of MERGE statements. Version 4 should be released at some point next month (May 2018).
