Cypher MATCH query speed

Question

I have Neo4j installed on a Windows machine with 12 processors and 64GB of RAM. I did not change any of the memory settings that Neo4j exposes.

My database has 3.8 million nodes, 210,000 of which are labeled Geotagged, and a total of 650,000 relationships. I am trying to run the following query, and I am wondering whether it is a really intensive query that will likely take quite a while.

Messages.csv is my relationship file. The relationships have already been created, but because I could not figure out how to combine the relationship creation with the Distance calculation below, I am loading and running through the relationship file twice.

USING PERIODIC COMMIT 15000
LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
MATCH (a:Geotagged { username: line.sender }) - [r:MSGED] -> (b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));

The initial relationship generation takes about 3-5 minutes. I let the query above run for over an hour and it still had not completed. I ran a similar algorithm (though it had a few more trig calls in it) on the same initial db, let it run for over 18 hours, and it still had not completed.

My question: Is this a very intensive query? Am I not giving it enough time? And more importantly, is there a way I can optimize this?

I tried adding "WHERE NOT HAS(r.Distance)" to exclude node pairs on which the algorithm has already set the Distance, though I am unsure whether the MATCH is a one-time match or whether it runs once for each line in the CSV file.

Any thoughts on this would really be appreciated.

Answer 1

One way I would start to debug this is to put a limit on it using WITH:

USING PERIODIC COMMIT 15000
LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 100
MATCH (a:Geotagged { username: line.sender }) - [r:MSGED] -> (b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));

With that you can change the LIMIT value to see how performance degrades as the limit increases.

Also, is the username property indexed for the Geotagged label? If not, it definitely should be, like this:

CREATE INDEX ON :Geotagged(username)

If it's unique and you want the database to enforce that:

CREATE CONSTRAINT ON (g:Geotagged) ASSERT g.username IS UNIQUE
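
While testing with a small LIMIT, it can also help to spot-check a few of the Distance values that get written. The following is a minimal Python sketch that mirrors the expression in the SET clause, assuming statusLat and statusLon hold degrees (possibly stored as strings, hence the float conversion). The helper names haversin and distance_km are made up for this illustration and are not part of Neo4j or any driver.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same constant as in the Cypher expression


def haversin(x):
    """Half the versine of x (radians): (1 - cos x) / 2, as Cypher's haversin()."""
    return (1.0 - math.cos(x)) / 2.0


def distance_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in km, term for term like the SET expression.

    Inputs are degrees and may be strings, mirroring the toFloat() calls.
    """
    lat_a, lon_a, lat_b, lon_b = (float(v) for v in (lat_a, lon_a, lat_b, lon_b))
    h = (haversin(math.radians(lat_b - lat_a))
         + math.cos(math.radians(lat_b)) * math.cos(math.radians(lat_a))
         * haversin(math.radians(lon_b - lon_a)))
    # Guard against rounding pushing h slightly above 1 for near-antipodal points.
    h = min(1.0, h)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))
```

For example, distance_km("0", "0", "0", "90") gives a quarter of the great circle, roughly 10007.5 km, which is a quick way to confirm the formula behaves as expected before comparing against r.Distance for a handful of relationships.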

Answer 2

This is in addition to Brian's reply:

Your statement's query plan shows an Eager operator. To verify, run:

EXPLAIN LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 100
MATCH (a:Geotagged { username: line.sender }) - [r:MSGED] -> (b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 *asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));

(Image: query plan containing an Eager operator)

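Eagerness in LOAD CSV is pretty bad, because the Eager operator pulls in all rows before the next step runs, which undermines USING PERIODIC COMMIT on a large file. See these blog posts for why.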

Following Mark's suggestion and replacing the MATCH/SET with a MERGE ... ON MATCH SET, we can refactor that into:

explain LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 100
MATCH (a:Geotagged { username: line.sender }), (b:Geotagged { username: line.recipient })
MERGE (a)-[r:MSGED]->(b)
ON MATCH SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));

And the Eager operator has vanished from the query plan.

Answer 1
2 2015-04-08 18:24:54

Answer 2
2 2015-04-08 19:42:26
