Neo4j Cypher query to find nodes that are not connected too slow

Given we have the following Neo4j schema (simplified, but it shows the important point). There are two types of nodes, NODE and VERSION. VERSIONs are connected to NODEs via a VERSION_OF relationship. VERSION nodes have two properties, from and until, that denote the validity timespan; either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited. NODEs can be connected via a HAS_CHILD relationship. Again, these relationships have two properties, from and until, that denote the validity timespan; either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited.

EDIT: The validity dates on VERSION nodes and HAS_CHILD relationships are independent (even though the example coincidentally shows them being aligned).

The example shows two NODEs, A and B. A has two VERSIONs, AV1 until 6/30/17 and AV2 starting from 7/1/17, while B has only one version, BV1, which is unlimited. B is connected to A via a HAS_CHILD relationship until 6/30/17.

The challenge now is to query the graph for all nodes that aren't a child (that is, that are root nodes) at one specific moment in time. Given the example above, the query should return just B if the query date is e.g. 6/1/17, but it should return B and A if the query date is e.g. 8/1/17 (because A isn't a child of B any more as of 7/1/17).

The current query is roughly similar to this one:

MATCH (n1:NODE)
OPTIONAL MATCH (n1)<-[c]-(n2:NODE), (n2)<-[:VERSION_OF]-(nv2:ITEM_VERSION)
WHERE (c.from <= {date} <= c.until)
AND (nv2.from <= {date} <= nv2.until)
WITH n1 WHERE c IS NULL 
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
RETURN n1, nv1 
ORDER BY toLower(nv1.title) ASC 
SKIP 0 LIMIT 15

This query works relatively fine in general, but it gets slow as hell on large datasets (comparable to real production datasets). With 20-30k NODEs (and about twice the number of VERSIONs) the (real) query takes roughly 500-700 ms on a small Docker container running on Mac OS X, which is acceptable. But with 1.5M NODEs (and about twice the number of VERSIONs) the (real) query takes a little more than 1 minute on a bare-metal server (running nothing other than Neo4j). This is not really acceptable.

Do we have any option to tune this query? Are there better ways to handle the versioning of NODEs (which I doubt is the performance problem here) or the validity of relationships? I know that relationship properties cannot be indexed, so there might be a better schema for handling the validity of these relationships.

Any help or even the slightest hint is greatly appreciated.

EDIT after answer from Michael Hunger:

  1. Percentage of root nodes:

    With the current example data set (1.5M nodes) the result set contains about 2k rows. That's less than 1%.

  2. ITEM_VERSION node in first MATCH:

    We're using the ITEM_VERSION nv2 to filter the result set to ITEM nodes that have no connection to other ITEM nodes at the given date. That means that either no relationship must exist that is valid for the given date, or the connected item must not have an ITEM_VERSION that is valid for the given date. I'm trying to illustrate this:

     // date 6/1/17

     // n1 returned because relationship not valid
     (nv1 ...)->(n1)-[X_HAS_CHILD ...6/30/17]->(n2)<-(nv2 ...)

     // n1 not returned because relationship and connected item n2 valid
     (nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...)

     // n1 returned because connected item n2 not valid even though relationship is valid
     (nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...6/30/17)

  3. No use of relationship-types:

    The problem here is that the software features a user-defined schema and ITEM nodes are connected by custom relationship-types. As we can't have multiple types/labels on a relationship, the only common characteristic of these kinds of relationships is that they all start with X_. That's been left out of the simplified example here. Would searching with the predicate type(r) STARTS WITH 'X_' help here?

What Neo4j version are you using?

What percentage of your 1.5M nodes will be found as roots at your example date, and if you don't have the limit, how much data comes back? Perhaps the issue is not so much in the match as in the sorting at the end?

I'm not sure why you had the VERSION nodes in your first part; at least you don't describe them as relevant for determining a root node.

You didn't use relationship-types.

MATCH (n1:NODE) // matches 1.5M nodes
// has to do 1.5M * degree optional matches
OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-(n2) WHERE (c.from <= {date} <= c.until)
WITH n1 WHERE c IS NULL
// how many root nodes are left?
// # root nodes * version degree (1..2)
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
// has to sort all those
WITH n1, nv1, toLower(nv1.title) as title
RETURN n1, nv1
ORDER BY title ASC 
SKIP 0 LIMIT 15

I think a good start for improvement would be to match on nodes using an index, so you can quickly get a smaller relevant subset of nodes to search. Your approach right now must inspect all of your :NODEs, plus all of their relationships and the patterns hanging off them, every single time, which, as you've found, won't scale with your data.

Right now the only nodes in your graph with date/time properties are your :ITEM_VERSION nodes, so let's start with those. You'll need an index on :ITEM_VERSION's from and until properties for fast lookup.
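A minimal sketch of those two indexes, assuming a Neo4j 3.x server (the {date} parameter syntax in the queries here suggests that, but the version is not stated). The backticks simply quote the property names.

```cypher
// Assumption: Neo4j 3.x schema syntax; one single-property index per bound.
CREATE INDEX ON :ITEM_VERSION(`from`);
CREATE INDEX ON :ITEM_VERSION(`until`);

// On Neo4j 4.x and later the equivalent form would be:
// CREATE INDEX FOR (v:ITEM_VERSION) ON (v.`from`);
// CREATE INDEX FOR (v:ITEM_VERSION) ON (v.`until`);
```

Running the query with PROFILE afterwards shows whether the planner actually uses an index seek.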

The nulls are going to be problematic for your lookups, as any inequality against a null value returns null, and most workarounds for nulls (using COALESCE() or several ANDs/ORs for the null cases) seem to prevent the use of index lookups, which is the point of my particular suggestion.
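To make the null behaviour concrete, here is a small standalone query. It assumes dates are stored as sortable integers such as 20170601, which the question does not specify.

```cypher
// Assumption: dates are yyyymmdd integers; 99991231 is an arbitrary "far future" value.
RETURN 20170601 <= null                     AS compared_to_null, // null, so a WHERE would drop the row
       20170601 <= coalesce(null, 99991231) AS with_default      // true
```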

I would encourage you to replace your nulls in from and until with min and max values, which should let you take advantage of finding nodes by index lookup:

MATCH (version:ITEM_VERSION)
WHERE version.from <= {date} <= version.until
MATCH (version)<-[:VERSION_OF]-(node:NODE)
...

That should at least provide quick access to a smaller subset of nodes at the start for continuing your query.
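The replacement of nulls suggested above could be done with a one-off update along these lines. The sentinel values 0 and 99991231 are assumptions that only make sense if from and until are stored as yyyymmdd integers; pick values that match the real property type.

```cypher
// Assumption: from/until are yyyymmdd integers; 0 and 99991231 are illustrative sentinels.
MATCH (v:ITEM_VERSION)
WHERE v.from IS NULL
SET v.from = 0;

MATCH (v:ITEM_VERSION)
WHERE v.until IS NULL
SET v.until = 99991231;
```

With millions of versions it may be necessary to run this in batches, and the application has to write the same sentinels for new data instead of leaving the properties unset.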
