Find Path of Specific Length in Large-ish Neo4J Database: Memory Performance

I have a Neo4J instance running with the Neo4J Spatial plugin. In it, I have a graph of around 3.5k nodes, each with the same label, which we'll call Basket. Each Basket relates to a physical location in the same city, and the density of these Baskets varies widely. I have calculated walking times between each Basket and all of its neighbours within 600 m, and stored these as non-spatial (directed) relationships between nodes. Thus, some Baskets sit inside what seems to be a big cluster, while others are almost on their own, with only one or almost no relationships to other Baskets.

My users have a problem: they wish to begin in one place and end in another, visiting an arbitrary, user-defined number of Baskets along the way. My program aims to provide a few route options for the user (as a sequence of nodes; I'll sort out the actual how-to-walk-there part later) by calculating the n shortest paths.

I've written a Cypher query to do this, below.

START a = node(5955), b = node(6497)
WITH a, b
MATCH p = ((a)-[r:IS_WALKABLE_TO*4..5]->(b))
RETURN p

NB: nodes 5955 and 6497 are two nodes I picked about 2 miles apart; in this instance I opted for between 4 and 5 Baskets along the way.

However, I keep running into an out-of-memory exception, so I would like advice on how to reduce the memory demand of this problem so that it runs on an affordable server in an acceptable time of 1 to 6 seconds.

My understanding is that Neo4j would not perform a Cartesian product to find the solution, but would rather "pick each node and sniff around from each one until it finds a suitable-sized connection" (please forgive my phrasing!), so I'm confused about the heap memory error.

My thoughts for improving the program are to:

  1. Somehow restrict the path-finding part of the query to nodes within a bounding box determined by the positions of the start and end nodes (i.e., add 500 m in each direction, then limit the query to these nodes). However, I can't find any documentation on how to do this. Is it possible without having to create another spatial layer for each query?

  2. Rewrite the query in a way that doesn't cause a memory error. Is this easily doable?

  3. Stop using Neo4J for this entirely and write an algorithm to do it manually in another language. If so, what language would you recommend? C? C++ / C#? Or could I stick with Python / Ruby / Java / Go? (I was even thinking I might be able to do it quite effectively in PHP, but I'm not sure if that was a moment of madness.)
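As a rough illustration of idea 1 done outside the database, here is a minimal Python sketch. It assumes the Basket coordinates have already been exported as planar (x, y) values in metres and that the IS_WALKABLE_TO relationships are available as an adjacency dictionary; the names `coords`, `edges`, `nodes_in_padded_box` and `restrict_edges` are hypothetical and not part of Neo4j or Neo4j Spatial.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]  # (x, y) in metres; an assumed projected coordinate system


def nodes_in_padded_box(coords: Dict[int, Point], start: int, end: int,
                        pad_m: float = 500.0) -> Set[int]:
    """Return ids of nodes inside the start/end bounding box grown by pad_m."""
    (x1, y1), (x2, y2) = coords[start], coords[end]
    min_x, max_x = min(x1, x2) - pad_m, max(x1, x2) + pad_m
    min_y, max_y = min(y1, y2) - pad_m, max(y1, y2) + pad_m
    return {n for n, (x, y) in coords.items()
            if min_x <= x <= max_x and min_y <= y <= max_y}


def restrict_edges(edges: Dict[int, List[int]], keep: Set[int]) -> Dict[int, List[int]]:
    """Drop every node, and every relationship target, outside the kept set."""
    return {n: [m for m in nbrs if m in keep]
            for n, nbrs in edges.items() if n in keep}
```

Any path search would then run on the restricted adjacency dictionary only, so the densest clusters elsewhere in the city never get expanded.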

Any help and advice about how to tackle this would be much appreciated!

You might be better off refactoring this Cypher query into Java code in an unmanaged extension. Your Java code might then use either the Traversal API or GraphAlgoFactory.pathsWithLength().

I think that, because your graph is so densely connected, you easily end up with hundreds of millions of possible paths due to duplicate intermediate nodes.
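To get a feel for how quickly the number of candidates grows, here is a small Python sketch that counts directed walks of a fixed length without storing any of them. It is only an upper bound on what the Cypher query can return, since walks may reuse relationships while, as far as I know, Cypher does not reuse a relationship within one path. The adjacency dictionary and the `complete_digraph` helper are hypothetical stand-ins for a dense cluster of Baskets.

```python
from typing import Dict, List


def count_walks(edges: Dict[int, List[int]], start: int, end: int, hops: int) -> int:
    """Count directed walks with exactly `hops` relationships from start to end.

    Only one counter per node is kept, so memory stays small even when
    the number of walks is huge.
    """
    counts = {start: 1}  # walks of the current length ending at each node
    for _ in range(hops):
        nxt: Dict[int, int] = {}
        for node, ways in counts.items():
            for nbr in edges.get(node, []):
                nxt[nbr] = nxt.get(nbr, 0) + ways
        counts = nxt
    return counts.get(end, 0)


def complete_digraph(n: int) -> Dict[int, List[int]]:
    """Hypothetical dense cluster: every node links to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}
```

For a fully connected cluster of only 30 nodes, `count_walks(complete_digraph(30), 0, 1, 5)` gives 683705, which suggests why materialising every path between two nodes in a larger dense neighbourhood can exhaust the heap.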

You should add a LIMIT 100 to your query so that it stops searching once it has found that many paths.
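The same idea can be sketched outside Neo4j: enumerate paths lazily and stop pulling results after the first few, so that only the current path is ever held in memory. This is a minimal Python sketch over an assumed adjacency dictionary; `simple_paths` is a hypothetical helper, and unlike the Cypher pattern it also refuses to visit a node twice.

```python
from itertools import islice
from typing import Dict, Iterator, List


def simple_paths(edges: Dict[int, List[int]], start: int, end: int,
                 min_hops: int, max_hops: int) -> Iterator[List[int]]:
    """Lazily yield node sequences from start to end that use between
    min_hops and max_hops relationships and never repeat a node."""
    path = [start]
    on_path = {start}

    def walk(node: int) -> Iterator[List[int]]:
        hops = len(path) - 1
        if node == end:
            if hops >= min_hops:
                yield list(path)
            return  # the end node may only appear last
        if hops == max_hops:
            return  # depth bound reached without arriving at the end node
        for nbr in edges.get(node, []):
            if nbr in on_path:
                continue
            path.append(nbr)
            on_path.add(nbr)
            yield from walk(nbr)
            path.pop()
            on_path.remove(nbr)

    yield from walk(start)
```

Something like `list(islice(simple_paths(edges, 5955, 6497, 4, 5), 100))` would then play the role of LIMIT 100: the generator is simply abandoned once 100 routes have been produced.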

One other idea is to rewrite your query to first find distinct starting points around a (and potentially b).

start a = node(5955), b=node(6497) 
MATCH (a)-[:IS_WALKABLE_TO]->(a1)-[:IS_WALKABLE_TO]->(a2)
WITH a, b, a2, collect(a1) as first
MATCH p = shortestPath((a2)-[:IS_WALKABLE_TO*..2]->(b)) 
RETURN count(*)

// or, to return the node sequences instead of a count, replace the RETURN above with:
UNWIND first as a1
RETURN [a,a1] + nodes(p) as path
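For readers who want to try the same shape of computation outside Cypher, here is a rough Python sketch of the query above: group the first-hop nodes by the distinct second-hop node they reach, find one short tail from each second-hop node to b, then re-attach every first hop. The adjacency dictionary and both function names are hypothetical, and, like the query, the sketch does not check that a joined route avoids repeating a node.

```python
from collections import deque
from typing import Dict, List, Optional


def bfs_path(edges: Dict[int, List[int]], src: int, dst: int,
             max_hops: int) -> Optional[List[int]]:
    """Fewest-hops path from src to dst using at most max_hops relationships."""
    parent: Dict[int, Optional[int]] = {src: None}
    queue = deque([(src, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == dst:
            path: List[int] = []
            cur: Optional[int] = node
            while cur is not None:  # follow parent links back to src
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        if depth == max_hops:
            continue
        for nbr in edges.get(node, []):
            if nbr not in parent:
                parent[nbr] = node
                queue.append((nbr, depth + 1))
    return None


def routes_via_second_hop(edges: Dict[int, List[int]], a: int, b: int,
                          tail_hops: int = 2) -> List[List[int]]:
    """Mirror of the two-step query: a -> a1 -> a2, then a short tail a2 -> b."""
    first_by_a2: Dict[int, List[int]] = {}
    for a1 in edges.get(a, []):
        for a2 in edges.get(a1, []):
            first_by_a2.setdefault(a2, []).append(a1)  # like collect(a1)
    routes: List[List[int]] = []
    for a2, firsts in first_by_a2.items():
        tail = bfs_path(edges, a2, b, tail_hops)  # searched once per distinct a2
        if tail is None:
            continue
        for a1 in firsts:  # like UNWIND first AS a1
            routes.append([a, a1] + tail)
    return routes
```

The saving comes from running the tail search once per distinct a2 rather than once per (a1, a2) pair.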
