简体   繁体   English

Gremlin:AWS Neptune - 获取图中每个节点的所有叶节点作为 CSV

[英]Gremlin : AWS Neptune - Get all Leaf Nodes for each Node in the Graph as CSV

I have a simple graph that has Nodes which represent duplicate record id in the below form我有一个简单的图表,其中有代表以下形式的重复记录 ID 的节点

Duplicate Id, Original Id
A,B
B,C
C,D
X,Y
Y,Z

The directed graph looks like A -> B ->C ->D and I want CSV result that looks like below that has each Node with ultimate leaf node with no more outgoing edges有向图看起来像 A -> B ->C ->D 我想要 CSV 结果如下所示,每个节点都有最终叶节点,没有更多的传出边

A,D
B,D
C,D
X,Z
Y,Z

The above is a simplistic scenario to explain the problem however actual data will have more complex scenarios like below where I have 24 Nodes from A to X with each node connect to other 23 having 24C2=276 directed edges and for each of the 24 nodes I require the ultimate node that has no more outgoing edges.上面是一个解释问题的简单场景,但是实际数据会有更复杂的场景,如下所示,我有从 A 到 X 的 24 个节点,每个节点连接到其他 23 个具有 24C2=276 条有向边的节点,对于 24 个节点中的每一个,我需要没有更多出边的最终节点。

A,B
A,C
A,D
A,E
A,F
A,G
A,H
A,I
A,J
A,K
A,L
A,M
A,N
A,O
A,P
A,Q
A,R
A,S
A,T
A,U
A,V
A,W
A,X
B,C
B,D
B,E
B,F
B,G
B,H
B,I
B,J
B,K
B,L
B,M
B,N
B,O
B,P
B,Q
B,R
B,S
B,T
B,U
B,V
B,W
B,X
C,D
C,E
C,F
C,G
C,H
C,I
C,J
C,K
C,L
C,M
C,N
C,O
C,P
C,Q
C,R
C,S
C,T
C,U
C,V
C,W
C,X
D,E
D,F
D,G
D,H
D,I
D,J
D,K
D,L
D,M
D,N
D,O
D,P
D,Q
D,R
D,S
D,T
D,U
D,V
D,W
D,X
E,F
E,G
E,H
E,I
E,J
E,K
E,L
E,M
E,N
E,O
E,P
E,Q
E,R
E,S
E,T
E,U
E,V
E,W
E,X
F,G
F,H
F,I
F,J
F,K
F,L
F,M
F,N
F,O
F,P
F,Q
F,R
F,S
F,T
F,U
F,V
F,W
F,X
G,H
G,I
G,J
G,K
G,L
G,M
G,N
G,O
G,P
G,Q
G,R
G,S
G,T
G,U
G,V
G,W
G,X
H,I
H,J
H,K
H,L
H,M
H,N
H,O
H,P
H,Q
H,R
H,S
H,T
H,U
H,V
H,W
H,X
I,J
I,K
I,L
I,M
I,N
I,O
I,P
I,Q
I,R
I,S
I,T
I,U
I,V
I,W
I,X
J,K
J,L
J,M
J,N
J,O
J,P
J,Q
J,R
J,S
J,T
J,U
J,V
J,W
J,X
K,L
K,M
K,N
K,O
K,P
K,Q
K,R
K,S
K,T
K,U
K,V
K,W
K,X
L,M
L,N
L,O
L,P
L,Q
L,R
L,S
L,T
L,U
L,V
L,W
L,X
M,N
M,O
M,P
M,Q
M,R
M,S
M,T
M,U
M,V
M,W
M,X
N,O
N,P
N,Q
N,R
N,S
N,T
N,U
N,V
N,W
N,X
O,P
O,Q
O,R
O,S
O,T
O,U
O,V
O,W
O,X
P,Q
P,R
P,S
P,T
P,U
P,V
P,W
P,X
Q,R
Q,S
Q,T
Q,U
Q,V
Q,W
Q,X
R,S
R,T
R,U
R,V
R,W
R,X
S,T
S,U
S,V
S,W
S,X
T,U
T,V
T,W
T,X
U,V
U,W
U,X
V,W
V,X
W,X

Here is a graphical visualization using Neo4j -这是使用 Neo4j 的图形可视化 - 视觉图

For the above case each Node A to W will have X as the ultimate Leaf Node.对于上述情况,每个节点 A 到 W 都将 X 作为最终叶节点。

There will also be Cyclic loops which I need to avoid in the overall solution.在整个解决方案中还会有我需要避免的循环。 It may be too much to have everything in one step however will appreciate guidance.一步完成所有事情可能太多了,但是会感谢指导。

Update: 2020-10-15 Traversal Optimization need to optimize execution in finding Path from starting Node to Leaf Node更新:2020-10-15 Traversal Optimization需要优化执行寻找从起始节点到叶节点的路径

For the Data Scenario below, The result for Vertex A and B should be对于下面的数据场景,顶点 A 和 B 的结果应该是

A,G
A,H
B,G
B,H

在此处输入图像描述

The shortest path from A to G is A->C->E->G; A到G的最短路径是A->C->E->G; is it possible to suppress any further traversals from A to G when any 1 shortest path to a leaf node is found?当找到到叶节点的任何 1 条最短路径时,是否可以抑制从 A 到 G 的任何进一步遍历? else it will lead to unwanted execution especially for larger clusters of connected Nodes.否则会导致不必要的执行,尤其是对于连接节点的较大集群。 This steps needs to be repeated again from A to H whose path will be A->C->F->H and then prevent any further attempts to find paths between A and H.需要从 A 到 H 再次重复此步骤,其路径将为 A->C->F->H,然后阻止任何进一步尝试寻找 A 和 H 之间的路径。

Thanks谢谢

Based on what you have provided, the following graph can be built.根据您提供的内容,可以构建以下图表。

g.addV('A').as('a').
  addV('B').as('b').
  addV('C').as('c').
  addV('D').as('d').
  addV('X').as('x').
  addV('Y').as('y').
  addV('Z').as('z').
  addE('dup').from('a').to('b').
  addE('dup').from('b').to('c').
  addE('dup').from('c').to('d').
  addE('dup').from('x').to('y').
  addE('dup').from('y').to('z').
  iterate()  

and the following query used to find the results以及用于查找结果的以下查询

gremlin> g.V().
           repeat(out().simplePath()).
           until(__.not(out())).
           path().
             by(label).
           local(
             unfold().
             union(
               limit(1),
               tail()).
             fold())


==>[A,D]
==>[B,D]
==>[C,D]
==>[X,Z]
==>[Y,Z] 

UPDATED Oct 11th 2020 2020 年 10 月 11 日更新

Using the larger data set you provided, I tweaked the query a bit so that for each starting vertex only one path to a leaf is found.使用您提供的更大的数据集,我稍微调整了查询,以便对于每个起始顶点,只找到一条到叶子的路径。 This runs very quickly.这运行得非常快。 Without doing this, starting at say 'A', there are literally millions of unique paths that end up at 'X' which is why the previous query becomes so complex.如果不这样做,从“A”开始,实际上会有数百万个以“X”结束的唯一路径,这就是为什么前面的查询变得如此复杂。

gremlin> g.V().
......1>   local(
......2>     repeat(out().simplePath()).
......3>     until(__.not(out())).
......4>     path().
......5>       by(label).
......6>       limit(1)).
......7>   local(
......8>     unfold().
......9>     union(
.....10>       limit(1),
.....11>       tail()).
.....12>     fold())

==>[A,X]
==>[B,X]
==>[C,X]
==>[D,X]
==>[E,X]
==>[F,X]
==>[G,X]
==>[H,X]
==>[I,X]
==>[J,X]
==>[K,X]
==>[L,X]
==>[M,X]
==>[N,X]
==>[O,X]
==>[P,X]
==>[Q,X]
==>[R,X]
==>[S,X]
==>[T,X]
==>[U,X]
==>[V,X]
==>[W,X]

Purely for interest the following query shows the highly connected fan out of the graph.纯粹出于兴趣,以下查询显示了图中高度连接的扇形。

gremlin> g.V().group().by(label).by(local(out().label().order().fold())).unfold()

==>A=[B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>B=[C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>C=[D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>D=[E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>E=[F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>F=[G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>G=[H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>H=[I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>I=[J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>J=[K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>K=[L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>L=[M, N, O, P, Q, R, S, T, U, V, W, X]
==>M=[N, O, P, Q, R, S, T, U, V, W, X]
==>N=[O, P, Q, R, S, T, U, V, W, X]
==>O=[P, Q, R, S, T, U, V, W, X]
==>P=[Q, R, S, T, U, V, W, X]
==>Q=[R, S, T, U, V, W, X]
==>R=[S, T, U, V, W, X]
==>S=[T, U, V, W, X]
==>T=[U, V, W, X]
==>U=[V, W, X]
==>V=[W, X]
==>W=[X]
==>X=[]

and the counts (multiplying out those numbers gives a large number) which explains why finding all paths from 'A' is an expensive query.和计数(将这些数字相乘得到一个很大的数字)这解释了为什么从“A”找到所有路径是一个昂贵的查询。 Note that the simplePath step helps us by making sure we do not follow any cycles.请注意, simplePath步骤通过确保我们不遵循任何循环来帮助我们。 The number of paths from any vertex to 'X' in this data set ends up being 2^(C-1) where C is the number in the list below for a given start.此数据集中从任何顶点到“X”的路径数最终为 2^(C-1),其中 C 是下面列表中给定起点的编号。

gremlin> g.V().group().by(label).by(local(out().count())).unfold()

==>A=23
==>B=22
==>C=21
==>D=20
==>E=19
==>F=18
==>G=17
==>H=16
==>I=15
==>J=14
==>K=13
==>L=12
==>M=11
==>N=10
==>O=9
==>P=8
==>Q=7
==>R=6
==>S=5
==>T=4
==>U=3
==>V=2
==>W=1
==>X=0

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM