
Chaining values of two columns in Spark DataFrame (or DataSet)

I have a table with two columns like below:

| a | b |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 7 | 8 |
| 8 | 9 |

I would like to chain rows where row1.b == row2.a and add (row1.a, row2.b) to the DataFrame. For example, (1, 2) and (2, 3) produce (1, 3). This has to continue until rows like (1, 4), the result of (1, 3) and (3, 4), have been added to the DataFrame.

I can do this with count() and a repeated self join, stopping when the result no longer grows. However, I am looking for a smarter way to do it without count(), which is an action and basically collects data.
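For reference, a minimal sketch of that self-join loop (assuming the source DataFrame is called df with columns a and b; the helper name is illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Repeatedly join result.b against df.a, union the new (a, b) pairs in,
// and stop once the row count no longer grows.
def chainWithSelfJoins(spark: SparkSession, df: DataFrame): DataFrame = {
  import spark.implicits._

  var result = df
  var previousCount = -1L
  var currentCount = result.count() // the action we would like to avoid

  while (currentCount != previousCount) {
    previousCount = currentCount
    // New pairs (row1.a, row2.b) where row1.b == row2.a
    val newPairs = result.as("l")
      .join(df.as("r"), $"l.b" === $"r.a")
      .select($"l.a".as("a"), $"r.b".as("b"))
    result = result.union(newPairs).distinct()
    currentCount = result.count() // another action on every iteration
  }
  result
}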

This has more to do with graph handling than with DataFrames. Spark has the GraphX library for graph processing, and the GraphFrames package provides a DataFrame-based API on top of it. More or less, you want to find the connected components of a graph.

If you have the edge DataFrame edgeDF as:

+---+---+
|src|dst|
+---+---+
|1  |2  |
|2  |3  |
|3  |4  |
|7  |8  |
|8  |9  |
+---+---+

and vertexDF as:

+---+
|id |
+---+
|1  |
|2  |
|3  |
|4  |
|7  |
|8  |
|9  |
+---+

then your graph is:

val g = GraphFrame(vertexDF, edgeDF)

then you can run connected components on it:

val cc = g.connectedComponents.run()

and it will give you something like this:

+---+------------+
|id |component   |
+---+------------+
|1  |171798691840|
|2  |171798691840|
|3  |171798691840|
|4  |171798691840|
|7  |807453851648|
|8  |807453851648|
|9  |807453851648|
+---+------------+ 

This means that [1, 2, 3, 4] are in the same component, while [7, 8, 9] form their own component.
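Tying this back to the original two-column table, here is a minimal sketch that derives vertexDF and edgeDF from the source frame, runs connected components, and turns the result back into chained (a, b) pairs. The frame name df, the helper name, and the checkpoint path are assumptions, and the final self-join relies on the fact that in this sample every edge goes from a smaller to a larger value:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

def chainedPairs(spark: SparkSession, df: DataFrame): DataFrame = {
  import spark.implicits._

  // GraphFrames expects vertices with an `id` column and edges with `src`/`dst`.
  val edgeDF   = df.select($"a".as("src"), $"b".as("dst"))
  val vertexDF = df.select($"a".as("id")).union(df.select($"b".as("id"))).distinct()

  // The default connectedComponents implementation needs a checkpoint directory.
  spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoint")

  val g  = GraphFrame(vertexDF, edgeDF)
  val cc = g.connectedComponents.run() // columns: id, component

  // Pair up every two vertices that share a component; ordering the pair by id
  // reproduces the chained rows such as (1, 3) and (1, 4) for this data.
  cc.as("x")
    .join(cc.as("y"), $"x.component" === $"y.component" && $"x.id" < $"y.id")
    .select($"x.id".as("a"), $"y.id".as("b"))
}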
