
Chaining values of two columns in Spark DataFrame (or DataSet)

I have a table with two columns like below:

| a | b |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 7 | 8 |
| 8 | 9 |

I would like to chain rows where row1.b == row2.a and add (row1.a, row2.b) to the DataFrame. For example, (1, 2) and (2, 3) produce (1, 3). This has to continue until rows like (1, 4), the result of (1, 3) and (3, 4), have been added to the DataFrame.

I can do this with count() and a repeated self join, stopping when the result no longer grows. However, I am looking for a smarter way to do it without count(), which is an action and basically collects data.
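For reference, a minimal sketch of that self-join loop (assuming the source DataFrame is called df with columns a and b; the helper name is illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Repeatedly join result.b against df.a, union the new (a, b) pairs in,
// and stop once the row count no longer grows.
def chainWithSelfJoins(spark: SparkSession, df: DataFrame): DataFrame = {
  import spark.implicits._

  var result = df
  var previousCount = -1L
  var currentCount = result.count() // the action we would like to avoid

  while (currentCount != previousCount) {
    previousCount = currentCount
    // New pairs (row1.a, row2.b) where row1.b == row2.a
    val newPairs = result.as("l")
      .join(df.as("r"), $"l.b" === $"r.a")
      .select($"l.a".as("a"), $"r.b".as("b"))
    result = result.union(newPairs).distinct()
    currentCount = result.count() // another action on every iteration
  }
  result
}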

This has more to do with graph handling than with DataFrames. Spark has the GraphX library for graph processing, and the GraphFrames package provides a DataFrame-based API on top of it. More or less, you want to find the connected components of a graph.

If you have the edge DataFrame edgeDF as:

+---+---+
|src|dst|
+---+---+
|1  |2  |
|2  |3  |
|3  |4  |
|7  |8  |
|8  |9  |
+---+---+

and vertexDF as:

+---+
|id |
+---+
|1  |
|2  |
|3  |
|4  |
|7  |
|8  |
|9  |
+---+

then your graph is:

val g = GraphFrame(vertexDF, edgeDF)

then you can run connected components on it:

val cc = g.connectedComponents.run()

and it will give you something like this:

+---+------------+
|id |component   |
+---+------------+
|1  |171798691840|
|2  |171798691840|
|3  |171798691840|
|4  |171798691840|
|7  |807453851648|
|8  |807453851648|
|9  |807453851648|
+---+------------+ 

This means that [1, 2, 3, 4] are in the same component, while [7, 8, 9] form their own component.
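Tying this back to the original two-column table, here is a minimal sketch that derives vertexDF and edgeDF from the source frame, runs connected components, and turns the result back into chained (a, b) pairs. The frame name df, the helper name, and the checkpoint path are assumptions, and the final self-join relies on the fact that in this sample every edge goes from a smaller to a larger value:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

def chainedPairs(spark: SparkSession, df: DataFrame): DataFrame = {
  import spark.implicits._

  // GraphFrames expects vertices with an `id` column and edges with `src`/`dst`.
  val edgeDF   = df.select($"a".as("src"), $"b".as("dst"))
  val vertexDF = df.select($"a".as("id")).union(df.select($"b".as("id"))).distinct()

  // The default connectedComponents implementation needs a checkpoint directory.
  spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoint")

  val g  = GraphFrame(vertexDF, edgeDF)
  val cc = g.connectedComponents.run() // columns: id, component

  // Pair up every two vertices that share a component; ordering the pair by id
  // reproduces the chained rows such as (1, 3) and (1, 4) for this data.
  cc.as("x")
    .join(cc.as("y"), $"x.component" === $"y.component" && $"x.id" < $"y.id")
    .select($"x.id".as("a"), $"y.id".as("b"))
}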
