
Chaining values of two columns in Spark DataFrame (or DataSet)

I have a table with two columns like below:

+---+---+
|a  |b  |
+---+---+
|1  |2  |
|2  |3  |
|3  |4  |
|7  |8  |
|8  |9  |
+---+---+

I would like to chain the rows where row1.b == row2.a and add (row1.a, row2.b) to the dataframe. For example, (1, 2) and (2, 3) add (1, 3) to the list. This has to continue until pairs like (1, 4), which results from (1, 3) and (3, 4), have also been added to the dataframe.
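To make the expected result concrete, here is a minimal in-memory sketch of that chaining rule in plain Scala, without Spark. It treats the rows as a set of pairs and keeps joining the set with itself until nothing new appears. The names `ChainPairs` and `closure` are made up for this illustration, and it only shows what the output should be, not how to scale it.

```scala
object ChainPairs {
  // Join the pair set with itself on b == c and add (a, d);
  // repeat until the set stops growing (a fixed point).
  @annotation.tailrec
  def closure(pairs: Set[(Int, Int)]): Set[(Int, Int)] = {
    val derived = for {
      (a, b) <- pairs
      (c, d) <- pairs
      if b == c
    } yield (a, d)
    val next = pairs ++ derived
    if (next == pairs) pairs else closure(next)
  }

  def main(args: Array[String]): Unit = {
    val input = Set((1, 2), (2, 3), (3, 4), (7, 8), (8, 9))
    // Only the pairs that were not already in the table.
    val added = closure(input) -- input
    println(added.toList.sorted) // List((1,3), (1,4), (2,4), (7,9))
  }
}
```

So for the sample table the rows to add are (1, 3), (1, 4), (2, 4) and (7, 9).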

I can do this by repeating a self join and calling count() until the list stops growing. However, I am looking for a smarter way to do it without count(), which is an action and triggers a job on every iteration.

This has more to do with graph handling than with dataframes. Spark has the GraphX library for graph processing, and the GraphFrames package (used in the code here) provides a DataFrame-based API for the same kind of work. More or less, you want to find the connected components of a graph structure.

If you have edgeDF, an edge dataframe, as:

+---+---+
|src|dst|
+---+---+
|1  |2  |
|2  |3  |
|3  |4  |
|7  |8  |
|8  |9  |
+---+---+

and vertexDF as:

+---+
|id |
+---+
|1  |
|2  |
|3  |
|4  |
|7  |
|8  |
|9  |
+---+

and your graph is:

import org.graphframes.GraphFrame
val g = GraphFrame(vertexDF, edgeDF)

then you can run connected components on it:

val cc = g.connectedComponents.run()
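For anyone who wants to try this end to end, here is a rough sketch that builds both dataframes from the sample data and runs the algorithm. A few things here are assumptions and not part of the answer: the local master, the object and method names, and the checkpoint path `/tmp/graphframes-checkpoints`. As far as I know, the default connected components algorithm in GraphFrames expects a checkpoint directory to be set, which is why the sketch sets one. The GraphFrames package has to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

object ConnectedComponentsSketch {
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Assumption: a scratch directory for the intermediate checkpoints
    // written by the default connected-components algorithm.
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    val edgeDF = Seq((1, 2), (2, 3), (3, 4), (7, 8), (8, 9)).toDF("src", "dst")

    // Every value that appears as src or dst becomes a vertex.
    val vertexDF = edgeDF.select($"src".as("id"))
      .union(edgeDF.select($"dst".as("id")))
      .distinct()

    val g = GraphFrame(vertexDF, edgeDF)
    g.connectedComponents.run()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cc-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    run(spark).orderBy("id").show(false)
    spark.stop()
  }
}
```

The component ids you get will probably differ from any printed sample; only the grouping matters.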

and it will give you something like this:

+---+------------+
|id |component   |
+---+------------+
|1  |171798691840|
|2  |171798691840|
|3  |171798691840|
|4  |171798691840|
|7  |807453851648|
|8  |807453851648|
|9  |807453851648|
+---+------------+ 

This means that [1, 2, 3, 4] are in the same component, and [7, 8, 9] form their own component.
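The component column groups the ids but does not list the chained pairs the question asks for. One way to bridge that gap, sketched here in plain Scala on the sample rows, is to pair every id with every larger id inside the same component. This leans on two assumptions that hold for the sample but not in general: each component is a simple chain, and every row has a < b. Connected components ignores edge direction, so on branching or cyclic data this would produce pairs that the chaining rule never would. `PairsFromComponents` and `chainedPairs` are hypothetical names.

```scala
object PairsFromComponents {
  // Within each component, pair every id with every larger id.
  // Assumption: components are simple chains whose edges go from small to large.
  def chainedPairs(components: Seq[(Int, Long)]): Set[(Int, Int)] =
    components
      .groupBy { case (_, component) => component }
      .values
      .flatMap { members =>
        val ids = members.map(_._1).sorted
        for { a <- ids; b <- ids if a < b } yield (a, b)
      }
      .toSet

  def main(args: Array[String]): Unit = {
    // (id, component) rows copied from the sample output.
    val rows = Seq(
      (1, 171798691840L), (2, 171798691840L), (3, 171798691840L),
      (4, 171798691840L), (7, 807453851648L), (8, 807453851648L),
      (9, 807453851648L)
    )
    val edges = Set((1, 2), (2, 3), (3, 4), (7, 8), (8, 9))
    // Drop the pairs that were already rows of the original table.
    val added = chainedPairs(rows) -- edges
    println(added.toList.sorted) // List((1,3), (1,4), (2,4), (7,9))
  }
}
```

On a DataFrame the same idea would be a self join of the result on component with a filter that keeps rows where the left id is smaller than the right id.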


 