
Spark Dataframe - Get all lists of pairs (Scala)

I have the following situation: I have a dataframe whose only column is of array type. For each array, I want to generate all pairs of its elements and save the result in a new dataframe. So for example:

This is the original dataframe:

+---------------+
|  candidateList|
+---------------+
|         [1, 2]|
|      [2, 3, 4]|
|      [1, 3, 5]|
|[1, 2, 3, 4, 5]|
|[1, 2, 3, 4, 5]|
+---------------+

And this is how it has to look after the computation:

+---------------+
|  candidates   |
+---------------+
|         [1, 2]|
|         [2, 3]|
|         [2, 4]|
|         [3, 4]|
|         [1, 3]|
|         [1, 5]|
|         [3, 5]|
|and so on...   |
+---------------+

I really don't know how to do this in Spark; maybe someone has a tip for me.

Kind regards

You'll need to create a UDF (user-defined function) and use it with the explode function. The UDF itself is simple thanks to the combinations method on Scala collections:

import scala.collection.mutable
import org.apache.spark.sql.functions._
import spark.implicits._

// Spark passes an array column into a UDF as a mutable.Seq
// (a WrappedArray), so the parameter is typed accordingly.
val pairsUdf = udf((arr: mutable.Seq[Int]) => arr.combinations(2).toArray)

// explode turns the array of pairs into one row per pair
val result = df.select(explode(pairsUdf($"candidateList")) as "candidates")

result.show(numRows = 8)
// +----------+
// |candidates|
// +----------+
// |    [1, 2]|
// |    [2, 3]|
// |    [2, 4]|
// |    [3, 4]|
// |    [1, 3]|
// |    [1, 5]|
// |    [3, 5]|
// |    [1, 2]|
// +----------+
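The heavy lifting here is done by combinations, which is plain Scala and works outside of Spark too. A minimal sketch of what the UDF computes for a single array (no SparkSession needed):

```scala
// combinations(2) yields each unordered pair of distinct positions
// exactly once, preserving the order of elements in the source sequence.
object PairsDemo {
  def main(args: Array[String]): Unit = {
    val pairs = Seq(2, 3, 4).combinations(2).toList
    println(pairs) // List(List(2, 3), List(2, 4), List(3, 4))
  }
}
```

Note that combinations treats equal elements as indistinguishable, so an array with duplicate values produces each distinct pair only once; if your arrays can contain duplicates and you need every positional pair, pair up the indices instead.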
