I have a DataFrame whose only column is an array. For each array, I want to generate every pair of its elements and store the pairs in a new DataFrame, one pair per row. For example, this is the original DataFrame:
+---------------+
|  candidateList|
+---------------+
|         [1, 2]|
|      [2, 3, 4]|
|      [1, 3, 5]|
|[1, 2, 3, 4, 5]|
|[1, 2, 3, 4, 5]|
+---------------+
And this is how it should look after the computation:
+---------------+
|     candidates|
+---------------+
|         [1, 2]|
|         [2, 3]|
|         [2, 4]|
|         [3, 4]|
|         [1, 3]|
|         [1, 5]|
|         [3, 5]|
|   and so on...|
+---------------+
I don't know how to do this in Spark; maybe someone has a tip for me.
Kind regards
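You'll need to create a UDF (user-defined function) and use it with the explode
function. The UDF itself is simple thanks to the combinations method of Scala
collections: combinations(2) yields every 2-element combination of the array, and
explode then turns each of those pairs into its own row: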
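import scala.collection.mutable
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark` is in scope

// The UDF receives each array as a Seq and returns all of its 2-element combinations
val pairsUdf = udf((arr: mutable.Seq[Int]) => arr.combinations(2).toArray)

// explode turns each pair in the returned array into a separate row
val result = df.select(explode(pairsUdf($"candidateList")) as "candidates")
result.show(numRows = 8)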
// +----------+
// |candidates|
// +----------+
// |    [1, 2]|
// |    [2, 3]|
// |    [2, 4]|
// |    [3, 4]|
// |    [1, 3]|
// |    [1, 5]|
// |    [3, 5]|
// |    [1, 2]|
// +----------+
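To see what the UDF body does on its own, here is a minimal sketch that applies combinations(2) to ordinary Scala sequences, with no Spark session needed. The helper name pairs is hypothetical and only used for illustration; it is not part of the answer above. The sketch also shows two edge cases that may matter for real data: arrays with fewer than two elements, and arrays with repeated values.

```scala
// Hypothetical helper (not from the answer): the same logic as the UDF body,
// applied to a plain Scala Seq so it can be tried without Spark.
def pairs(xs: Seq[Int]): List[Seq[Int]] = xs.combinations(2).toList

// Each pair keeps the element order of the input.
val fromThree = pairs(Seq(2, 3, 4))   // List(List(2, 3), List(2, 4), List(3, 4))

// Fewer than two elements: no pairs at all.
val fromOne = pairs(Seq(7))           // List()

// Repeated values: combinations treats the input as a multiset,
// so the pair (1, 2) appears only once even though 1 occurs twice.
val fromDupes = pairs(Seq(1, 1, 2))   // List(List(1, 1), List(1, 2))
```

In the Spark version, an array that produces no pairs contributes no rows to the result, because explode emits nothing for an empty array. If the input arrays can contain duplicates and you need every positional pair, combinations is probably not the right tool and an index-based approach would be needed instead.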