Convert rows to columns in Spark 2.3 using Java 8

Question

I am new to Spark 2.4 with Java 8 and need some help. Here is an example:

Source DataFrame

+-----+---------+
| key | Value   |
+-----+---------+
| A   | John    |
| B   | Nick    |
| A   | Mary    |
| B   | Kathy   |
| C   | Sabrina |
| B   | George  |
+-----+---------+

Meta DataFrame

+-----+
| key |
+-----+
| A   |
| B   |
| C   |
| D   |
| E   |
| F   |
+-----+

I want to transform it into the following: the column names come from the Meta DataFrame, and the rows are filled in from the Source DataFrame.

+------+--------+---------+------+------+------+
| A    | B      | C       | D    | E    | F    |
+------+--------+---------+------+------+------+
| John | Nick   | Sabrina | null | null | null |
| Mary | Kathy  | null    | null | null | null |
| null | George | null    | null | null | null |
+------+--------+---------+------+------+------+

I need to write this as Spark 2.3 code in Java 8. Thanks for your help.

Answer 1

To make things clearer (and easy to reproduce), let's define the dataframes:

val df1 = Seq("A" -> "John", "B" -> "Nick", "A" -> "Mary", 
              "B" -> "Kathy", "C" -> "Sabrina", "B" -> "George")
          .toDF("key", "value")
val df2 = Seq("A", "B", "C", "D", "E", "F").toDF("key")

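As I understand it, you are trying to create one column per value in the key column of df2. Each column should contain all the values of the value column that are associated with the key naming that column. To take an example, the first value of column A should be the value of the first occurrence of A (if it exists, null otherwise): "John". Its second value should be the value of the second occurrence of A: "Mary". There is no third occurrence, so the third value of the column should be null.

I spelled this out to show that we need a notion of rank of the values within each key (a window function), and then need to group by that rank. It goes as follows: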

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df1_win = df1
    .withColumn("id", monotonically_increasing_id)
    .withColumn("rank", rank() over Window.partitionBy("key").orderBy("id"))
// the id is just here to maintain the original order.

// getting the keys in df2. Add distinct if there are duplicates.
val keys = df2.collect.map(_.getAs[String](0)).sorted

// then it's just about pivoting
df1_win
    .groupBy("rank")
    .pivot("key", keys) 
    .agg(first('value))
    .orderBy("rank")
    //.drop("rank") // kept here for clarity
    .show()
+----+----+------+-------+----+----+----+
|rank|   A|     B|      C|   D|   E|   F|
+----+----+------+-------+----+----+----+
|   1|John|  Nick|Sabrina|null|null|null|
|   2|Mary| Kathy|   null|null|null|null|
|   3|null|George|   null|null|null|null|
+----+----+------+-------+----+----+----+

Here is the same code in Java:

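import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.expressions.Window;

// the id is just here to maintain the original order.
Dataset<Row> df1_win = df1
    .withColumn("id", functions.monotonically_increasing_id())
    .withColumn("rank", functions.rank().over(Window.partitionBy("key").orderBy("id")));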

// getting the keys in df2. Add distinct if there are duplicates.
// Note that it is a list of objects, to match the (strange) signature of pivot
List<Object> keys = df2.collectAsList().stream()
    .map(x -> x.getString(0))
    .sorted().collect(Collectors.toList());

// then it's just about pivoting
df1_win
    .groupBy("rank")
    .pivot("key", keys)
    .agg(functions.first(functions.col("value")))
    .orderBy("rank")
    // .drop("rank") // kept here for clarity
    .show();
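
To check the rank-then-pivot logic independently of Spark, here is a minimal plain-Java sketch of the same idea on in-memory lists. It is only an illustration, not part of the Spark solution: the class name PivotSketch and the method pivot are hypothetical, and list order stands in for the id column that preserves the original row order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, plain Java 8, no Spark involved.
class PivotSketch {

    // source: {key, value} pairs in their original order.
    // keys: the column names to produce (the role of df2).
    // Row r of the result holds, for each key, the (r+1)-th value seen
    // for that key, or null if the key has fewer occurrences.
    static List<List<String>> pivot(List<String[]> source, List<String> keys) {
        // Collect values per key in encounter order; the position in each
        // list plays the role of rank() over (partition by key order by id).
        Map<String, List<String>> byKey = new HashMap<>();
        for (String[] kv : source) {
            byKey.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }

        // One output row per rank that occurs anywhere in the source.
        int rows = 0;
        for (List<String> values : byKey.values()) {
            rows = Math.max(rows, values.size());
        }

        List<List<String>> result = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            List<String> row = new ArrayList<>();
            for (String key : keys) {
                List<String> values = byKey.get(key);
                // Keys absent from the source, or exhausted, yield null.
                row.add(values != null && r < values.size() ? values.get(r) : null);
            }
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> source = Arrays.asList(
            new String[] {"A", "John"}, new String[] {"B", "Nick"},
            new String[] {"A", "Mary"}, new String[] {"B", "Kathy"},
            new String[] {"C", "Sabrina"}, new String[] {"B", "George"});
        List<String> keys = Arrays.asList("A", "B", "C", "D", "E", "F");

        for (List<String> row : pivot(source, keys)) {
            System.out.println(row);
        }
    }
}
```

With the sample data from the question, this prints the same three rows as the Spark output above, without the rank column. Spark is still the right tool once the data no longer fits in memory; the sketch only shows why grouping by per-key rank yields the desired layout.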

Solution 1
3 2019-09-26 09:48:21
