
Spark 2.3 with Java 8: transform rows to columns

I am new to Spark 2.4 with Java 8 and I need help. Here is an example of the data:

Source DataFrame

+--------------+
| key | Value  |
+--------------+
| A   | John   |
| B   | Nick   |
| A   | Mary   |
| B   | Kathy  |
| C   | Sabrina|
| B   | George |
+--------------+

Meta DataFrame

+-----+
| key |
+-----+
| A   |
| B   |
| C   |
| D   |
| E   |
| F   |
+-----+

I would like to transform it into the following: the column names come from the Meta DataFrame, and the rows are built from the values in the Source DataFrame:

+-----------------------------------------------+
| A    | B      | C       | D     | E    | F    |
+-----------------------------------------------+
| John | Nick   | Sabrina | null  | null | null |
| Mary | Kathy  | null    | null  | null | null |
| null | George | null    | null  | null | null |
+-----------------------------------------------+

I need to write this in Spark 2.3 with Java 8. Your help is appreciated.

To make things clearer (and easily reproducible), let's define the dataframes:

import spark.implicits._ // needed for toDF when not running in the spark-shell

val df1 = Seq("A" -> "John", "B" -> "Nick", "A" -> "Mary", 
              "B" -> "Kathy", "C" -> "Sabrina", "B" -> "George")
          .toDF("key", "value")
val df2 = Seq("A", "B", "C", "D", "E", "F").toDF("key")

From what I see, you are trying to create one column per value in the key column of df2 . These columns should contain all the values of the value column that are associated with the key naming the column. For example, column A 's first value should be the value of the first occurrence of A (if it exists, null otherwise): "John" . Its second value should be the value of the second occurrence of A: "Mary" . There is no third occurrence of A, so the third value of the column should be null .

I detailed it to show that we need a notion of rank of the values for each key (a windowing function), and then group by that rank. It would go as follows:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._ // for monotonically_increasing_id and rank

val df1_win = df1
    .withColumn("id", monotonically_increasing_id)
    .withColumn("rank", rank() over Window.partitionBy("key").orderBy("id"))
// the id is just here to maintain the original order.

// getting the keys in df2. Add distinct if there are duplicates.
val keys = df2.collect.map(_.getAs[String](0)).sorted

// then it's just about pivoting
df1_win
    .groupBy("rank")
    .pivot("key", keys) 
    .agg(first('value))
    .orderBy("rank")
    //.drop("rank") // kept here for clarity
    .show()
+----+----+------+-------+----+----+----+                                       
|rank|   A|     B|      C|   D|   E|   F|
+----+----+------+-------+----+----+----+
|   1|John|  Nick|Sabrina|null|null|null|
|   2|Mary| Kathy|   null|null|null|null|
|   3|null|George|   null|null|null|null|
+----+----+------+-------+----+----+----+
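
To see why the rank works as a grouping key, here is roughly what df1_win contains before the pivot (the id column is omitted because monotonically_increasing_id produces non-deterministic values; only their relative order matters):

+---+-------+----+
|key|  value|rank|
+---+-------+----+
|  A|   John|   1|
|  A|   Mary|   2|
|  B|   Nick|   1|
|  B|  Kathy|   2|
|  B| George|   3|
|  C|Sabrina|   1|
+---+-------+----+

The pivot then places each value under the column named by its key, producing one output row per rank.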

Here is the very same code in Java:

import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.functions;

Dataset<Row> df1_win = df1
    .withColumn("id", functions.monotonically_increasing_id())
    .withColumn("rank", functions.rank().over(Window.partitionBy("key").orderBy("id")));
// the id is just here to maintain the original order.

// getting the keys in df2. Add distinct if there are duplicates.
// Note that it is a list of objects, to match the (strange) signature of pivot
List<Object> keys = df2.collectAsList().stream()
    .map(x -> x.getString(0))
    .sorted().collect(Collectors.toList());

// then it's just about pivoting
df1_win
    .groupBy("rank")
    .pivot("key", keys)
    .agg(functions.first(functions.col("value")))
    .orderBy("rank")
    // .drop("rank") // kept here for clarity
    .show();
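
The Java snippet above assumes df1 and df2 already exist as Dataset<Row>. If you also want to build the same test data directly in Java rather than Scala, a minimal sketch could look like this (it assumes a SparkSession named spark, which is not part of the original code):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// assumption: a SparkSession is already available, e.g.
// SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

// schema and rows for the source DataFrame (key, value)
StructType sourceSchema = new StructType()
        .add("key", DataTypes.StringType)
        .add("value", DataTypes.StringType);
List<Row> sourceRows = Arrays.asList(
        RowFactory.create("A", "John"), RowFactory.create("B", "Nick"),
        RowFactory.create("A", "Mary"), RowFactory.create("B", "Kathy"),
        RowFactory.create("C", "Sabrina"), RowFactory.create("B", "George"));
Dataset<Row> df1 = spark.createDataFrame(sourceRows, sourceSchema);

// schema and rows for the meta DataFrame (key only)
StructType metaSchema = new StructType().add("key", DataTypes.StringType);
List<Row> metaRows = Arrays.asList(
        RowFactory.create("A"), RowFactory.create("B"), RowFactory.create("C"),
        RowFactory.create("D"), RowFactory.create("E"), RowFactory.create("F"));
Dataset<Row> df2 = spark.createDataFrame(metaRows, metaSchema);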
