Converting row values into a column array in spark dataframe
I am working with a Spark DataFrame. I need to group by a column and convert the column values of the grouped rows into an array, as a new column. Example:
Input:
employee | Address
------------------
Micheal | NY
Micheal | NJ
Output:
employee | Address
------------------
Micheal | (NY,NJ)
Any help is highly appreciated.
Here is an alternative solution, where I convert the DataFrame to an RDD for the transformation and then convert it back to a DataFrame using sqlContext.createDataFrame().
Sample.json
{"employee":"Michale","Address":"NY"}
{"employee":"Michale","Address":"NJ"}
{"employee":"Sam","Address":"NY"}
{"employee":"Max","Address":"NJ"}
Spark application:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val df = sqlContext.read.json("sample.json")

// Print the original DataFrame
df.show()

// Define the schema for the aggregated DataFrame
val dataSchema = new StructType(
  Array(
    StructField("employee", StringType, nullable = true),
    StructField("Address", ArrayType(StringType, containsNull = true), nullable = true)
  )
)

// Convert the df to an RDD and perform the groupBy operation
val aggregatedRdd: RDD[Row] = df.rdd.groupBy(r =>
  r.getAs[String]("employee")
).map(row =>
  // Map the grouped values to a new Row object
  Row(row._1, row._2.map(_.getAs[String]("Address")).toArray)
)

// Create a DataFrame from aggregatedRdd with the defined schema (dataSchema)
val aggregatedDf = sqlContext.createDataFrame(aggregatedRdd, dataSchema)

// Print the aggregated DataFrame
aggregatedDf.show()
Output:
+-------+--------+
|Address|employee|
+-------+--------+
|     NY| Michale|
|     NJ| Michale|
|     NY|     Sam|
|     NJ|     Max|
+-------+--------+
+--------+--------+
|employee| Address|
+--------+--------+
| Sam| [NY]|
| Michale|[NY, NJ]|
| Max| [NJ]|
+--------+--------+
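To see what the groupBy-then-collect step is doing independently of Spark, here is a minimal plain-Scala sketch of the same semantics on an in-memory list of (employee, address) pairs. The GroupAddresses object and its helper are illustrative, not part of the answer's code:

```scala
// Sketch of the group-and-collect semantics with plain Scala collections.
// Each (employee, address) pair plays the role of one Row in the RDD.
object GroupAddresses {
  def collectAddresses(rows: Seq[(String, String)]): Map[String, Seq[String]] =
    rows
      .groupBy { case (employee, _) => employee }  // like df.rdd.groupBy(...)
      .map { case (employee, pairs) =>             // like the .map(row => Row(...))
        employee -> pairs.map { case (_, address) => address }
      }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("Michale", "NY"), ("Michale", "NJ"),
      ("Sam", "NY"), ("Max", "NJ")
    )
    println(GroupAddresses.collectAddresses(rows))
  }
}
```

The RDD version differs only in that the grouped values are Row objects and the result is wrapped back into a Row so that createDataFrame can apply the schema.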
If you are using Spark 2.0+, you can use collect_list or collect_set. Your query would look something like this (assuming your DataFrame is called input):

import org.apache.spark.sql.functions._

input.groupBy('employee).agg(collect_list('Address))

Use collect_list if duplicates are acceptable. Use collect_set if you don't want duplicates and only need the unique items in the list.
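The duplicate-handling difference between the two functions can be sketched with plain Scala collections (the sample addresses below are made up for illustration):

```scala
// collect_list keeps every value per group; collect_set keeps each value once.
object ListVsSet {
  def main(args: Array[String]): Unit = {
    val addresses = Seq("NY", "NJ", "NY")
    // collect_list-like behaviour: duplicates preserved
    println(addresses)           // List(NY, NJ, NY)
    // collect_set-like behaviour: only unique values
    println(addresses.distinct)  // List(NY, NJ)
  }
}
```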
Hope this helps!