Converting row values into a column array in spark dataframe
I am working with a Spark DataFrame. I need to group by a column and convert the column values of the grouped rows into an array, as a new column. Example:
Input:
employee | Address
------------------
Micheal | NY
Micheal | NJ
Output:
employee | Address
------------------
Micheal | (NY,NJ)
Any help is highly appreciated.
Here is an alternative solution, where I convert the DataFrame to an RDD for the transformation and then convert it back to a DataFrame using sqlContext.createDataFrame().
Sample.json
{"employee":"Michale","Address":"NY"}
{"employee":"Michale","Address":"NJ"}
{"employee":"Sam","Address":"NY"}
{"employee":"Max","Address":"NJ"}
Spark application:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val df = sqlContext.read.json("sample.json")

// Print the original DataFrame
df.show()

// Define the schema for the aggregated DataFrame
val dataSchema = new StructType(
  Array(
    StructField("employee", StringType, nullable = true),
    StructField("Address", ArrayType(StringType, containsNull = true), nullable = true)
  )
)

// Convert the df to an RDD and perform the groupBy operation
val aggregatedRdd: RDD[Row] = df.rdd.groupBy(r =>
  r.getAs[String]("employee")
).map(row =>
  // Map the grouped values to a new Row object
  Row(row._1, row._2.map(_.getAs[String]("Address")).toArray)
)

// Create a DataFrame from aggregatedRdd with the defined schema (dataSchema)
val aggregatedDf = sqlContext.createDataFrame(aggregatedRdd, dataSchema)

// Print the aggregated DataFrame
aggregatedDf.show()
Output:
+-------+--------+
|Address|employee|
+-------+--------+
|     NY| Michale|
|     NJ| Michale|
|     NY|     Sam|
|     NJ|     Max|
+-------+--------+
+--------+--------+
|employee| Address|
+--------+--------+
| Sam| [NY]|
| Michale|[NY, NJ]|
| Max| [NJ]|
+--------+--------+
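To see what the groupBy-then-collect step is doing independently of Spark, here is a minimal plain-Scala sketch of the same semantics on an in-memory list of (employee, address) pairs. The GroupAddresses object and its helper are illustrative, not part of the answer's code:

```scala
// Sketch of the group-and-collect semantics with plain Scala collections.
// Each (employee, address) pair plays the role of one Row in the RDD.
object GroupAddresses {
  def collectAddresses(rows: Seq[(String, String)]): Map[String, Seq[String]] =
    rows
      .groupBy { case (employee, _) => employee }  // like df.rdd.groupBy(...)
      .map { case (employee, pairs) =>             // like the .map(row => Row(...))
        employee -> pairs.map { case (_, address) => address }
      }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("Michale", "NY"), ("Michale", "NJ"),
      ("Sam", "NY"), ("Max", "NJ")
    )
    println(GroupAddresses.collectAddresses(rows))
  }
}
```

The RDD version differs only in that the grouped values are Row objects and the result is wrapped back into a Row so that createDataFrame can apply the schema.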
If you are using Spark 2.0+, you can use collect_list or collect_set. Your query would look something like this (assuming your DataFrame is called input):

import org.apache.spark.sql.functions._

input.groupBy('employee).agg(collect_list('Address))

Use collect_list if duplicates are acceptable. Use collect_set if you don't want duplicates and only need the unique items in the list.
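The duplicate-handling difference between the two functions can be sketched with plain Scala collections (the sample addresses below are made up for illustration):

```scala
// collect_list keeps every value per group; collect_set keeps each value once.
object ListVsSet {
  def main(args: Array[String]): Unit = {
    val addresses = Seq("NY", "NJ", "NY")
    // collect_list-like behaviour: duplicates preserved
    println(addresses)           // List(NY, NJ, NY)
    // collect_set-like behaviour: only unique values
    println(addresses.distinct)  // List(NY, NJ)
  }
}
```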
Hope this helps!