Extract a column value and assign it to another column as an array in Spark dataframe

I have a Spark DataFrame with the following columns:

C1 | C2 | C3| C4
1  | 2  | 3 | S1
2  | 3  | 3 | S2
4  | 5  | 3 | S2

I want to generate another column, C5, that holds the distinct values of column C4 as an array on every row, like this:

[S1,S2]
[S1,S2]
[S1,S2]

Can somebody help me achieve this in a Spark DataFrame using Scala?

You can first collect the distinct values of column C4 into a List, and then use withColumn to create the new column C5 from a udf that always returns that constant list:

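import org.apache.spark.sql.functions.{col, udf}

// Collect the distinct C4 values to the driver, then turn them into a List.
val uniqueVal = df.select("C4").distinct().collect().map(_.getString(0)).toList
// A function that ignores its input and always returns the same list.
val myfun: String => List[String] = _ => uniqueVal
val myfun_udf = udf(myfun)

df.withColumn("C5", myfun_udf(col("C4"))).show()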

+---+---+---+---+--------+
| C1| C2| C3| C4|      C5|
+---+---+---+---+--------+
|  1|  2|  3| S1|[S2, S1]|
|  2|  3|  3| S2|[S2, S1]|
|  4|  5|  3| S2|[S2, S1]|
+---+---+---+---+--------+
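
As a possible alternative to the udf, the same constant array can be built with built-in column functions. The sketch below is not part of the original answer. It assumes a local SparkSession, and the object name and the `withLiteral`, `distinctRow`, and `withJoin` values are hypothetical names chosen for illustration. It shows two variants: building a literal array column from the collected values, and cross joining against a one-row aggregate so nothing is collected to the driver.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, collect_set, lit, sort_array}

object DistinctAsArrayColumn {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-as-array")
      .getOrCreate()
    import spark.implicits._

    // Sample data taken from the question.
    val df = Seq((1, 2, 3, "S1"), (2, 3, 3, "S2"), (4, 5, 3, "S2"))
      .toDF("C1", "C2", "C3", "C4")

    // Variant 1: collect the distinct values on the driver, sort them,
    // and attach them as a literal array column (no udf involved).
    val uniqueVal = df.select("C4").distinct().as[String].collect().sorted
    val withLiteral = df.withColumn("C5", array(uniqueVal.map(v => lit(v)): _*))

    // Variant 2: aggregate the distinct values into a one-row DataFrame
    // and cross join it back, so the values never leave the cluster.
    val distinctRow = df.agg(sort_array(collect_set("C4")).as("C5"))
    val withJoin = df.crossJoin(distinctRow)

    withLiteral.show()
    withJoin.show()

    spark.stop()
  }
}
```

Sorting the values is a deliberate choice here: distinct() gives no ordering guarantee, which is why the output above shows [S2, S1] rather than [S1, S2]. Variant 1 only makes sense when the number of distinct values is small enough to hold on the driver.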
