
Select distinct (on one column), not-null (on all other columns) values from a DataFrame in Apache Spark

I have the following DataFrame:

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     null|   null|
| 34|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+

I want the output to look like this:

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+

You can see the age column contains 34 twice, so I want to merge the two rows for age 34, taking the non-null value of each column from whichever row has one.

Thanks

If the first non-null value in each group is sufficient, this can be achieved with the `first` function:

import org.apache.spark.sql.functions.first
import spark.implicits._ // for toDF on a local Seq

val df = Seq(
  (50, Some(2), None, None),
  (34, Some(4), None, None),
  (34, None, Some(true), Some(60000.0)),
  (32, None, Some(false), Some(35000.0))
).toDF("age", "children", "education", "income")

val result = df
  .groupBy("age")
  .agg(
    first("children", ignoreNulls = true).alias("children"),
    first("education", ignoreNulls = true).alias("education"),
    first("income", ignoreNulls = true).alias("income")
  )
result.orderBy("age").show(false)

Output:

+---+--------+---------+-------+
|age|children|education|income |
+---+--------+---------+-------+
|32 |null    |false    |35000.0|
|34 |4       |true     |60000.0|
|50 |2       |null     |null   |
+---+--------+---------+-------+
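
If the DataFrame has many columns, listing each one in `agg` by hand gets tedious. The aggregation expressions can instead be built programmatically from `df.columns`. A minimal sketch, assuming the same `df` as above and that `age` is the only grouping key:

```scala
import org.apache.spark.sql.functions.first

// Apply first(..., ignoreNulls = true) to every column except the grouping key.
val aggExprs = df.columns
  .filter(_ != "age")
  .map(c => first(c, ignoreNulls = true).alias(c))

val merged = df.groupBy("age").agg(aggExprs.head, aggExprs.tail: _*)
merged.orderBy("age").show(false)
```

Note that `first` is non-deterministic unless the rows within each group have a defined order, so if several rows share the same age and each has a different non-null value for a column, which value wins is not guaranteed.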

