
Select distinct (on one column), not-null (on all other columns) values from a DataFrame in Apache Spark

I have the following DataFrame:

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     null|   null|
| 34|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+

I want the output to look like this:

+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+

You can see the age column contains 34 twice, so I want to merge the two rows for age 34, taking the non-null value of each column from whichever row has one.

Thanks

If the first non-null value in each group is sufficient, this can be achieved with the `first` function:

import org.apache.spark.sql.functions.first
import spark.implicits._ // for toDF on a local Seq

val df = Seq(
  (50, Some(2), None, None),
  (34, Some(4), None, None),
  (34, None, Some(true), Some(60000.0)),
  (32, None, Some(false), Some(35000.0))
).toDF("age", "children", "education", "income")

val result = df
  .groupBy("age")
  .agg(
    first("children", ignoreNulls = true).alias("children"),
    first("education", ignoreNulls = true).alias("education"),
    first("income", ignoreNulls = true).alias("income")
  )
result.orderBy("age").show(false)

Output:

+---+--------+---------+-------+
|age|children|education|income |
+---+--------+---------+-------+
|32 |null    |false    |35000.0|
|34 |4       |true     |60000.0|
|50 |2       |null     |null   |
+---+--------+---------+-------+
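
If the DataFrame has many columns, listing each one in `agg` by hand gets tedious. The aggregation expressions can instead be built programmatically from `df.columns`. A minimal sketch, assuming the same `df` as above and that `age` is the only grouping key:

```scala
import org.apache.spark.sql.functions.first

// Apply first(..., ignoreNulls = true) to every column except the grouping key.
val aggExprs = df.columns
  .filter(_ != "age")
  .map(c => first(c, ignoreNulls = true).alias(c))

val merged = df.groupBy("age").agg(aggExprs.head, aggExprs.tail: _*)
merged.orderBy("age").show(false)
```

Note that `first` is non-deterministic unless the rows within each group have a defined order, so if several rows share the same age and each has a different non-null value for a column, which value wins is not guaranteed.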

