select distinct (on one column) not null (on all other columns) value from Dataframe in apache spark
I have the following DataFrame:
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     null|   null|
| 34|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
I want the output to look like this:
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
As you can see, the age column contains a duplicate (34), so I want to merge the two age-34 rows, keeping the non-null value from each column rather than the nulls from the other row.
Thanks.
You can use the first function with ignoreNulls = true to take the first non-null value of each column within a group:
// In spark-shell, `spark` and its implicits are already in scope;
// in a standalone application you need these imports:
import org.apache.spark.sql.functions.first
import spark.implicits._

val df = Seq(
  (50, Some(2), None, None),
  (34, Some(4), None, None),
  (34, None, Some(true), Some(60000.0)),
  (32, None, Some(false), Some(35000.0))
).toDF("age", "children", "education", "income")

// Group by age and keep the first non-null value of every other column.
val result = df
  .groupBy("age")
  .agg(
    first("children", ignoreNulls = true).alias("children"),
    first("education", ignoreNulls = true).alias("education"),
    first("income", ignoreNulls = true).alias("income")
  )

result.orderBy("age").show(false)
Output:
+---+--------+---------+-------+
|age|children|education|income |
+---+--------+---------+-------+
|32 |null |false |35000.0|
|34 |4 |true |60000.0|
|50 |2 |null |null |
+---+--------+---------+-------+
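To see exactly what first(col, ignoreNulls = true) computes per group, here is a plain-Scala sketch of the same "first non-null value per column, per age group" logic on the sample data. It needs no Spark; the Row case class and field names are illustrative, chosen to mirror the DataFrame columns:

```scala
// Plain-Scala illustration of groupBy + first-non-null aggregation.
// Option models SQL null: None ~ null, Some(x) ~ a present value.
case class Row(age: Int, children: Option[Int], education: Option[Boolean], income: Option[Double])

val rows = Seq(
  Row(50, Some(2), None, None),
  Row(34, Some(4), None, None),
  Row(34, None, Some(true), Some(60000.0)),
  Row(32, None, Some(false), Some(35000.0))
)

// For each age, take the first defined value of each column --
// the collection analogue of first(col, ignoreNulls = true) inside agg.
val merged = rows
  .groupBy(_.age)
  .map { case (age, group) =>
    Row(
      age,
      group.flatMap(_.children).headOption,
      group.flatMap(_.education).headOption,
      group.flatMap(_.income).headOption
    )
  }
  .toSeq
  .sortBy(_.age)

merged.foreach(println)
```

The two age-34 rows collapse into Row(34, Some(4), Some(true), Some(60000.0)), matching the Spark output above. Note that in a distributed Spark job, unlike this in-memory sketch, row order within a group is not guaranteed, so "first" picks an arbitrary non-null value when a group contains several.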