Get the first value of a column with a condition when grouping a Spark DataFrame

First, I apologize if my English is poor. I am a Spark beginner. I have a DataFrame "raw":

+------------------------+----+------------------------+---+------+
|id                      |name|phone                   |sex|source|
+------------------------+----+------------------------+---+------+
|gEzIl5K+6n6GPLD0pAQKFA==|alex|na                      |M  |1     |
|gEzIl5K+6n6GPLD0pAQKFA==|alex|+Uy8Ol77OWiSuuapn5FOUg==|na |2     |
+------------------------+----+------------------------+---+------+

'na': the default value for string columns
source: the priority, 1 > 2 (source 1 takes precedence over source 2)

The result I expect:

+------------------------+----+------------------------+---+------+
|id                      |name|phone                   |sex|source|
+------------------------+----+------------------------+---+------+
|gEzIl5K+6n6GPLD0pAQKFA==|alex|+Uy8Ol77OWiSuuapn5FOUg==|M  |1     |
+------------------------+----+------------------------+---+------+

I tried:

val rs = raw.orderBy(source)
  .groupBy(col("id"))
  .agg(
    first(when(col("phone") === "na" || col("phone") === "", col("phone"))).as("phone"),
    first(when(col("sex") === "na" || col("sex") === "", col("sex"))).as("sex"),
    first(when(col("source") === "na" || col("source") === "", col("source"))).as("source")
  )

But it does not give the right result. I hope someone can help. Many thanks!

Try this.

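import org.apache.spark.sql.functions.{col, min, when}
import spark.implicits._ // enables the 'phone symbol-to-column syntax

// df is the question's "raw" DataFrame.
// when(...) without otherwise(...) yields null for "na" and "", and min skips nulls.
df.orderBy("source")
  .groupBy(col("id"))
  .agg(min(when(!'phone.isin("na", ""), 'phone)).as("phone"),
    min(when(!'sex.isin("na", ""), 'sex)).as("sex"),
    min(when(!'source.isin("na", ""), 'source)).as("source"))
  .show()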

+--------------------+--------------------+---+------+
|                  id|               phone|sex|source|
+--------------------+--------------------+---+------+
|gEzIl5K+6n6GPLD0p...|+Uy8Ol77OWiSuuapn...|  M|     1|
+--------------------+--------------------+---+------+
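
One caveat about the snippet above: min returns the smallest non-null value in the group, which for strings is the lexicographically smallest one, and that is not necessarily the value coming from the highest-priority source. It works for the two sample rows because each column has only one valid value. If several sources can carry valid values, a possible variant is to take the min of a (source, value) struct, so that the row with the lowest source number wins. The following is only a sketch under some assumptions: source is taken to be an integer column, and the names FirstValidByPriority, resolve and firstValid are made up for this illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, min, struct, when}

// Hypothetical helper object, not part of the original question or answer.
object FirstValidByPriority {

  // For one column: keep only valid values (not "na", not ""), pair each with
  // its source, and take the smallest pair. Structs compare field by field,
  // so the pair with the lowest source number is selected.
  def firstValid(name: String): Column =
    min(when(!col(name).isin("na", ""), struct(col("source"), col(name))))
      .getField(name)
      .as(name)

  // Collapse all rows sharing an id into one row.
  def resolve(raw: DataFrame): DataFrame =
    raw.groupBy(col("id"))
      .agg(
        firstValid("name"),
        firstValid("phone"),
        firstValid("sex"),
        min(col("source")).as("source") // assumption: source is numeric
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-valid-by-priority")
      .getOrCreate()
    import spark.implicits._

    // The two sample rows from the question.
    val raw = Seq(
      ("gEzIl5K+6n6GPLD0pAQKFA==", "alex", "na", "M", 1),
      ("gEzIl5K+6n6GPLD0pAQKFA==", "alex", "+Uy8Ol77OWiSuuapn5FOUg==", "na", 2)
    ).toDF("id", "name", "phone", "sex", "source")

    resolve(raw).show(false)
    spark.stop()
  }
}
```

Because the ordering is carried inside the aggregated struct, this variant does not rely on orderBy before groupBy, whose row order is not guaranteed to survive the shuffle that the aggregation performs.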
