
I have a DataFrame as below and want to add remarks based on the column values using Scala

Below is my input:

id    val  visits  date
111   2        1   20160122
111   2        1   20170122
112   4        2   20160122
112   5        4   20150122
113   6        1   20100120
114   8        2   20150122
114   8        2   20150122

Expected output:

id    val  visits  date        remarks
111   2        1   20160122    oldDate
111   2        1   20170122    recentdate
112   4        2   20160122    less
112   5        4   20150122    more
113   6        1   20100120    one
114   8        2   20150122    Random
114   8        2   20150122    Random

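The remarks should be:
Random: the ID has two records with the same value, visits, and date.
one: the ID has only one record, with any number of visits.
less: the ID has two records, and this record has fewer visits than the other.
more: the ID has more than one record, with different values and visits.
recentdate: the ID has more than one record with the same value and visits but different dates, and this record has the latest date (the other record is marked oldDate, as in the expected output).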
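The rules above can be sketched in plain Scala, without Spark, to make the intended labelling explicit. This is only an illustration: the `Rec` case class and the `label` function are hypothetical names, and treating a record as "less" whenever its visits are below the group's maximum is an assumption borrowed from the attempted code further down.

```scala
// Hypothetical record type mirroring the input columns.
final case class Rec(id: Int, value: Int, visits: Int, date: Int)

// Label every record of a single id according to the rules above.
def label(group: Seq[Rec]): Seq[(Rec, String)] = group match {
  case Seq(only) => Seq(only -> "one")
  case Seq(a, b) if a == b => group.map(_ -> "Random")
  case Seq(a, b) if a.value == b.value && a.visits == b.visits =>
    // Same value and visits: only the date differs, so the later one is recent.
    val latest = math.max(a.date, b.date)
    group.map(r => r -> (if (r.date == latest) "recentdate" else "oldDate"))
  case Seq(a, b) =>
    // Assumption: compare visits only, as the attempted code does.
    val maxVisits = math.max(a.visits, b.visits)
    group.map(r => r -> (if (r.visits < maxVisits) "less" else "more"))
  case _ => group.map(_ -> "not defined") // the rules do not cover other group sizes
}

val input = Seq(
  Rec(111, 2, 1, 20160122), Rec(111, 2, 1, 20170122),
  Rec(112, 4, 2, 20160122), Rec(112, 5, 4, 20150122),
  Rec(113, 6, 1, 20100120),
  Rec(114, 8, 2, 20150122), Rec(114, 8, 2, 20150122)
)

// Group by id, label each group, and flatten back in id order.
val labelled = input.groupBy(_.id).toSeq.sortBy(_._1).flatMap { case (_, g) => label(g) }
```

On the sample input this reproduces the remarks column of the expected output, row for row.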

Code:

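import org.apache.spark.sql.functions._
// Assumes `import spark.implicits._` is in scope for the $"..." column syntax.

// Per-id aggregates used to decide the remark.
val grouped = df.groupBy("id").agg(max($"val").as("maxVal"), max($"visits").as("maxVisits"), min($"val").as("minVal"), min($"visits").as("minVisits"), count($"id").as("count"))

val remarks = udf ((value: Int, visits: Int, maxValue: Int, maxVisits: Int, minValue: Int, minVisits: Int, count: Int) =>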
   if (count == 1) {
     "One Visit"
   }else if (value == maxValue && value == minValue && visits == maxVisits && visits == minVisits) {
     "Random"
   }else {
     if (visits < maxVisits) {
       "Less Visits"
     }else {
       "More Visits"
     }
   }
 )



df.join(grouped, Seq("id"))
   .withColumn("remarks", remarks($"val", $"visits", $"maxVal", $"maxVisits", $"minVal", $"minVisits", $"count"))
   .drop("maxVal","maxVisits", "minVal", "minVisits", "count")

The code below should work for you (though it is not efficient; there are many other ways to do this).

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Copies a (val, visits, date) struct into the `remarks` case class (defined below) and attaches a remark.
def tag(r: Row, remark: String): remarks = remarks(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2), remark)

// Takes all (val, visits, date) structs of one id and returns them with a remark each.
def remarkUdf = udf((column: Seq[Row]) => {
  if (column.size == 1) Seq(tag(column(0), "one"))
  else if (column.size == 2) {
    val (a, b) = (column(0), column(1))
    if (a == b) column.map(tag(_, "Random"))
    else if (a.getAs[Int](0) == b.getAs[Int](0) && a.getAs[Int](1) == b.getAs[Int](1)) {
      // Same val and visits: the record with the earlier date is the old one.
      if (a.getAs[Int](2) < b.getAs[Int](2)) Seq(tag(a, "oldDate"), tag(b, "recentdate"))
      else Seq(tag(a, "recentdate"), tag(b, "oldDate"))
    }
    // Different val or visits: the first record is "less" only if both are smaller.
    else if (a.getAs[Int](0) < b.getAs[Int](0) && a.getAs[Int](1) < b.getAs[Int](1)) Seq(tag(a, "less"), tag(b, "more"))
    else Seq(tag(a, "more"), tag(b, "less"))
  }
  else column.map(tag(_, "not defined")) // groups of three or more records
})

df.groupBy("id").agg(collect_list(struct("val", "visits", "date")).as("value"))
  .withColumn("value", explode(remarkUdf(col("value"))))
  .select(col("id"), col("value.*"))
  .show(false)

It should give you:

+---+-----+------+--------+----------+
|id |value|Visits|date    |Remarks   |
+---+-----+------+--------+----------+
|111|2    |1     |20160122|oldDate   |
|111|2    |1     |20170122|recentdate|
|112|4    |2     |20160122|less      |
|112|5    |4     |20150122|more      |
|114|8    |2     |20150122|Random    |
|114|8    |2     |20150122|Random    |
|113|6    |1     |20100120|one       |
+---+-----+------+--------+----------+

You also need the following case class:

case class remarks(value: Int, Visits: Int, date: Int, Remarks: String)
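Since the answer mentions that other approaches exist, here is a minimal sketch of one of them that uses window functions instead of `collect_list` and a UDF. It is an illustration under assumptions, not the answerer's method: the names `byId` and `withRemarks` are made up, a local `SparkSession` is created only to make the example self-contained, and groups of any size other than one or two are labelled "not defined" to mirror the UDF's fallback.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumption: a local session, created here only so the sketch runs on its own.
val spark = SparkSession.builder().master("local[*]").appName("remarks").getOrCreate()
import spark.implicits._

val df = Seq(
  (111, 2, 1, 20160122), (111, 2, 1, 20170122),
  (112, 4, 2, 20160122), (112, 5, 4, 20150122),
  (113, 6, 1, 20100120),
  (114, 8, 2, 20150122), (114, 8, 2, 20150122)
).toDF("id", "val", "visits", "date")

// All aggregates are computed per id, without collapsing the rows.
val byId = Window.partitionBy("id")

val withRemarks = df
  .withColumn("cnt", count(lit(1)).over(byId))
  .withColumn("minVal", min($"val").over(byId))
  .withColumn("maxVal", max($"val").over(byId))
  .withColumn("minVisits", min($"visits").over(byId))
  .withColumn("maxVisits", max($"visits").over(byId))
  .withColumn("minDate", min($"date").over(byId))
  .withColumn("maxDate", max($"date").over(byId))
  .withColumn("sameValAndVisits", $"minVal" === $"maxVal" && $"minVisits" === $"maxVisits")
  .withColumn("remarks",
    when($"cnt" === 1, "one")
      .when($"cnt" =!= 2, "not defined")
      .when($"sameValAndVisits" && $"minDate" === $"maxDate", "Random")
      .when($"sameValAndVisits",
        when($"date" === $"maxDate", "recentdate").otherwise("oldDate"))
      // Assumption: "less" is decided on visits alone.
      .when($"visits" < $"maxVisits", "less")
      .otherwise("more"))
  .drop("cnt", "minVal", "maxVal", "minVisits", "maxVisits", "minDate", "maxDate", "sameValAndVisits")

withRemarks.orderBy("id", "date").show(false)
```

This keeps the original column names and avoids shuffling whole groups through a UDF, at the cost of a few temporary columns that are dropped at the end.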
