I have a DataFrame as below and want to add remarks based on the column values, using Scala.
Below is my input:
id val visits date
111 2 1 20160122
111 2 1 20170122
112 4 2 20160122
112 5 4 20150122
113 6 1 20100120
114 8 2 20150122
114 8 2 20150122
Expected output:
id val visits date remarks
111 2 1 20160122 oldDate
111 2 1 20170122 recentdate
112 4 2 20160122 less
112 5 4 20150122 more
113 6 1 20100120 one
114 8 2 20150122 Random
114 8 2 20150122 Random
The remarks should be:
Random: the id has two records with the same val, visits and date.
one: the id has only a single record, whatever its number of visits.
less / more: the id has two records with different val and visits; the record with the smaller val and fewer visits gets "less", the other gets "more".
oldDate / recentdate: the id has two records with the same val and visits but different dates; the record with the larger date gets "recentdate", the other "oldDate".
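In plain Scala, these rules can be sketched as a small classification function over one id's records, here as `(val, visits, date)` tuples (a minimal sketch with a hypothetical `classify` name, independent of the Spark code):

```scala
// Hypothetical helper: maps one id's (val, visits, date) records to remarks,
// following the rules described above. Pure Scala, no Spark needed.
def classify(recs: Seq[(Int, Int, Int)]): Seq[String] = recs match {
  case Seq(_)              => Seq("one")                // single record
  case Seq(a, b) if a == b => Seq("Random", "Random")   // exact duplicates
  case Seq(a, b) if a._1 == b._1 && a._2 == b._2 =>     // same val & visits, dates differ
    if (a._3 < b._3) Seq("oldDate", "recentdate") else Seq("recentdate", "oldDate")
  case Seq(a, b) =>                                     // val and visits differ
    if (a._1 < b._1 && a._2 < b._2) Seq("less", "more") else Seq("more", "less")
  case _ => recs.map(_ => "not defined")                // more than two records: undefined
}
```

For example, `classify(Seq((2, 1, 20160122), (2, 1, 20170122)))` yields `Seq("oldDate", "recentdate")`, matching the expected output for id 111.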
My code so far:
import org.apache.spark.sql.functions._

val grouped = df.groupBy("id").agg(
  max($"val").as("maxVal"), max($"visits").as("maxVisits"),
  min($"val").as("minVal"), min($"visits").as("minVisits"),
  count($"id").as("count"))

val remarks = udf((value: Int, visits: Int, maxValue: Int, maxVisits: Int, minValue: Int, minVisits: Int, count: Int) =>
  if (count == 1) "One Visit"
  else if (value == maxValue && value == minValue && visits == maxVisits && visits == minVisits) "Random"
  else if (visits < maxVisits) "Less Visits"
  else "More Visits"
)

df.join(grouped, Seq("id"))
  .withColumn("remarks", remarks($"val", $"visits", $"maxVal", $"maxVisits", $"minVal", $"minVisits", $"count"))
  .drop("maxVal", "maxVisits", "minVal", "minVisits", "count")
The following code should work for you (it is not the most efficient approach, as there are many other ways to do this):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def remarkUdf = udf((rows: Seq[Row]) => {
  // Each Row holds (val, visits, date) for one id; attach the matching remark.
  def withRemark(r: Row, remark: String) =
    remarks(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2), remark)

  if (rows.size == 1) Seq(withRemark(rows(0), "one"))
  else if (rows.size == 2) {
    if (rows(0) == rows(1)) rows.map(withRemark(_, "Random"))
    else if (rows(0).getAs[Int](0) == rows(1).getAs[Int](0) &&
             rows(0).getAs[Int](1) == rows(1).getAs[Int](1)) {
      // same val and visits: the smaller date is oldDate, the larger recentdate
      if (rows(0).getAs[Int](2) < rows(1).getAs[Int](2))
        Seq(withRemark(rows(0), "oldDate"), withRemark(rows(1), "recentdate"))
      else
        Seq(withRemark(rows(0), "recentdate"), withRemark(rows(1), "oldDate"))
    } else {
      // different val and visits: the smaller pair gets less, the larger more
      if (rows(0).getAs[Int](0) < rows(1).getAs[Int](0) &&
          rows(0).getAs[Int](1) < rows(1).getAs[Int](1))
        Seq(withRemark(rows(0), "less"), withRemark(rows(1), "more"))
      else
        Seq(withRemark(rows(0), "more"), withRemark(rows(1), "less"))
    }
  }
  else rows.map(withRemark(_, "not defined"))
})
df.groupBy("id").agg(collect_list(struct("val", "visits", "date")).as("value"))
.withColumn("value", explode(remarkUdf(col("value"))))
.select(col("id"), col("value.*"))
.show(false)
It should give you:
+---+-----+------+--------+----------+
|id |value|Visits|date |Remarks |
+---+-----+------+--------+----------+
|111|2 |1 |20160122|oldDate |
|111|2 |1 |20170122|recentdate|
|112|4 |2 |20160122|less |
|112|5 |4 |20150122|more |
|114|8 |2 |20150122|Random |
|114|8 |2 |20150122|Random |
|113|6 |1 |20100120|one |
+---+-----+------+--------+----------+
And you need the following case class:
case class remarks(value: Int, Visits: Int, date: Int, Remarks: String)