
How to add a remarks column using Scala

I have a DataFrame as below and want to add remarks using Scala:

id    val  visits 
111   2        1
111   2        1
112   4        2
112   5        4
113   6        1

The expected output is below:

id    val    visits   remarks
111   2        1      Random
111   2        1      Random
112   4        2      Less visit
112   5        4      More visit
113   6        1      One visit

Remarks should be:
Random: the Id has two records with the same value and visits
One Visit: the Id has only one record, with any number of visits
Less Visit: the Id has two records, and this record has fewer visits than the other
More Visit: the Id has more than one record with different values and visits

It may not be the best solution, but it's a working one:

First group your DataFrame by id and compute the max and min of val and visits, along with the number of rows per id:

val grouped = df.groupBy("id").agg(
  max($"val").as("maxVal"),
  max($"visits").as("maxVisits"),
  min($"val").as("minVal"),
  min($"visits").as("minVisits"),
  count($"id").as("count")
)
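To see what this aggregation yields for the sample data without starting Spark, here is a rough equivalent on plain Scala collections. It is only an illustration: the `Stats` case class and the `rows` and `stats` names are hypothetical, chosen to mirror the column aliases above.

```scala
// Sample rows from the question as (id, val, visits) tuples.
val rows = Seq((111, 2, 1), (111, 2, 1), (112, 4, 2), (112, 5, 4), (113, 6, 1))

// Hypothetical holder for the per-id aggregates; field names follow the aliases used above.
case class Stats(maxVal: Int, maxVisits: Int, minVal: Int, minVisits: Int, count: Int)

// Group by id, then compute max/min of val and visits plus the row count.
val stats: Map[Int, Stats] = rows.groupBy(_._1).map { case (id, rs) =>
  val vals = rs.map(_._2)
  val visits = rs.map(_._3)
  id -> Stats(vals.max, visits.max, vals.min, visits.min, rs.size)
}

stats.toSeq.sortBy(_._1).foreach { case (id, s) => println(s"$id -> $s") }
// 111 -> Stats(2,1,2,1,2)
// 112 -> Stats(5,4,4,2,2)
// 113 -> Stats(6,1,6,1,1)
```

Each id ends up with one row of aggregates, which is what gets joined back onto the original rows later.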

Then define a UDF that implements your logic:

val remarks = functions.udf((value: Int, visits: Int, maxValue: Int, maxVisits: Int, minValue: Int, minVisits: Int, count: Int) =>
  if (count == 1) {
    // Only one record for this id
    "One Visit"
  } else if (value == maxValue && value == minValue && visits == maxVisits && visits == minVisits) {
    // Every record for this id has the same val and visits
    "Random"
  } else if (visits < maxVisits) {
    // Another record for this id has more visits
    "Less Visits"
  } else {
    "More Visits"
  }
)
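The branching can be checked without Spark by writing the same conditions as an ordinary function. This is just a sketch for testing the logic; the name `classify` is hypothetical and not part of the answer's code.

```scala
// Same branching as the UDF body, as an ordinary function (the name `classify` is hypothetical).
def classify(value: Int, visits: Int, maxValue: Int, maxVisits: Int,
             minValue: Int, minVisits: Int, count: Int): String =
  if (count == 1) "One Visit"
  else if (value == maxValue && value == minValue && visits == maxVisits && visits == minVisits) "Random"
  else if (visits < maxVisits) "Less Visits"
  else "More Visits"

// Rows from the question, each paired with the aggregates of its id.
println(classify(2, 1, 2, 1, 2, 1, 2)) // Random
println(classify(4, 2, 5, 4, 4, 2, 2)) // Less Visits
println(classify(5, 4, 5, 4, 4, 2, 2)) // More Visits
println(classify(6, 1, 6, 1, 6, 1, 1)) // One Visit
```

Note that the order of the checks matters: the single-record case has to come first, otherwise an id with one row would also satisfy the "all values equal" condition and be labelled Random.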

Then join the original DataFrame with the grouped one on id and add the desired column with the UDF. Finally, drop the helper columns from the output:

 df.join(grouped, Seq("id"))
   .withColumn("remarks", remarks($"val", $"visits", $"maxVal", $"maxVisits", $"minVal", $"minVisits", $"count"))
   .drop("maxVal","maxVisits", "minVal", "minVisits", "count")

Output:

+---+----+-------+-----------+
| id| val| visits|    remarks|
+---+----+-------+-----------+
|112|   4|      2|Less Visits|
|112|   5|      4|More Visits|
|113|   6|      1|  One Visit|
|111|   2|      1|     Random|
|111|   2|      1|     Random|
+---+----+-------+-----------+

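PS: remember to import functions. The unqualified max, min and count calls need their own import as well, and the $"..." column syntax needs import spark.implicits._ (assuming your SparkSession is named spark):

import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{count, max, min}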
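Since the approach above is described as possibly not the best one, here is a sketch of an alternative that avoids both the join and the UDF by using window functions with when/otherwise. This is a different technique from the answer's, not a restatement of it. It assumes a local SparkSession and integer columns; the names `byId`, `allSame` and `withRemarks` are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit, max, min, when}

// Assumption: a local session for experimentation; reuse your existing one in a real job.
val spark = SparkSession.builder().master("local[*]").appName("remarks-sketch").getOrCreate()
import spark.implicits._

// Sample data from the question.
val df = Seq((111, 2, 1), (111, 2, 1), (112, 4, 2), (112, 5, 4), (113, 6, 1))
  .toDF("id", "val", "visits")

// Per-id window: aggregates are attached to every row, so no join is needed.
val byId = Window.partitionBy("id")

// True when all rows of the id share the same val and the same visits.
val allSame = max($"val").over(byId) === min($"val").over(byId) &&
  max($"visits").over(byId) === min($"visits").over(byId)

// Same order of checks as the UDF: single record, all equal, fewer visits, otherwise more.
val withRemarks = df.withColumn("remarks",
  when(count(lit(1)).over(byId) === 1, "One Visit")
    .when(allSame, "Random")
    .when($"visits" < max($"visits").over(byId), "Less Visits")
    .otherwise("More Visits"))

withRemarks.show()
```

Built-in column expressions like these are visible to Spark's optimizer, whereas a Scala UDF is opaque to it, so this form is often preferred when the logic can be expressed without custom code.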
