Calculating column value in current row of Spark Dataframe based on the calculated value of a different column in previous row using Scala
I need some help with a Spark Dataframe.
The original dataframe df looks like this:
+---+---------+-------+---+------+
| id| tim| price|qty|qtyChg|
+---+---------+-------+---+------+
| 1|31951.509| 0.370| 1| 1|
| 2|31951.515|145.380|100| 100|
| 3|31951.519|149.370|100| 100|
| 4|31951.520|144.370|100| 100|
| 5|31951.520|149.370|300| 200|
| 6|31951.520|119.370| 5| 5|
| 7|31951.521|149.370|400| 100|
| 8|31951.522|109.370| 50| 50|
| 9|31951.522|149.370|410| 10|
| 10|31951.522|144.370|400| 300|
| 11|31951.522|149.870| 50| 50|
| 12|31951.524|149.370|610| 200|
| 13|31951.526|135.130| 22| 22|
| 14|31951.527|149.370|750| 140|
| 15|31951.528| 89.370|100| 100|
| 16|31951.528|145.870| 50| 50|
| 17|31951.528|139.370|100| 100|
| 18|31951.531|149.370|769| 19|
| 19|31951.531|144.370|410| 10|
| 20|31951.538|149.370|869| 100|
+---+---------+-------+---+------+
I added two columns, top1price and top2price, with the following code:
val ww = Window.partitionBy().orderBy($"tim")
val newdf = df.withColumn("sequence", sort_array(collect_set(col("price")).over(ww), asc = false))
.withColumn("top1price", col("sequence").getItem(0))
.withColumn("top2price", col("sequence").getItem(1))
.drop("sequence")
newdf looks like this:
+---+---------+-------+---+------+---------+---------+
| id| tim| price|qty|qtyChg|top1price|top2price|
+---+---------+-------+---+------+---------+---------+
| 1|31951.509| 0.370| 1| 1| 0.370| null|
| 2|31951.515|145.380|100| 100| 145.380| 0.370|
| 3|31951.519|149.370|100| 100| 149.370| 145.380|
| 4|31951.520|119.370| 5| 5| 149.370| 145.380|
| 5|31951.520|144.370|100| 100| 149.370| 145.380|
| 6|31951.520|149.370|300| 200| 149.370| 145.380|
| 7|31951.521|149.370|400| 100| 149.370| 145.380|
| 8|31951.522|109.370| 50| 50| 149.870| 149.370|
| 9|31951.522|144.370|400| 300| 149.870| 149.370|
| 10|31951.522|149.370|410| 10| 149.870| 149.370|
| 11|31951.522|149.870| 50| 50| 149.870| 149.370|
| 12|31951.524|149.370|610| 200| 149.870| 149.370|
| 13|31951.526|135.130| 22| 22| 149.870| 149.370|
| 14|31951.527|149.370|750| 140| 149.870| 149.370|
| 15|31951.528| 89.370|100| 100| 149.870| 149.370|
| 16|31951.528|139.370|100| 100| 149.870| 149.370|
| 17|31951.528|145.870| 50| 50| 149.870| 149.370|
| 18|31951.531|144.370|410| 10| 149.870| 149.370|
| 19|31951.531|149.370|769| 19| 149.870| 149.370|
| 20|31951.538|144.880|200| 200| 149.870| 149.370|
+---+---------+-------+---+------+---------+---------+
The logic behind top1price is the highest price seen so far at each moment. For example, at time 31951.520 (id = 6), the highest price up to that point is 149.370, which comes from the row with id = 6. The second-highest price at that moment is 145.38, from row 2. I would like to add two more columns, top1priceQty and top2priceQty. Using the same example: in row 6, the highest price is 149.370 and its corresponding quantity is 300, also from row 6; top2price is 145.380 and its top2priceQty is 100, from row 2.
| 8|31951.522|109.370| 50| 50| 149.870| 149.370|
For row 8, top1price is 149.870, which comes from row 11, because rows 8 through 11 belong to the same moment. So at that point 149.870 is the highest price, and its corresponding top1priceQty would be 50, also from row 11. top2price is 149.370, which comes from row 7, so the corresponding top2priceQty is 400, also from row 7.
Thanks in advance!
This is an interesting question. I would like to mention a few points here.
// Let's create a sample DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq((1, 31951.509, 0.370, 1, 1),
(2, 31951.515, 145.380, 100,100),
(3, 31951.519, 149.370, 100, 100),
(4, 31951.520, 144.370, 100, 100),
(5, 31951.520, 149.370, 300, 200),
(6, 31951.520, 119.370, 5, 5))
.toDF("id", "tim", "price", "qty", "qtyChg")
.orderBy("id")
// Define the window specification, which starts at the beginning (Window.unboundedPreceding) and ends at the current row (0).
val winSpec = Window.orderBy("tim").rowsBetween(Window.unboundedPreceding, 0)
// Collect all the values and sort them in descending order.
val df1 = df.withColumn("sort_array", sort_array(collect_list(struct("price", "qty")).over(winSpec), asc=false))
// Fetch the elements at positions 1 and 2, which represent the max and second max values.
val df2 = df1
.withColumn("top1price", element_at(array_distinct($"sort_array.price"), 1))
.withColumn("top2price", element_at(array_distinct($"sort_array.price"), 2))
.withColumn("top1priceQty", element_at($"sort_array.qty", 1))
.withColumn("top2priceQty", element_at($"sort_array.qty", 2))
.drop("sort_array")
// Display the result.
df2.show(truncate = false)
// Output
+---+---------+------+---+------+---------+---------+------------+------------+
|id |tim |price |qty|qtyChg|top1price|top2price|top1priceQty|top2priceQty|
+---+---------+------+---+------+---------+---------+------------+------------+
|1 |31951.509|0.37 |1 |1 |0.37 |null |1 |null |
|2 |31951.515|145.38|100|100 |145.38 |0.37 |100 |1 |
|3 |31951.519|149.37|100|100 |149.37 |145.38 |100 |100 |
|4 |31951.52 |144.37|100|100 |149.37 |145.38 |100 |100 |
|5 |31951.52 |149.37|300|200 |149.37 |145.38 |300 |100 |
|6 |31951.52 |119.37|5 |5 |149.37 |145.38 |300 |100 |
+---+---------+------+---+------+---------+---------+------------+------------+
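One caveat worth adding (my observation, not part of the original answer): element_at($"sort_array.qty", 2) takes the qty of the second struct overall, so when the top price occurs more than once the value can belong to a tied top1price rather than to the second distinct price; it happens to match in this sample. A sketch of one way to bind each qty to its distinct price, assuming Spark 3.0+ for the higher-order filter function:
// For each topNprice, pick the first (and, given the sort, highest-qty) struct with that exact price.
val df3 = df1
.withColumn("top1price", element_at(array_distinct($"sort_array.price"), 1))
.withColumn("top2price", element_at(array_distinct($"sort_array.price"), 2))
.withColumn("top1priceQty", element_at(filter($"sort_array", s => s.getField("price") === $"top1price"), 1).getField("qty"))
.withColumn("top2priceQty", element_at(filter($"sort_array", s => s.getField("price") === $"top2price"), 1).getField("qty"))
.drop("sort_array")
// element_at on an empty array returns null, so the first row still gets null for top2priceQty.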
I hope this helps.
I tried to solve this problem using the approach below.
Please note:
I haven't done any performance testing, so run some performance experiments with data variations matching your requirements before using it. The solution below can hurt performance badly, because the windowing has no partitionBy clause (see the sketch after this note).
Please check the output to see whether it satisfies your requirement. I haven't matched it row by row, but I believe it should work.
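To make the partitionBy point concrete, here is a hypothetical sketch: if the data carried a natural grouping key (an imaginary symbol column, not present in the sample data), partitioning the window by it would let Spark spread the work across partitions instead of funneling every row through a single one:
// "symbol" is a made-up column used only to illustrate partitioned windowing.
val wPartitioned = Window.partitionBy("symbol").orderBy("tim")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)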
val data =
"""
|id| tim| price|qty|qtyChg
| 1|31951.509| 0.370| 1| 1
| 2|31951.515|145.380|100| 100
| 3|31951.519|149.370|100| 100
| 4|31951.520|144.370|100| 100
| 5|31951.520|149.370|300| 200
| 6|31951.520|119.370| 5| 5
| 7|31951.521|149.370|400| 100
| 8|31951.522|109.370| 50| 50
| 9|31951.522|149.370|410| 10
|10|31951.522|144.370|400| 300
|11|31951.522|149.870| 50| 50
|12|31951.524|149.370|610| 200
|13|31951.526|135.130| 22| 22
|14|31951.527|149.370|750| 140
|15|31951.528| 89.370|100| 100
|16|31951.528|145.870| 50| 50
|17|31951.528|139.370|100| 100
|18|31951.531|149.370|769| 19
|19|31951.531|144.370|410| 10
|20|31951.538|149.370|869| 100
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.csv(stringDS)
df.show(false)
df.printSchema()
Output:
+---+---------+------+---+------+
|id |tim |price |qty|qtyChg|
+---+---------+------+---+------+
|1 |31951.509|0.37 |1 |1 |
|2 |31951.515|145.38|100|100 |
|3 |31951.519|149.37|100|100 |
|4 |31951.52 |144.37|100|100 |
|5 |31951.52 |149.37|300|200 |
|6 |31951.52 |119.37|5 |5 |
|7 |31951.521|149.37|400|100 |
|8 |31951.522|109.37|50 |50 |
|9 |31951.522|149.37|410|10 |
|10 |31951.522|144.37|400|300 |
|11 |31951.522|149.87|50 |50 |
|12 |31951.524|149.37|610|200 |
|13 |31951.526|135.13|22 |22 |
|14 |31951.527|149.37|750|140 |
|15 |31951.528|89.37 |100|100 |
|16 |31951.528|145.87|50 |50 |
|17 |31951.528|139.37|100|100 |
|18 |31951.531|149.37|769|19 |
|19 |31951.531|144.37|410|10 |
|20 |31951.538|149.37|869|100 |
+---+---------+------+---+------+
root
|-- id: integer (nullable = true)
|-- tim: double (nullable = true)
|-- price: double (nullable = true)
|-- qty: integer (nullable = true)
|-- qtyChg: integer (nullable = true)
val w = Window.orderBy("tim").rangeBetween(Window.unboundedPreceding, Window.currentRow)
val w1 = Window.orderBy("tim")
val processedDF = df.withColumn("maxPriceQty", max(struct(col("price"), col("qty"))).over(w))
.withColumn("secondMaxPriceQty", lag(col("maxPriceQty"), 1).over(w1))
.withColumn("top1price", col("maxPriceQty.price"))
.withColumn("top1priceQty", col("maxPriceQty.qty"))
.withColumn("top2price", col("secondMaxPriceQty.price"))
.withColumn("top2priceQty", col("secondMaxPriceQty.qty"))
processedDF.show(false)
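For context (my note, not in the original answer): this works because Spark compares structs field by field, so max over struct(price, qty) returns the highest price, with ties on price broken by the higher qty. A quick self-contained check with made-up values:
// Structs compare lexicographically: price first, then qty on a tie.
Seq((149.37, 300), (149.87, 50), (149.37, 400))
  .toDF("price", "qty")
  .select(max(struct($"price", $"qty")).as("maxPriceQty"))
  .show(false)
// Prints [149.87, 50]: the struct with the highest price wins.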
Output:
+---+---------+------+---+------+-------------+-----------------+---------+------------+---------+------------+
|id |tim |price |qty|qtyChg|maxPriceQty |secondMaxPriceQty|top1price|top1priceQty|top2price|top2priceQty|
+---+---------+------+---+------+-------------+-----------------+---------+------------+---------+------------+
|1 |31951.509|0.37 |1 |1 |[0.37, 1] |null |0.37 |1 |null |null |
|2 |31951.515|145.38|100|100 |[145.38, 100]|[0.37, 1] |145.38 |100 |0.37 |1 |
|3 |31951.519|149.37|100|100 |[149.37, 100]|[145.38, 100] |149.37 |100 |145.38 |100 |
|4 |31951.52 |144.37|100|100 |[149.37, 300]|[149.37, 100] |149.37 |300 |149.37 |100 |
|5 |31951.52 |149.37|300|200 |[149.37, 300]|[149.37, 300] |149.37 |300 |149.37 |300 |
|6 |31951.52 |119.37|5 |5 |[149.37, 300]|[149.37, 300] |149.37 |300 |149.37 |300 |
|7 |31951.521|149.37|400|100 |[149.37, 400]|[149.37, 300] |149.37 |400 |149.37 |300 |
|8 |31951.522|109.37|50 |50 |[149.87, 50] |[149.37, 400] |149.87 |50 |149.37 |400 |
|9 |31951.522|149.37|410|10 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|10 |31951.522|144.37|400|300 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|11 |31951.522|149.87|50 |50 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|12 |31951.524|149.37|610|200 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|13 |31951.526|135.13|22 |22 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|14 |31951.527|149.37|750|140 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|15 |31951.528|89.37 |100|100 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|16 |31951.528|145.87|50 |50 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|17 |31951.528|139.37|100|100 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|18 |31951.531|149.37|769|19 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|19 |31951.531|144.37|410|10 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
|20 |31951.538|149.37|869|100 |[149.87, 50] |[149.87, 50] |149.87 |50 |149.87 |50 |
+---+---------+------+---+------+-------------+-----------------+---------+------------+---------+------------+
Note the last 4 output columns. For an explanation of rowsBetween vs. rangeBetween, follow this link.
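Since the link text itself did not survive, here is a minimal illustration of the difference this answer relies on (my sketch, reusing the df parsed above): with duplicate tim values, rangeBetween treats every row sharing the current tim as a peer of the current row, while rowsBetween stops at the current physical row:
val byRows  = Window.orderBy("tim").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val byRange = Window.orderBy("tim").rangeBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("maxByRows",  max("price").over(byRows))
  .withColumn("maxByRange", max("price").over(byRange))
  .show(false)
// At id = 8 (tim = 31951.522), maxByRange is already 149.87 because rows 8-11
// share that tim and are all peers; maxByRows depends on the physical order of
// the tied rows, which is exactly why this answer uses rangeBetween.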