Scala Spark DataFrame question: how to add new columns by matching a value in the current row against an earlier row

Question

I need help with a Spark DataFrame.

The original DataFrame df looks like this:

+---+---------+-------+---+------+
| id|      tim|  price|qty|qtyChg|
+---+---------+-------+---+------+
|  1|31951.509|  0.370|  1|     1|
|  2|31951.515|145.380|100|   100|
|  3|31951.519|149.370|100|   100|
|  4|31951.520|144.370|100|   100|
|  5|31951.520|149.370|300|   200|
|  6|31951.520|119.370|  5|     5|
|  7|31951.521|149.370|400|   100|
|  8|31951.522|109.370| 50|    50|
|  9|31951.522|149.370|410|    10|
| 10|31951.522|144.370|400|   300|
| 11|31951.522|149.870| 50|    50|
| 12|31951.524|149.370|610|   200|
| 13|31951.526|135.130| 22|    22|
| 14|31951.527|149.370|750|   140|
| 15|31951.528| 89.370|100|   100|
| 16|31951.528|145.870| 50|    50|
| 17|31951.528|139.370|100|   100|
| 18|31951.531|149.370|769|    19|
| 19|31951.531|144.370|410|    10|
| 20|31951.538|149.370|869|   100|
+---+---------+-------+---+------+

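I add two columns, top1price and top2price, with this code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark

val ww = Window.partitionBy().orderBy($"tim")
val newdf = df.withColumn("sequence", sort_array(collect_set(col("price")).over(ww), asc = false))
                .withColumn("top1price", col("sequence").getItem(0))
                .withColumn("top2price", col("sequence").getItem(1)).drop("sequence")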

newdf looks like this:

+---+---------+-------+---+------+---------+---------+
| id|      tim|  price|qty|qtyChg|top1price|top2price|
+---+---------+-------+---+------+---------+---------+
|  1|31951.509|  0.370|  1|     1|    0.370|     null|
|  2|31951.515|145.380|100|   100|  145.380|    0.370|
|  3|31951.519|149.370|100|   100|  149.370|  145.380|
|  4|31951.520|119.370|  5|     5|  149.370|  145.380|
|  5|31951.520|144.370|100|   100|  149.370|  145.380|
|  6|31951.520|149.370|300|   200|  149.370|  145.380|
|  7|31951.521|149.370|400|   100|  149.370|  145.380|
|  8|31951.522|109.370| 50|    50|  149.870|  149.370|
|  9|31951.522|144.370|400|   300|  149.870|  149.370|
| 10|31951.522|149.370|410|    10|  149.870|  149.370|
| 11|31951.522|149.870| 50|    50|  149.870|  149.370|
| 12|31951.524|149.370|610|   200|  149.870|  149.370|
| 13|31951.526|135.130| 22|    22|  149.870|  149.370|
| 14|31951.527|149.370|750|   140|  149.870|  149.370|
| 15|31951.528| 89.370|100|   100|  149.870|  149.370|
| 16|31951.528|139.370|100|   100|  149.870|  149.370|
| 17|31951.528|145.870| 50|    50|  149.870|  149.370|
| 18|31951.531|144.370|410|    10|  149.870|  149.370|
| 19|31951.531|149.370|769|    19|  149.870|  149.370|
| 20|31951.538|144.880|200|   200|  149.870|  149.370|
+---+---------+-------+---+------+---------+---------+

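The logic behind top1price is the highest price seen so far at each point in time. For example, at time 31951.520 (id = 6), the highest price so far is 149.370, which comes from the row with id = 6. The second-highest price at that point is 145.380, from row 2. I would like to add two more columns, top1priceQty and top2priceQty. In the same example, at row 6 the highest price is 149.370 and its quantity is 300, also from row 6; top2price is 145.380 and its top2priceQty is 100, from row 2.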

 |  8|31951.522|109.370| 50|    50|  149.870|  149.370|

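For row 8, top1price is 149.870, which comes from row 11, because rows 8 through 11 share the same timestamp. So at that point 149.870 is the highest price, and its top1priceQty would be 50. top2price is 149.370, from row 7, so the corresponding top2priceQty is 400, also from row 7.

Thanks in advance!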

Answer 1

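This is an interesting problem. A few points are worth making.

We need the maximum and the second maximum of the price column "so far".
"So far" means we use all the data up to the given point, with no partitioning.
To make the window start at the unbounded beginning, we use Window.unboundedPreceding.
We use the rowsBetween function to specify a window that runs from the start to the current row.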

// Let's create a sample DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and $"..."; assumes a SparkSession named spark (automatic in spark-shell)

val df = Seq((1, 31951.509, 0.370, 1, 1),
              (2, 31951.515, 145.380, 100,100),
              (3, 31951.519, 149.370, 100, 100),
              (4, 31951.520, 144.370, 100, 100),
              (5, 31951.520, 149.370, 300, 200),
              (6, 31951.520, 119.370, 5, 5))
        .toDF("id", "tim", "price", "qty", "qtyChg")
        .orderBy("id")

// Define the window specification: it starts at the beginning (Window.unboundedPreceding) and ends at the current row (offset 0).
val winSpec = Window.orderBy("tim").rowsBetween(Window.unboundedPreceding, 0)

// Collect all the values and sort them in descending order.
val df1 = df.withColumn("sort_array", sort_array(collect_list(struct("price", "qty")).over(winSpec), asc=false))

// Fetch the elements at positions 1 and 2, which hold the max and second-max values.

val df2 = df1//.withColumn("top1price", element_at(sort_array(array_distinct($"sort_array.price"), asc=false), 1))
             //.withColumn("top2price", element_at(sort_array(array_distinct($"sort_array.price"), asc=false), 2))
             .withColumn("top1price", element_at(array_distinct($"sort_array.price"), 1))
             .withColumn("top2price", element_at(array_distinct($"sort_array.price"), 2))
             .withColumn("top1priceQty", element_at($"sort_array.qty", 1))
             .withColumn("top2priceQty", element_at($"sort_array.qty", 2))
             .drop("sort_array")

// Display the result.
df2.show(truncate= false)

 // Output

+---+---------+------+---+------+---------+---------+------------+------------+
|id |tim      |price |qty|qtyChg|top1price|top2price|top1priceQty|top2priceQty|
+---+---------+------+---+------+---------+---------+------------+------------+
|1  |31951.509|0.37  |1  |1     |0.37     |null     |1           |null        |
|2  |31951.515|145.38|100|100   |145.38   |0.37     |100         |1           |
|3  |31951.519|149.37|100|100   |149.37   |145.38   |100         |100         |
|4  |31951.52 |144.37|100|100   |149.37   |145.38   |100         |100         |
|5  |31951.52 |149.37|300|200   |149.37   |145.38   |300         |100         |
|6  |31951.52 |119.37|5  |5     |149.37   |145.38   |300         |100         |
+---+---------+------+---+------+---------+---------+------------+------------+
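
One thing to check in the code above: top2priceQty is read from position 2 of the sorted struct array, while top2price is read from position 2 of the de-duplicated price array. The two agree in the six-row sample only because the quantities happen to match; once the highest price occurs more than once, position 2 of the struct array is another row with the highest price. The sketch below is one possible way to keep each price together with its quantity. It assumes Spark 3.0 or later (for the `filter` function that takes a Scala lambda) and a local SparkSession; the names `sample`, `sorted`, `lower`, `top1`, and `top2` are illustrative, and for a repeated price it keeps the largest quantity, which is a choice made here, not something the question specifies. It also uses rangeBetween so that rows sharing a tim see each other.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[1]").appName("top2-sketch").getOrCreate()
import spark.implicits._

// First seven rows of the question's data.
val sample = Seq(
  (1, 31951.509, 0.370, 1, 1),
  (2, 31951.515, 145.380, 100, 100),
  (3, 31951.519, 149.370, 100, 100),
  (4, 31951.520, 144.370, 100, 100),
  (5, 31951.520, 149.370, 300, 200),
  (6, 31951.520, 119.370, 5, 5),
  (7, 31951.521, 149.370, 400, 100)
).toDF("id", "tim", "price", "qty", "qtyChg")

// Frame: every row whose tim is <= the current row's tim (ties included).
val w = Window.orderBy("tim").rangeBetween(Window.unboundedPreceding, Window.currentRow)

val result = sample
  // (price, qty) structs sorted descending: by price first, then by qty.
  .withColumn("sorted", sort_array(collect_list(struct($"price", $"qty")).over(w), asc = false))
  .withColumn("top1", element_at($"sorted", 1))
  // Keep only structs whose price is strictly below the top price.
  .withColumn("lower", filter($"sorted", s => s.getField("price") < $"top1.price"))
  // Null when there is no second distinct price yet.
  .withColumn("top2", when(size($"lower") > 0, element_at($"lower", 1)))
  .select($"id", $"tim", $"price", $"qty", $"qtyChg",
    $"top1.price".as("top1price"), $"top1.qty".as("top1priceQty"),
    $"top2.price".as("top2price"), $"top2.qty".as("top2priceQty"))
  .orderBy("id")

result.show(false)
```

With these seven rows, id 7 gets top1price 149.37 with top1priceQty 400, and top2price 145.38 with top2priceQty 100; id 1 has null in both top2 columns.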

I hope this helps.

Answer 2

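I tried to solve this with the approach below.

Please note:

I have not done any performance testing, so use this only after running some performance experiments with data variations that match your requirements. The solution below will hurt performance badly because the window specification has no partitionBy clause.
Please check the output to see whether it meets your requirements. I have not matched it row by row, but I think it should work.

Code (self-explanatory)

Load the data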

val data =
      """
        |id|      tim|  price|qty|qtyChg
        | 1|31951.509|  0.370|  1|     1
        | 2|31951.515|145.380|100|   100
        | 3|31951.519|149.370|100|   100
        | 4|31951.520|144.370|100|   100
        | 5|31951.520|149.370|300|   200
        | 6|31951.520|119.370|  5|     5
        | 7|31951.521|149.370|400|   100
        | 8|31951.522|109.370| 50|    50
        | 9|31951.522|149.370|410|    10
        |10|31951.522|144.370|400|   300
        |11|31951.522|149.870| 50|    50
        |12|31951.524|149.370|610|   200
        |13|31951.526|135.130| 22|    22
        |14|31951.527|149.370|750|   140
        |15|31951.528| 89.370|100|   100
        |16|31951.528|145.870| 50|    50
        |17|31951.528|139.370|100|   100
        |18|31951.531|149.370|769|    19
        |19|31951.531|144.370|410|    10
        |20|31951.538|149.370|869|   100
      """.stripMargin

    val stringDS = data.split(System.lineSeparator())
      .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
      .toSeq.toDS()
    val df = spark.read
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .csv(stringDS)
    df.show(false)
    df.printSchema()

Output:

+---+---------+------+---+------+
|id |tim      |price |qty|qtyChg|
+---+---------+------+---+------+
|1  |31951.509|0.37  |1  |1     |
|2  |31951.515|145.38|100|100   |
|3  |31951.519|149.37|100|100   |
|4  |31951.52 |144.37|100|100   |
|5  |31951.52 |149.37|300|200   |
|6  |31951.52 |119.37|5  |5     |
|7  |31951.521|149.37|400|100   |
|8  |31951.522|109.37|50 |50    |
|9  |31951.522|149.37|410|10    |
|10 |31951.522|144.37|400|300   |
|11 |31951.522|149.87|50 |50    |
|12 |31951.524|149.37|610|200   |
|13 |31951.526|135.13|22 |22    |
|14 |31951.527|149.37|750|140   |
|15 |31951.528|89.37 |100|100   |
|16 |31951.528|145.87|50 |50    |
|17 |31951.528|139.37|100|100   |
|18 |31951.531|149.37|769|19    |
|19 |31951.531|144.37|410|10    |
|20 |31951.538|149.37|869|100   |
+---+---------+------+---+------+

root
 |-- id: integer (nullable = true)
 |-- tim: double (nullable = true)
 |-- price: double (nullable = true)
 |-- qty: integer (nullable = true)
 |-- qtyChg: integer (nullable = true)

Get the highest and second-highest prices and their quantities

   val w = Window.orderBy("tim").rangeBetween(Window.unboundedPreceding, Window.currentRow)
    val w1 = Window.orderBy("tim")

    val processedDF = df.withColumn("maxPriceQty", max(struct(col("price"), col("qty"))).over(w))
      .withColumn("secondMaxPriceQty", lag(col("maxPriceQty"), 1).over(w1))
      .withColumn("top1price", col("maxPriceQty.price"))
      .withColumn("top1priceQty", col("maxPriceQty.qty"))
      .withColumn("top2price", col("secondMaxPriceQty.price"))
      .withColumn("top2priceQty", col("secondMaxPriceQty.qty"))
      processedDF.show(false)

Output:

+---+---------+------+---+------+-------------+-----------------+---------+------------+---------+------------+
|id |tim      |price |qty|qtyChg|maxPriceQty  |secondMaxPriceQty|top1price|top1priceQty|top2price|top2priceQty|
+---+---------+------+---+------+-------------+-----------------+---------+------------+---------+------------+
|1  |31951.509|0.37  |1  |1     |[0.37, 1]    |null             |0.37     |1           |null     |null        |
|2  |31951.515|145.38|100|100   |[145.38, 100]|[0.37, 1]        |145.38   |100         |0.37     |1           |
|3  |31951.519|149.37|100|100   |[149.37, 100]|[145.38, 100]    |149.37   |100         |145.38   |100         |
|4  |31951.52 |144.37|100|100   |[149.37, 300]|[149.37, 100]    |149.37   |300         |149.37   |100         |
|5  |31951.52 |149.37|300|200   |[149.37, 300]|[149.37, 300]    |149.37   |300         |149.37   |300         |
|6  |31951.52 |119.37|5  |5     |[149.37, 300]|[149.37, 300]    |149.37   |300         |149.37   |300         |
|7  |31951.521|149.37|400|100   |[149.37, 400]|[149.37, 300]    |149.37   |400         |149.37   |300         |
|8  |31951.522|109.37|50 |50    |[149.87, 50] |[149.37, 400]    |149.87   |50          |149.37   |400         |
|9  |31951.522|149.37|410|10    |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|10 |31951.522|144.37|400|300   |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|11 |31951.522|149.87|50 |50    |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|12 |31951.524|149.37|610|200   |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|13 |31951.526|135.13|22 |22    |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|14 |31951.527|149.37|750|140   |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|15 |31951.528|89.37 |100|100   |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|16 |31951.528|145.87|50 |50    |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|17 |31951.528|139.37|100|100   |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|18 |31951.531|149.37|769|19    |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|19 |31951.531|144.37|410|10    |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
|20 |31951.538|149.37|869|100   |[149.87, 50] |[149.87, 50]     |149.87   |50          |149.87   |50          |
+---+---------+------+---+------+-------------+-----------------+---------+------------+---------+------------+
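
A point to check in the table above: in rows 4 to 7, and again from row 9 on, top2price equals top1price. lag over maxPriceQty returns the previous row's running maximum, which is not necessarily the second-highest price so far. A tiny sketch with made-up prices shows the effect; the DataFrame `prices` and the columns `runningMax` and `laggedMax` are illustrative, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[1]").appName("lag-sketch").getOrCreate()
import spark.implicits._

// Made-up data: the highest price (30.0) arrives before a lower one (20.0).
val prices = Seq((1, 1.0, 10.0), (2, 2.0, 30.0), (3, 3.0, 20.0)).toDF("id", "tim", "price")

val running = Window.orderBy("tim").rangeBetween(Window.unboundedPreceding, Window.currentRow)
val ordered = Window.orderBy("tim")

val lagged = prices
  .withColumn("runningMax", max($"price").over(running))         // highest price so far
  .withColumn("laggedMax", lag($"runningMax", 1).over(ordered))  // previous row's running max
  .orderBy("id")

lagged.show(false)
```

For id 3 the running maximum is 30.0 and laggedMax is also 30.0, although the second-highest price so far is 20.0.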

Note the last four output columns. For an explanation of rowsBetween versus rangeBetween, see this link.
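
The difference between the two frame types matters here because several rows share the same tim. As a rough sketch (assuming a local SparkSession; the DataFrame `ticks` and the columns `rowsCount` and `rangeCount` are made up for illustration), the snippet below counts how many rows fall inside each kind of frame.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[1]").appName("frames-sketch").getOrCreate()
import spark.implicits._

// Three rows share tim = 31951.520 (ids and times taken from the question's data).
val ticks = Seq((4, 31951.520), (5, 31951.520), (6, 31951.520), (7, 31951.521)).toDF("id", "tim")

// rowsBetween frames by physical row position; id is added as a tiebreaker
// so that the order among equal tim values is deterministic.
val byRows = Window.orderBy("tim", "id").rowsBetween(Window.unboundedPreceding, Window.currentRow)

// rangeBetween frames by the ordering value: every row with tim <= the current tim.
val byRange = Window.orderBy("tim").rangeBetween(Window.unboundedPreceding, Window.currentRow)

val frameSizes = ticks
  .withColumn("rowsCount", count(lit(1)).over(byRows))
  .withColumn("rangeCount", count(lit(1)).over(byRange))
  .orderBy("id")

frameSizes.show(false)
```

Here rowsCount comes out as 1, 2, 3, 4 for ids 4 to 7, while rangeCount is 3, 3, 3, 4: with rangeBetween, each of the three rows at 31951.520 already sees the other two, which is the behavior the question describes for rows 8 to 11.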

Answer 1 (score 1, accepted): 2020-05-31 15:05:22

Answer 2 (score 0): 2020-05-31 06:12:13
