Sum of the previous row's value in the current row in Spark Scala

Question

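I am trying to adjust one of the column values based on the values in another data frame. While doing this, if there is any amount left over, I need to carry it forward to the next row and calculate the final amount.

During this operation, I am not able to hold the previous row's leftover amount for the next row's calculation. I tried the lag window function and a running total, but neither works as expected.

I am working with Scala. Here is the input data: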

val consumption = sc.parallelize(Seq((20180101, 600), (20180201, 900),(20180301, 400),(20180401, 600),(20180501, 1000),(20180601, 1900),(20180701, 500),(20180801, 100),(20180901, 500))).toDF("Month","Usage")
consumption.show()

+--------+-----+
|   Month|Usage|
+--------+-----+
|20180101|  600|
|20180201|  900|
|20180301|  400|
|20180401|  600|
|20180501| 1000|
|20180601| 1900|
|20180701|  500|
|20180801|  100|
|20180901|  500|
+--------+-----+

val promo = sc.parallelize(Seq((20180101, 1000),(20180201, 100),(20180401, 3000))).toDF("PromoEffectiveMonth","promoAmount")
promo.show()

+-------------------+-----------+
|PromoEffectiveMonth|promoAmount|
+-------------------+-----------+
|           20180101|       1000|
|           20180201|        100|
|           20180401|       3000|
+-------------------+-----------+

Expected result:

val finaldf = sc.parallelize(Seq((20180101,600,400,600),(20180201,900,0,400),(20180301,400,0,0),(20180401,600,2400,600),(20180501,1000,1400,1000),(20180601,1900,0,500),(20180701,500,0,0),(20180801,100,0,0),(20180901,500,0,0))).toDF("Month","Usage","LeftOverPromoAmt","AdjustedUsage")
finaldf.show()

+--------+-----+----------------+-------------+
|   Month|Usage|LeftOverPromoAmt|AdjustedUsage|
+--------+-----+----------------+-------------+
|20180101|  600|             400|          600|
|20180201|  900|               0|          400|
|20180301|  400|               0|            0|
|20180401|  600|            2400|          600|
|20180501| 1000|            1400|         1000|
|20180601| 1900|               0|          500|
|20180701|  500|               0|            0|
|20180801|  100|               0|            0|
|20180901|  500|               0|            0|
+--------+-----+----------------+-------------+

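The logic I want to apply is based on joining Month with PromoEffectiveMonth: the promo amount needs to be applied to the Usage column of the consumption data until the promo amount becomes zero.

For example, for Jan '18 the promo amount is 1000. After deducting the usage (600), the leftover promo amount is 400 and the adjusted usage is 600. The leftover 400 is carried into the next month and added to February's promo amount (100), so the promo amount available in February is 500. Here the usage (900) is larger than the available promo amount.

So the leftover promo amount is zero and the adjusted usage is 400 (900 - 500).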
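To make the arithmetic in this example concrete, here is a minimal sketch in plain Scala (no Spark). The object and value names are made up for illustration; the numbers only restate the January and February rows from the tables above.

```scala
// Hypothetical names; plain Scala, no Spark required.
object PromoExample {
  // January: the promo (1000) is larger than the usage (600)
  val janPromo = 1000L
  val janUsage = 600L
  val janLeftOver = janPromo - janUsage          // promo left after covering the usage

  // February: the carried-over amount is added to February's own promo (100)
  val febAvailable = janLeftOver + 100L
  val febUsage = 900L
  val febAdjustedUsage = febUsage - febAvailable // usage exceeds the available promo
  val febLeftOver = 0L                           // nothing is left to carry into March
}
```

The values follow the expected table: 400 left over in January, then 500 available and an adjusted usage of 400 in February.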

Answer 1

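First, you need to perform a left_outer join so that each row has its corresponding promotion. The join is done on the Month field of the consumption dataset and the PromoEffectiveMonth field of the promo dataset. Also note that I have created a new column, Timestamp. It is created with the Spark SQL unix_timestamp function and is used to sort the dataset by date.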

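import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.types.TimestampType

val ds = consumption
    .join(promo, consumption.col("Month") === promo.col("PromoEffectiveMonth"), "left_outer")
    .select("Month", "Usage", "promoAmount")
    // Parse the yyyyMMdd integer into a timestamp to sort on
    .withColumn("Timestamp", unix_timestamp($"Month".cast("string"), "yyyyMMdd").cast(TimestampType))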

This is the result of these operations.

+--------+-----+-----------+-------------------+
|   Month|Usage|promoAmount|          Timestamp|
+--------+-----+-----------+-------------------+
|20180301|  400|       null|2018-03-01 00:00:00|
|20180701|  500|       null|2018-07-01 00:00:00|
|20180901|  500|       null|2018-09-01 00:00:00|
|20180101|  600|       1000|2018-01-01 00:00:00|
|20180801|  100|       null|2018-08-01 00:00:00|
|20180501| 1000|       null|2018-05-01 00:00:00|
|20180201|  900|        100|2018-02-01 00:00:00|
|20180601| 1900|       null|2018-06-01 00:00:00|
|20180401|  600|       3000|2018-04-01 00:00:00|
+--------+-----+-----------+-------------------+
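As a rough illustration of what the left outer join does, the same lookup can be sketched with plain Scala collections. This is not Spark code: the object name LeftJoinSketch and the use of a Map (which assumes at most one promo per month) are assumptions made for the example.

```scala
// Hypothetical sketch; plain Scala collections stand in for the two DataFrames.
object LeftJoinSketch {
  // (Month, Usage), same values as the consumption DataFrame
  val consumption: Seq[(Int, Int)] = Seq(
    (20180101, 600), (20180201, 900), (20180301, 400), (20180401, 600), (20180501, 1000),
    (20180601, 1900), (20180701, 500), (20180801, 100), (20180901, 500))

  // PromoEffectiveMonth -> promoAmount (assumes at most one promo per month)
  val promo: Map[Int, Int] = Map(20180101 -> 1000, 20180201 -> 100, 20180401 -> 3000)

  // Every consumption row is kept; months without a promo get None,
  // which plays the role of null in the joined DataFrame
  val joined: Seq[(Int, Int, Option[Int])] =
    consumption.map { case (month, usage) => (month, usage, promo.get(month)) }
}
```

All nine months survive the join, and the six months without a promotion carry an empty promo amount, matching the null entries in the table above.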

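Next, you have to create a Window. Window functions are used to perform a calculation over a group of records according to some criteria. In our case, the criterion is to sort each group by Timestamp; since no partition is specified, all rows form a single group.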

 import org.apache.spark.sql.expressions.Window

 val window = Window.orderBy("Timestamp")

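OK, now comes the hardest part. You need to create a User Defined Aggregate Function. In this function, each group is processed according to a custom operation, which lets you process each row while taking the value of the previous row into account.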

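  import org.apache.spark.sql.Row
  import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
  import org.apache.spark.sql.types._

  class CalculatePromos extends UserDefinedAggregateFunction {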
    // Input schema for this UserDefinedAggregateFunction
    override def inputSchema: StructType =
      StructType(
        StructField("Usage", LongType) ::
        StructField("promoAmount", LongType) :: Nil)

    // Schema for the parameters that will be used internally to buffer temporary values
    override def bufferSchema: StructType = StructType(
        StructField("AdjustedUsage", LongType) ::
        StructField("LeftOverPromoAmt", LongType) :: Nil
    )

    // The data type returned by this UserDefinedAggregateFunction.
    // In this case, it will return a StructType with two fields: AdjustedUsage and LeftOverPromoAmt
    override def dataType: DataType = StructType(Seq(StructField("AdjustedUsage", LongType), StructField("LeftOverPromoAmt", LongType)))

    // Whether this UDAF is deterministic or not. In this case, it is
    override def deterministic: Boolean = true

    // Initial values for the temporary values declared above
    override def initialize(buffer: MutableAggregationBuffer): Unit = {
      buffer(0) = 0L
      buffer(1) = 0L
    }

    // In this function, the values associated to the buffer schema are updated
    override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {

      val promoAmount = if(input.isNullAt(1)) 0L else input.getLong(1)
      val leftOverAmount = buffer.getLong(1)
      val usage = input.getLong(0)
      val currentPromo = leftOverAmount + promoAmount

      if(usage < currentPromo) {
        buffer(0) = usage
        buffer(1) = currentPromo - usage
      } else {
        if(currentPromo == 0)
          buffer(0) = 0L
        else
          buffer(0) = usage - currentPromo
        buffer(1) = 0L
      }
    }

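    // Function used to merge two partial buffers. It is left empty here because this UDAF is only
    // meant to be used over an ordered window, where the whole logic runs row by row in update.
    // It would not give correct results in a regular groupBy aggregation.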
    override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {}

    // This is what is returned: a tuple of the buffered values, which represent AdjustedUsage and LeftOverPromoAmt
    override def evaluate(buffer: Row): Any = {
      (buffer.getLong(0), buffer.getLong(1))
    }

  }

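Basically, it creates a function that can be used in Spark SQL. It receives two columns (Usage and promoAmount, as specified in the method inputSchema) and returns a new column with two subcolumns (AdjustedUsage and LeftOverPromoAmt, as defined in the method dataType). With the method bufferSchema you declare the temporary values used to support the operation. In this case, I have defined AdjustedUsage and LeftOverPromoAmt.

The logic you want to apply is in the method update. Basically, it takes the previously calculated values and updates them. The parameter buffer holds the temporary values defined in bufferSchema, and input holds the values of the row being processed at that moment. Finally, evaluate returns a tuple containing the result of the operation for each row, in this case the temporary values defined in bufferSchema and updated in the method update.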
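The way update threads the buffer from one row to the next can be traced without Spark. The sketch below assumes the rows arrive already sorted by month and uses hypothetical names (UpdateTrace, step); it mirrors the branching in update for illustration and is not a replacement for the UDAF.

```scala
// Hypothetical trace of the buffer in plain Scala; not part of the original answer.
object UpdateTrace {
  // (Month, Usage, promoAmount); None mirrors the nulls produced by the left outer join
  val rows: Seq[(Int, Long, Option[Long])] = Seq(
    (20180101, 600L, Some(1000L)), (20180201, 900L, Some(100L)), (20180301, 400L, None),
    (20180401, 600L, Some(3000L)), (20180501, 1000L, None), (20180601, 1900L, None),
    (20180701, 500L, None), (20180801, 100L, None), (20180901, 500L, None))

  // Same branching as update; the state is (AdjustedUsage, LeftOverPromoAmt)
  def step(buffer: (Long, Long), usage: Long, promo: Option[Long]): (Long, Long) = {
    val currentPromo = buffer._2 + promo.getOrElse(0L) // carried-over amount plus this month's promo
    if (usage < currentPromo) (usage, currentPromo - usage)
    else if (currentPromo == 0L) (0L, 0L)
    else (usage - currentPromo, 0L)
  }

  // scanLeft keeps every intermediate buffer; tail drops the initial (0, 0) state
  val buffers: Seq[(Long, Long)] =
    rows.scanLeft((0L, 0L)) { case (buf, (_, usage, promo)) => step(buf, usage, promo) }.tail
}
```

With the sample data, each element of buffers lines up with the AdjustedUsage and LeftOverPromoAmt columns of the corresponding row in the expected result.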

The next step is to create a variable by instantiating the CalculatePromos class.

val calculatePromos = new CalculatePromos

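Finally, you have to apply the user defined aggregate function calculatePromos with the withColumn method of the dataset. Note that you must pass it the input columns (Usage and promoAmount) and then apply the window with the method over.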

ds
  .withColumn("output", calculatePromos($"Usage", $"promoAmount").over(window))
  .select($"Month", $"Usage", $"output.LeftOverPromoAmt".as("LeftOverPromoAmt"), $"output.AdjustedUsage".as("AdjustedUsage"))

This is the result:

+--------+-----+----------------+-------------+
|   Month|Usage|LeftOverPromoAmt|AdjustedUsage|
+--------+-----+----------------+-------------+
|20180101|  600|             400|          600|
|20180201|  900|               0|          400|
|20180301|  400|               0|            0|
|20180401|  600|            2400|          600|
|20180501| 1000|            1400|         1000|
|20180601| 1900|               0|          500|
|20180701|  500|               0|            0|
|20180801|  100|               0|            0|
|20180901|  500|               0|            0|
+--------+-----+----------------+-------------+

Hope it helps.

Solution 1
5 Accepted 2019-02-11 14:08:01