Dataframe get first and last value of corresponding column

Is it possible to get the first value of the corresponding column within a subgroup?

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window, WindowSpec}

object tmp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._

    val input = Seq(
      (1235,  1, 1101, 0),
      (1235,  2, 1102, 0),
      (1235,  3, 1103, 1),
      (1235,  4, 1104, 1),
      (1235,  5, 1105, 0),
      (1235,  6, 1106, 0),
      (1235,  7, 1107, 1),
      (1235,  8, 1108, 1),
      (1235,  9, 1109, 1),
      (1235, 10, 1110, 0),
      (1235, 11, 1111, 0)
    ).toDF("SERVICE_ID", "COUNTER", "EVENT_ID", "FLAG")

    // This only returns the first EVENT_ID of the whole partition,
    // not the previous/next flagged EVENT_ID described below
    lazy val window: WindowSpec = Window.partitionBy("SERVICE_ID").orderBy("COUNTER")
    val firsts = input.withColumn("first_value",
      first("EVENT_ID", ignoreNulls = true).over(window.rangeBetween(Long.MinValue, Long.MaxValue)))
    firsts.orderBy("SERVICE_ID", "COUNTER").show()

  }
}

The output I want:

The first (or previous) value of column EVENT_ID where FLAG = 1, and the last (or next) value of column EVENT_ID where FLAG = 1, partitioned by SERVICE_ID and sorted by COUNTER:

+----------+-------+--------+----+-----------+-----------+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG|first_value| last_value|
+----------+-------+--------+----+-----------+-----------+
|      1235|      1|    1101|   0|          0|       1103|
|      1235|      2|    1102|   0|          0|       1103|
|      1235|      3|    1103|   1|          0|       1106|
|      1235|      4|    1104|   0|       1103|       1106|
|      1235|      5|    1105|   0|       1103|       1106|
|      1235|      6|    1106|   1|          0|       1108|
|      1235|      7|    1107|   0|       1106|       1108|
|      1235|      8|    1108|   1|          0|       1109|
|      1235|      9|    1109|   1|          0|       1110|
|      1235|     10|    1110|   1|          0|          0|
|      1235|     11|    1111|   0|       1110|          0|
|      1235|     12|    1112|   0|       1110|          0|
+----------+-------+--------+----+-----------+-----------+
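
For reference, a minimal single-pass sketch (not taken from the answers below) that matches the semantics of this table: the previous flagged EVENT_ID for each row (0 on FLAG = 1 rows and when none exists) and the next flagged EVENT_ID (0 when none exists). The names w, flagged and sketch are placeholders:

val w = Window.partitionBy("SERVICE_ID").orderBy("COUNTER")
// null unless FLAG = 1, so ignoreNulls skips unflagged rows
val flagged = when($"FLAG" === 1, $"EVENT_ID")

val sketch = input
  // previous flagged EVENT_ID, or 0; zeroed on FLAG = 1 rows as in the table
  .withColumn("first_value", when($"FLAG" === 1, lit(0)).otherwise(
    coalesce(last(flagged, ignoreNulls = true)
      .over(w.rowsBetween(Window.unboundedPreceding, -1)), lit(0))))
  // next flagged EVENT_ID, or 0
  .withColumn("last_value", coalesce(first(flagged, ignoreNulls = true)
    .over(w.rowsBetween(1, Window.unboundedFollowing)), lit(0)))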

First, the dataframe needs to be split into groups. A new group starts each time the "FLAG" column equals 1. To do this, first add a column "ID" to the dataframe:

// Number the flagged rows (FLAG = 1) per service; all other rows get ID 0
lazy val window: WindowSpec = Window.partitionBy("SERVICE_ID").orderBy("COUNTER")
val df_flag = input.filter($"FLAG" === 1)
  .withColumn("ID", row_number().over(window))
val df_other = input.filter($"FLAG" =!= 1)
  .withColumn("ID", lit(0))

// Create a group for each flag event: the running max of ID assigns every
// row to the most recent flagged row at or before it
val df = df_flag.union(df_other)
  .withColumn("ID", max("ID").over(window.rowsBetween(Long.MinValue, 0)))
  .cache()

df.show() gives:

+----------+-------+--------+----+---+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG| ID|
+----------+-------+--------+----+---+
|      1235|      1|    1111|   1|  1|
|      1235|      2|    1112|   0|  1|
|      1235|      3|    1114|   0|  1|
|      1235|      4|    2221|   1|  2|
|      1235|      5|    2225|   0|  2|
|      1235|      6|    2226|   0|  2|
|      1235|      7|    2227|   1|  3|
+----------+-------+--------+----+---+
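
As a design note, the same group column can be built without the union by taking a running sum of FLAG over the same window. This is an equivalent sketch, not part of the original answer; dfAlt is a placeholder name:

// FLAG is 0/1, so each FLAG = 1 row increments the running sum and starts
// a new group, matching the ID column built above
val dfAlt = input.withColumn("ID",
  sum($"FLAG").over(window.rowsBetween(Long.MinValue, 0)))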

Now that we have a column separating the events, we need to add the correct "EVENT_ID" (renamed "first_value") to each event. In addition to "first_value", calculate and add a second column "last_value", which is the ID of the next flagged event.

val df_event = df.filter($"FLAG" === 1)
  .select("EVENT_ID", "ID", "SERVICE_ID", "COUNTER")
  .withColumnRenamed("EVENT_ID", "first_value")
  // next flagged event's id per service, 0 when there is none
  .withColumn("last_value", lead($"first_value", 1, 0).over(window))
  .drop("COUNTER")

// Attach the flagged event's values to every row in its group, and zero
// out first_value on the flagged rows themselves
val df_final = df.join(df_event, Seq("ID", "SERVICE_ID"))
  .drop("ID")
  .withColumn("first_value", when($"FLAG" === 1, lit(0)).otherwise($"first_value"))

df_final.show() gives us:

+----------+-------+--------+----+-----------+----------+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG|first_value|last_value|
+----------+-------+--------+----+-----------+----------+
|      1235|      1|    1111|   1|          0|      2221|
|      1235|      2|    1112|   0|       1111|      2221|
|      1235|      3|    1114|   0|       1111|      2221|
|      1235|      4|    2221|   1|          0|      2227|
|      1235|      5|    2225|   0|       2221|      2227|
|      1235|      6|    2226|   0|       2221|      2227|
|      1235|      7|    2227|   1|          0|         0|
+----------+-------+--------+----+-----------+----------+

This can be solved in two steps:

  1. Get the events with "FLAG" == 1 and the valid range for each such event;
  2. Join the result of 1. with the input, by range.

Some column renaming is included for readability; it can be shortened:

// For each flagged event, RANGE_END is the COUNTER of the next flagged event
val window = Window.partitionBy("SERVICE_ID").orderBy("COUNTER").rowsBetween(Window.currentRow, 1)
val eventRangeDF = input.where($"FLAG" === 1)
  .withColumn("RANGE_END", max($"COUNTER").over(window))
  .withColumnRenamed("COUNTER", "RANGE_START")
  .select("SERVICE_ID", "EVENT_ID", "RANGE_START", "RANGE_END")
eventRangeDF.show(false)

// Join every FLAG = 0 row to the range it falls into
val result = input.where($"FLAG" === 0).as("i").join(eventRangeDF.as("e"),
  expr("e.SERVICE_ID = i.SERVICE_ID AND i.COUNTER > e.RANGE_START AND i.COUNTER < e.RANGE_END"))
  .select($"i.SERVICE_ID", $"i.COUNTER", $"i.EVENT_ID", $"i.FLAG", $"e.EVENT_ID".alias("first_value"))
  // include FLAG=1 rows with first_value = 0
  .union(input.where($"FLAG" === 1).select($"SERVICE_ID", $"COUNTER", $"EVENT_ID", $"FLAG", lit(0).alias("first_value")))

result.sort("COUNTER").show(false)

Output:

+----------+--------+-----------+---------+
|SERVICE_ID|EVENT_ID|RANGE_START|RANGE_END|
+----------+--------+-----------+---------+
|1235      |1111    |1          |4        |
|1235      |2221    |4          |7        |
|1235      |2227    |7          |7        |
+----------+--------+-----------+---------+

+----------+-------+--------+----+-----------+
|SERVICE_ID|COUNTER|EVENT_ID|FLAG|first_value|
+----------+-------+--------+----+-----------+
|1235      |1      |1111    |1   |0          |
|1235      |2      |1112    |0   |1111       |
|1235      |3      |1114    |0   |1111       |
|1235      |4      |2221    |1   |0          |
|1235      |5      |2225    |0   |2221       |
|1235      |6      |2226    |0   |2221       |
|1235      |7      |2227    |1   |0          |
+----------+-------+--------+----+-----------+
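
Note that this answer produces only first_value. A hedged extension, not part of the original answer, would carry the next flagged EVENT_ID on each range before the join and select it as last_value; nextWindow, eventRangeWithNext and NEXT_EVENT_ID are placeholder names:

// Sketch: attach the next flagged event's id to each range
val nextWindow = Window.partitionBy("SERVICE_ID").orderBy("RANGE_START")
val eventRangeWithNext = eventRangeDF
  .withColumn("NEXT_EVENT_ID", lead($"EVENT_ID", 1, 0).over(nextWindow))
// then in the join's select: $"e.NEXT_EVENT_ID".alias("last_value")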
