简体   繁体   中英

Spark Dataframe Scala: add new columns by some conditions

I revised my question so that it is easier to understand.

original df looks like this:

+---+----------+-------+----+------+
| id|tim       |price  | qty|qtyChg|
+---+----------+-------+----+------+
|  1| 31951.509|  0.370|   1|     1|
|  2| 31951.515|145.380| 100|   100|
|  3| 31951.519|149.370| 100|   100|
|  4| 31951.520|144.370| 100|   100|
|  5| 31951.520|119.370|   5|     5|
|  6| 31951.520|149.370| 300|   200|
|  7| 31951.521|149.370| 400|   100|
|  8| 31951.522|149.370| 410|    10|
|  9| 31951.522|149.870|  50|    50|
| 10| 31951.522|109.370|  50|    50|
| 11| 31951.522|144.370| 400|   300|
| 12| 31951.524|149.370| 610|   200|
| 13| 31951.526|135.130|  22|    22|
| 14| 31951.527|149.370| 750|   140|
| 15| 31951.528| 89.370| 100|   100|
| 16| 31951.528|145.870|  50|    50|
| 17| 31951.528|139.370| 100|   100|
| 18| 31951.531|144.370| 410|    10|
| 19| 31951.531|149.370| 769|    19|
| 20| 31951.538|149.370| 869|   100|
| 21| 31951.538|144.880| 200|   200|
| 22| 31951.541|139.370| 221|   121|
| 23| 31951.542|149.370|1199|   330|
| 24| 31951.542|139.370| 236|    15|
| 25| 31951.542|144.370| 510|   100|
| 26| 31951.543|146.250|  50|    50|
| 27| 31951.543|143.820| 100|   100|
| 28| 31951.543|139.370| 381|   145|
| 29| 31951.544|149.370|1266|    67|
| 30| 31951.544|150.000|  50|    50|
| 31| 31951.544|137.870| 300|   300|
| 32| 31951.544|140.470|  10|    10|
| 33| 31951.545|150.000|  53|     3|
| 34| 31951.545|140.000|  25|    25|
| 35| 31951.545|148.310|   8|     8|
| 36| 31951.547|149.000|  20|    20|
| 37| 31951.549|143.820| 102|     2|
| 38| 31951.549|150.110|  75|    75|
+---+----------+-------+----+------+

then I run the code

val ww = Window.partitionBy().orderBy($"tim") 

val step1 = df.withColumn("sequence",sort_array(collect_set(col("price")).over(ww),asc=false))
.withColumn("top1price",col("sequence").getItem(0))
.withColumn("top2price",col("sequence").getItem(1))
.drop("sequence")

The new dataframe looks like this:

+---+---------+-------+----+------+---------+---------+
| id|      tim|  price| qty|qtyChg|top1price|top2price|
+---+---------+-------+----+------+---------+---------+
|  1|31951.509|  0.370|   1|     1|    0.370|     null|
|  2|31951.515|145.380| 100|   100|  145.380|    0.370|
|  3|31951.519|149.370| 100|   100|  149.370|  145.380|
|  4|31951.520|149.370| 300|   200|  149.370|  145.380|
|  5|31951.520|144.370| 100|   100|  149.370|  145.380|
|  6|31951.520|119.370|   5|     5|  149.370|  145.380|
|  7|31951.521|149.370| 400|   100|  149.370|  145.380|
|  8|31951.522|109.370|  50|    50|  149.870|  149.370|
|  9|31951.522|144.370| 400|   300|  149.870|  149.370|
| 10|31951.522|149.870|  50|    50|  149.870|  149.370|
| 11|31951.522|149.370| 410|    10|  149.870|  149.370|
| 12|31951.524|149.370| 610|   200|  149.870|  149.370|
| 13|31951.526|135.130|  22|    22|  149.870|  149.370|
| 14|31951.527|149.370| 750|   140|  149.870|  149.370|
| 15|31951.528| 89.370| 100|   100|  149.870|  149.370|
| 16|31951.528|139.370| 100|   100|  149.870|  149.370|
| 17|31951.528|145.870|  50|    50|  149.870|  149.370|
| 18|31951.531|144.370| 410|    10|  149.870|  149.370|
| 19|31951.531|149.370| 769|    19|  149.870|  149.370|
| 20|31951.538|144.880| 200|   200|  149.870|  149.370|
| 21|31951.538|149.370| 869|   100|  149.870|  149.370|
| 22|31951.541|139.370| 221|   121|  149.870|  149.370|
| 23|31951.542|144.370| 510|   100|  149.870|  149.370|
| 24|31951.542|139.370| 236|    15|  149.870|  149.370|
| 25|31951.542|149.370|1199|   330|  149.870|  149.370|
| 26|31951.543|139.370| 381|   145|  149.870|  149.370|
| 27|31951.543|143.820| 100|   100|  149.870|  149.370|
| 28|31951.543|146.250|  50|    50|  149.870|  149.370|
| 29|31951.544|140.470|  10|    10|  150.000|  149.870|
| 30|31951.544|137.870| 300|   300|  150.000|  149.870|
| 31|31951.544|150.000|  50|    50|  150.000|  149.870|
| 32|31951.544|149.370|1266|    67|  150.000|  149.870|
| 33|31951.545|140.000|  25|    25|  150.000|  149.870|
| 34|31951.545|150.000|  53|     3|  150.000|  149.870|
| 35|31951.545|148.310|   8|     8|  150.000|  149.870|
| 36|31951.547|149.000|  20|    20|  150.000|  149.870|
| 37|31951.549|150.110|  75|    75|  150.110|  150.000|
| 38|31951.549|143.820| 102|     2|  150.110|  150.000|
+---+---------+-------+----+------+---------+---------+

I am hoping to get two new columns top1priceQty, top2priceQty which store the most updated corresponding qty of top1price and top2price.

For example, in row 6, top1price= 149.370, based on this value, I want to get its corresponding qty which is 400(not 100 or 300). in row 33, when top1price=150.00000000, I want to get its corresponding qty which is 53 that comes from row 32, not 50 from row 28. same rule apply to top2price

Thank you all in advance!

You were very close to the answer by yourself. Instead of collecting set of just one column, collect array of 'LMTPRICE' and it's corresponding 'qty'. Then use getItem(0).getItem(0) for top1price and getItem(0).getItem(1) for top1priceQty. To keep the order by INTEREST_TIME for getting correct qty, use INTEREST_TIME also after 'LMTPRICE' and before 'qty'.

df.withColumn("sequence",sort_array(collect_set(array("LMTPRICE","INTEREST_TIME","qty")).over(ww),asc=false)).withColumn("top1price",col("sequence").getItem(0).getItem(0)).withColumn("top1priceQty",col("sequence").getItem(0).getItem(2).cast("int")).drop("sequence").show(false)

+-----+-------------+--------+---+------+---------+------------+
|index|INTEREST_TIME|LMTPRICE|qty|qtyChg|top1price|top1priceQty|
+-----+-------------+--------+---+------+---------+------------+
|0    |31951.509    |0.37    |1  |1     |0.37     |1           |
|1    |31951.515    |145.38  |100|100   |145.38   |100         |
|2    |31951.519    |149.37  |100|100   |149.37   |100         |
|3    |31951.52     |119.37  |5  |5     |149.37   |300         |
|4    |31951.52     |144.37  |100|100   |149.37   |300         |
|5    |31951.52     |149.37  |300|200   |149.37   |300         |
|6    |31951.521    |149.37  |400|100   |149.37   |400         |
|7    |31951.522    |149.87  |50 |50    |149.87   |50          |
|8    |31951.522    |149.37  |410|10    |149.87   |50          |
|9    |31951.522    |109.37  |50 |50    |149.87   |50          |
|10   |31951.522    |144.37  |400|300   |149.87   |50          |
|11   |31951.524    |149.87  |610|200   |149.87   |610         |
|12   |31951.526    |135.13  |22 |22    |149.87   |610         |
|13   |31951.527    |149.37  |750|140   |149.87   |610         |
|14   |31951.528    |139.37  |100|100   |149.87   |610         |
|15   |31951.528    |145.87  |50 |50    |149.87   |610         |
|16   |31951.528    |89.37   |100|100   |149.87   |610         |
|17   |31951.531    |144.37  |410|10    |149.87   |610         |
|18   |31951.531    |149.37  |769|19    |149.87   |610         |
|19   |31951.538    |149.37  |869|100   |149.87   |610         |
+-----+-------------+--------+---+------+---------+------------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM