
Loop through dataframe and update the lookup table simultaneously: spark scala

I have a DataFrame like the following.

+---+-------------+-----+
| id|AccountNumber|scale|
+---+-------------+-----+
|  1|      1500847|    6|
|  2|      1501199|    7|
|  3|      1119024|    3|
+---+-------------+-----+
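
For reference, a minimal sketch to build this input, assuming a spark-shell style session where a SparkSession named spark is available:

import spark.implicits._

// hypothetical setup reproducing the sample input
val df = Seq(
  (1, 1500847, 6),
  (2, 1501199, 7),
  (3, 1119024, 3)
).toDF("id", "AccountNumber", "scale")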

I have to populate a second DataFrame, which would initially be empty, as follows.

+---+-------------+-----+
| id|AccountNumber|scale|
+---+-------------+-----+
|  1|      1500847|    6|
|  2|      1501199|    6|
|  3|      1119024|    3|
+---+-------------+-----+

Output explanation

The first row in the first DataFrame has a scale of 6. Check for that value minus 1 (so scale equals 5) in the result. There is none, so simply add the row (1, 1500847, 6) to the output.

The second row in the first DataFrame has a scale of 7. The original table already has a row with scale 7 - 1, so add this row but with that scale: (2, 1501199, 6).

The third row works like the first one.
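
To make the rule concrete, here is the same logic over the three sample scales in plain Scala; this is only a sketch on driver-side values, not how the answers below run it on Spark:

// keep scale - 1 when that value already appears somewhere in the scale column,
// otherwise keep scale unchanged
val scales = Seq(6, 7, 3)
val adjusted = scales.map(s => if (scales.contains(s - 1)) s - 1 else s)
// adjusted == List(6, 6, 3)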

Using a broadcasted list

You can collect all the values in the scale column as an Array and broadcast it so it can be used in a udf function. Then use the udf function in when logic with withColumn:

import org.apache.spark.sql.functions._

// collect every value of the scale column into one array on the driver
// and broadcast it to the executors
val collectedList = sc.broadcast(
  df.select(collect_list("scale")).collect()(0)(0)
    .asInstanceOf[collection.mutable.WrappedArray[Int]])

// udf that checks whether a given scale is present in the broadcasted list
def newScale = udf((scale: Int) => collectedList.value.contains(scale))

// if scale - 1 is already present, use it; otherwise keep the original scale
df.withColumn("scale", when(newScale(col("scale") - 1), col("scale") - 1).otherwise(col("scale")))
  .show(false)

You should have the desired output:

+---+-------------+-----+
|id |AccountNumber|scale|
+---+-------------+-----+
|1  |1500847      |6    |
|2  |1501199      |6    |
|3  |1119024      |3    |
+---+-------------+-----+
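
As a side note, a udf-free variant of the same idea is to collect the scales to the driver and inline them with Column.isin; this is only a sketch and assumes the scale column is an IntegerType and small enough to collect:

// collect the scale values on the driver (assumes an integer scale column)
val scaleSeq = df.select("scale").collect().map(_.getInt(0)).toSeq

df.withColumn("scale", when((col("scale") - 1).isin(scaleSeq: _*), col("scale") - 1).otherwise(col("scale")))
  .show(false)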

Using a Window function

The solution I am going to suggest requires you to collect all the data in one executor using a Window function, forming another column scaleCheck which is populated with all the scales present in the scale column:

import org.apache.spark.sql.expressions.Window

// an unbounded window, so every row sees the complete scale column
def windowSpec = Window.orderBy("id").rowsBetween(Long.MinValue, Long.MaxValue)
// collect all scales into a new scaleCheck column
val tempdf = df.withColumn("scaleCheck", collect_list("scale").over(windowSpec))

This would give you the following dataframe:

+---+-------------+-----+----------+
|id |AccountNumber|scale|scaleCheck|
+---+-------------+-----+----------+
|1  |1500847      |6    |[6, 7, 3] |
|2  |1501199      |7    |[6, 7, 3] |
|3  |1119024      |3    |[6, 7, 3] |
+---+-------------+-----+----------+

Then you would have to write a udf function to check whether the scale in the row is already present in the collected list. Then, using the when function and calling the udf function, you can generate the scale value.

import org.apache.spark.sql.functions._

// udf that checks whether a given scale is present in the collected list
def newScale = udf((scale: Int, scaleCheck: collection.mutable.WrappedArray[Int]) => scaleCheck.contains(scale))

// if scale - 1 already exists in the list, use it; otherwise keep the original scale
tempdf.withColumn("scale", when(newScale(col("scale") - 1, col("scaleCheck")), col("scale") - 1).otherwise(col("scale")))
  .drop("scaleCheck")
  .show(false)

So your final required dataframe, given above, is achieved.
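
As another side note, the membership test against scaleCheck can also be written without a udf, using the built-in array_contains function inside a SQL expression; a sketch, assuming the tempdf built above:

// udf-free sketch of the same check (expr comes from org.apache.spark.sql.functions)
tempdf.withColumn("scale", when(expr("array_contains(scaleCheck, scale - 1)"), col("scale") - 1).otherwise(col("scale")))
  .drop("scaleCheck")
  .show(false)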

I hope the answer is helpful.
