Loop through dataframe and update the lookup table simultaneously: Spark Scala
I have a DataFrame like the following.
+---+-------------+-----+
| id|AccountNumber|scale|
+---+-------------+-----+
|  1|      1500847|    6|
|  2|      1501199|    7|
|  3|      1119024|    3|
+---+-------------+-----+
I have to populate a second DataFrame, which would initially be empty, as follows.
+---+-------------+-----+
| id|AccountNumber|scale|
+---+-------------+-----+
|  1|      1500847|    6|
|  2|      1501199|    6|
|  3|      1119024|    3|
+---+-------------+-----+
The first row in the first DataFrame has a scale of 6. Check for that value minus 1 (so scale equals 5) in the result. There is none, so simply add the row (1, 1500847, 6) to the output.
The second row has a scale of 7. The original table already has a row with scale 7 - 1, so add this row but with that scale: (2, 1501199, 6).
The third row works like the first one.
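For reference, here is a minimal sketch that builds the input DataFrame shown above (it assumes an existing SparkSession available as spark):
import spark.implicits._

// example input data from the question
val df = Seq(
  (1, 1500847, 6),
  (2, 1501199, 7),
  (3, 1119024, 3)
).toDF("id", "AccountNumber", "scale")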
Using broadcasted list
You can collect all the values in the scale column as an Array, broadcast it, and use it in a udf function. Then use the udf function in when logic with withColumn, as follows:
import org.apache.spark.sql.functions._

// collect the scale column into a local list and broadcast it to the executors
val collectedList = sc.broadcast(df.select(collect_list("scale")).collect()(0)(0).asInstanceOf[collection.mutable.WrappedArray[Int]])

// udf that checks whether a given scale value is present in the broadcast list
def newScale = udf((scale: Int) => collectedList.value.contains(scale))

// if scale - 1 is present in the list, use it; otherwise keep the original scale
df.withColumn("scale", when(newScale(col("scale") - 1), col("scale") - 1).otherwise(col("scale")))
  .show(false)
You should have the desired output:
+---+-------------+-----+
|id |AccountNumber|scale|
+---+-------------+-----+
|1  |1500847      |6    |
|2  |1501199      |6    |
|3  |1119024      |3    |
+---+-------------+-----+
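Broadcasting the collected list ships one read-only copy to each executor, so the udf performs a local contains check instead of serializing the list with every task.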
Using Window function
The solution I am going to suggest would require you to collect all the data in one executor, using a Window function to form another column scaleCheck that is populated with all the values present in the scale column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// window over the entire dataset (no partitioning), so every row sees all scale values
def windowSpec = Window.orderBy("id").rowsBetween(Long.MinValue, Long.MaxValue)
val tempdf = df.withColumn("scaleCheck", collect_list("scale").over(windowSpec))
This would give you the following dataframe:
+---+-------------+-----+----------+
|id |AccountNumber|scale|scaleCheck|
+---+-------------+-----+----------+
|1  |1500847      |6    |[6, 7, 3] |
|2  |1501199      |7    |[6, 7, 3] |
|3  |1119024      |3    |[6, 7, 3] |
+---+-------------+-----+----------+
Then you would have to write a udf function to check whether the scale in the row is already present in the collected list. Then, using the when function and calling the udf function, you can generate the scale value:
import org.apache.spark.sql.functions._

// udf that checks whether a given scale value is present in the collected list
def newScale = udf((scale: Int, scaleCheck: collection.mutable.WrappedArray[Int]) => scaleCheck.contains(scale))

// if scale - 1 is present in the list, use it; otherwise keep the original scale
tempdf.withColumn("scale", when(newScale(col("scale") - 1, col("scaleCheck")), col("scale") - 1).otherwise(col("scale")))
  .drop("scaleCheck")
  .show(false)
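Note that because this Window has no partitionBy, Spark moves all rows into a single partition to compute scaleCheck, so for large datasets the broadcast approach above scales better.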
This gives you the final required dataframe shown above.
I hope the answer is helpful.