Spark SQL: Is it possible to process row after row in a pre-defined order for a given data frame partition?
I am looking to partition my data frame (df) based on a column (SECURITY_ID) and then run df.foreachPartition(customFunction). This is working fine.
Inside each partition, the data has to be ordered based on a column (RANK). This is working fine.
Now, based on that order, I want to process the rows sequentially, one after the other, within each partition. For example:
Base dataframe:
+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  32934789|      290X2|   -98763|   3|
|  3S534789|      290X2|    45300|   2|
|  3FA34789|      290X2|    12763|   1|
|  00000019|      290X2|-10177400|   4|
|  92115301|      35G71|     8003|   2|
|  91615301|      35G71|    -2883|   1|
+----------+-----------+---------+----+
After partition and order by:
+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  3FA34789|      290X2|    12763|   1|
|  3S534789|      290X2|    45300|   2|
|  32934789|      290X2|   -98763|   3|
|  00000019|      290X2|-10177400|   4|
+----------+-----------+---------+----+

+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  91615301|      35G71|    -2883|   1|
|  92115301|      35G71|     8003|   2|
+----------+-----------+---------+----+
Let us consider this partition:
+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  3FA34789|      290X2|    12763|   1|
|  3S534789|      290X2|    45300|   2|
|  32934789|      290X2|   -98763|   3|
|  00000019|      290X2|-10177400|   4|
+----------+-----------+---------+----+
I need to process the rows one after the other, by RANK in increasing order.
This seems to work fine on a single-node machine, but I see that the processing order gets jumbled when running on a multi-node cluster.
How can I make sure that the order is guaranteed?
Could you please try coalesce(1) followed by a sort(cols: _*) operation over the SECURITY_ID-partitioned DataFrame? That returns a new DataFrame/Dataset sorted by the specified columns, all in ascending order.
df.coalesce(1).sort("RANK").foreach(row => process(row))
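Note that coalesce(1) forces everything through a single partition and loses all parallelism. An alternative sketch, assuming the Spark Scala API, that keeps one partition per SECURITY_ID while guaranteeing RANK order within each partition (process is a hypothetical placeholder for your per-row logic):

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

def processInRankOrder(df: DataFrame): Unit = {
  df.repartition(col("SECURITY_ID"))       // all rows of one security land in the same partition
    .sortWithinPartitions(col("RANK"))     // sort rows by RANK inside each partition only
    .foreachPartition { rows: Iterator[Row] =>
      // the iterator yields this partition's rows in ascending RANK order
      rows.foreach(row => process(row))    // process(...) is your own sequential logic
    }
}
```

sortWithinPartitions avoids the global shuffle that a full sort would trigger, and the sequential guarantee holds because foreachPartition hands each partition to a single task that consumes the iterator in order.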