
Spark SQL: Is it possible to process row after row in a pre-defined order for a given data frame partition?

I am looking to partition my data frame (df) based on a column (SECURITY_ID) and then run df.foreachPartition(customFunction). This is working fine.

Inside each partition, the data has to be ordered based on a column (RANK). This is also working fine.
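For reference, a minimal sketch of that setup (the column names come from the question; customFunction and the exact repartition/sortWithinPartitions wiring are my assumptions about how it is set up):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

// Hypothetical per-partition handler; stands in for the real business logic.
val customFunction: Iterator[Row] => Unit = { rows =>
  rows.foreach(row => println(row.mkString("|")))
}

def runPerSecurity(df: DataFrame): Unit = {
  df.repartition(col("SECURITY_ID"))   // rows with the same SECURITY_ID land in the same partition
    .sortWithinPartitions("RANK")      // order rows by RANK inside each partition
    .foreachPartition(customFunction)  // handler receives one partition's rows as an iterator
}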

Now, based on that order, I want to process the rows sequentially, one after another, within each partition. For example:

Base dataframe:

+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  32934789|      290X2|   -98763|   3|
|  3S534789|      290X2|    45300|   2|
|  3FA34789|      290X2|    12763|   1|
|  00000019|      290X2|-10177400|   4|
|  92115301|      35G71|     8003|   2|
|  91615301|      35G71|    -2883|   1|
+----------+-----------+---------+----+

After partitioning and ordering:

+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  3FA34789|      290X2|    12763|   1|
|  3S534789|      290X2|    45300|   2|
|  32934789|      290X2|   -98763|   3|
|  00000019|      290X2|-10177400|   4|
+----------+-----------+---------+----+



+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  91615301|      35G71|    -2883|   1|
|  92115301|      35G71|     8003|   2|
+----------+-----------+---------+----+

Let us consider this partition:

+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  3FA34789|      290X2|    12763|   1|
|  3S534789|      290X2|    45300|   2|
|  32934789|      290X2|   -98763|   3|
|  00000019|      290X2|-10177400|   4|
+----------+-----------+---------+----+

I need to process the rows one after another, in increasing order of RANK.

This seems to work fine on a single-node machine, but I see that the processing gets jumbled when running on a multi-node cluster.

How can I make sure that the order is guaranteed?

Could you please try coalesce(1) followed by the sort(cols:*) operation over the SECURITY_ID-partitioned DataFrame, to get a new DataFrame/Dataset sorted by the specified columns, all in ascending order.

df.coalesce(1).sort("RANK").foreach(row => process(row))
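To apply that suggestion per SECURITY_ID, one possible sketch (the loop over distinct ids, process(row), and the toLocalIterator usage are my assumptions, not part of the answer above) is:

import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Placeholder for the actual per-row processing logic.
def process(row: Row): Unit = println(row.mkString("|"))

// Assumes SECURITY_ID is a string column, as in the sample data.
val securityIds = df.select("SECURITY_ID").distinct().collect().map(_.getString(0))

securityIds.foreach { sid =>
  df.filter(col("SECURITY_ID") === sid)
    .coalesce(1)                   // a single partition, hence a single sequential task
    .sort("RANK")                  // sort the rows of this security by RANK
    .toLocalIterator().asScala     // pull rows back to the driver in sorted order
    .foreach(process)
}

toLocalIterator is used here so the rows are handled one at a time on the driver in the sorted order; with foreach, as in the one-liner above, the processing runs inside the executor task, which is likewise sequential once there is only one partition.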
