
Spark SQL: Is it possible to process row after row in a pre-defined order for a given data frame partition?

I am looking to partition my data frame (df) based on a column (SECURITY_ID) and then run df.foreachPartition(customFunction). This is working fine.

Inside each partition, the data has to be ordered based on a column (RANK). This is also working fine.
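For reference, a minimal sketch of that setup (the column names come from the question; customFunction and the exact repartition/sortWithinPartitions wiring are my assumptions about how it is set up):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

// Hypothetical per-partition handler; stands in for the real business logic.
val customFunction: Iterator[Row] => Unit = { rows =>
  rows.foreach(row => println(row.mkString("|")))
}

def runPerSecurity(df: DataFrame): Unit = {
  df.repartition(col("SECURITY_ID"))   // rows with the same SECURITY_ID land in the same partition
    .sortWithinPartitions("RANK")      // order rows by RANK inside each partition
    .foreachPartition(customFunction)  // handler receives one partition's rows as an iterator
}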

Now, based on that order, I want to process the rows sequentially, one after another, within each partition. For example:

Base dataframe:

+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  32934789|      290X2|   -98763|   3|
|  3S534789|      290X2|    45300|   2|
|  3FA34789|      290X2|    12763|   1|
|  00000019|      290X2|-10177400|   4|
|  92115301|      35G71|     8003|   2|
|  91615301|      35G71|    -2883|   1|
+----------+-----------+---------+----+

After partitioning and ordering:

+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  3FA34789|      290X2|    12763|   1|
|  3S534789|      290X2|    45300|   2|
|  32934789|      290X2|   -98763|   3|
|  00000019|      290X2|-10177400|   4|
+----------+-----------+---------+----+



+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  91615301|      35G71|    -2883|   1|
|  92115301|      35G71|     8003|   2|
+----------+-----------+---------+----+

Let us consider this partition:

+----------+-----------+---------+----+
|ACCOUNT_NO|SECURITY_ID| QUANTITY|RANK|
+----------+-----------+---------+----+
|  3FA34789|      290X2|    12763|   1|
|  3S534789|      290X2|    45300|   2|
|  32934789|      290X2|   -98763|   3|
|  00000019|      290X2|-10177400|   4|
+----------+-----------+---------+----+

I need to process the rows one after another, in increasing order of RANK.

This seems to work fine on a single-node machine, but I see that the processing gets jumbled when running on a multi-node cluster.

How can I make sure that the order is guaranteed?

Could you please try coalesce(1) followed by the sort(cols:*) operation over the SECURITY_ID-partitioned DataFrame, to get a new DataFrame/Dataset sorted by the specified columns, all in ascending order.

df.coalesce(1).sort("RANK").foreach(row => process(row))
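To apply that suggestion per SECURITY_ID, one possible sketch (the loop over distinct ids, process(row), and the toLocalIterator usage are my assumptions, not part of the answer above) is:

import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Placeholder for the actual per-row processing logic.
def process(row: Row): Unit = println(row.mkString("|"))

// Assumes SECURITY_ID is a string column, as in the sample data.
val securityIds = df.select("SECURITY_ID").distinct().collect().map(_.getString(0))

securityIds.foreach { sid =>
  df.filter(col("SECURITY_ID") === sid)
    .coalesce(1)                   // a single partition, hence a single sequential task
    .sort("RANK")                  // sort the rows of this security by RANK
    .toLocalIterator().asScala     // pull rows back to the driver in sorted order
    .foreach(process)
}

toLocalIterator is used here so the rows are handled one at a time on the driver in the sorted order; with foreach, as in the one-liner above, the processing runs inside the executor task, which is likewise sequential once there is only one partition.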
