
How to cancel pyspark foreachPartition operation

How can I cancel a long-running pyspark foreachPartition operation?

For example, my code handles a very large amount of data (and takes a long time), but I want to allow the user to cancel the operation. How do I do that?

def get_data(self, spark_session):
    query = 'Some query...'
    my_data_frame = spark_session.sql(query)
    my_data_frame.foreachPartition(handle_data)
    # How to cancel on user request?

It can be done using:

sc = spark_session.sparkContext
sc.setJobGroup(...)
# In a separate thread:
sc.cancelJobGroup(...)

There is a full example in the PySpark API documentation.
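
For completeness, here is a minimal sketch of how the two pieces could fit together, assuming a hypothetical threading.Event (self.cancel_event) that your UI or request handler sets when the user asks to cancel, and an arbitrary job group id string; the query and handle_data are taken from the question above.

import threading

def get_data(self, spark_session):
    sc = spark_session.sparkContext
    group_id = "get-data-job"  # hypothetical id; any unique string works

    def cancel_on_user_request():
        # self.cancel_event is a hypothetical threading.Event that is set
        # elsewhere when the user requests cancellation.
        self.cancel_event.wait()
        sc.cancelJobGroup(group_id)

    threading.Thread(target=cancel_on_user_request, daemon=True).start()

    # Tag every job submitted from this thread so they can be cancelled as a group.
    sc.setJobGroup(group_id, "long-running foreachPartition")

    query = 'Some query...'
    my_data_frame = spark_session.sql(query)
    try:
        my_data_frame.foreachPartition(handle_data)
    except Exception as e:
        # A cancelled job group typically surfaces here as a failed action;
        # treat the failure as a cancellation or rethrow, as your app requires.
        print(f"Job cancelled or failed: {e}")

Note that cancellation does not make foreachPartition return normally; the action typically fails with an exception on the driver, so the caller should be prepared to interpret that failure as a user-requested cancel.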
