How to know which partition is currently running while using foreachPartition() in PySpark?
I need to save each partition to a text file with a distinct name per partition. However, when I run the snippet below, only one file is saved, because each partition overwrites the previous one.
import pandas as pd

def chunks(iterator):
    chunks.counter += 1
    l = list(iterator)
    df = pd.DataFrame(l, index=None)
    df.to_csv(parent_path + "C" + str(chunks.counter + 1) + ".txt", header=None, index=None, sep=' ')

chunks.counter = 0
sc.parallelize([1, 2, 3, 4, 5, 6], num_partions).foreachPartition(chunks)
Is there any way to know which partition PySpark is currently running?
def chunks(lst, n):
    # Yield successive n-sized chunks from lst, together with their start offset
    for i in range(0, len(lst), n):
        yield i, lst[i:i + n]

for index, values in chunks(range(0, 100000), 1000):  # use ints here: range() rejects floats like 1e5
    with open(f"{parent_path}_C_{index}.txt", "w") as output:
        output.write(str(list(values)))  # list() materializes the slice; str(values) alone would write "range(...)"
You can even wrap this in joblib easily ;) In my opinion, PySpark isn't needed here at all.
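For completeness, if the work does have to stay in Spark: the original snippet writes a single file because the function (including its `counter` attribute) is pickled and shipped to every task, so `chunks.counter` restarts at 0 on each executor and every partition writes the same file name. Spark exposes the partition index directly instead. A minimal sketch, assuming the asker's `sc`, `parent_path`, and `num_partions` are in scope, using `mapPartitionsWithIndex` (the index is also available inside `foreachPartition` via `TaskContext.get().partitionId()`):

```python
import pandas as pd
from pyspark import TaskContext

def save_chunk(index, iterator):
    # `index` is the partition number Spark passes to mapPartitionsWithIndex
    df = pd.DataFrame(list(iterator), index=None)
    df.to_csv(parent_path + "C" + str(index) + ".txt", header=None, index=None, sep=' ')
    return iter([])  # mapPartitionsWithIndex expects an iterator back

# mapPartitionsWithIndex is lazy, so an action (count) is needed to trigger the writes
sc.parallelize([1, 2, 3, 4, 5, 6], num_partions).mapPartitionsWithIndex(save_chunk).count()

# Alternatively, keep foreachPartition and ask the task for its own partition id:
def save_partition(iterator):
    index = TaskContext.get().partitionId()  # id of the partition this task is processing
    pd.DataFrame(list(iterator), index=None).to_csv(
        parent_path + "C" + str(index) + ".txt", header=None, index=None, sep=' ')

sc.parallelize([1, 2, 3, 4, 5, 6], num_partions).foreachPartition(save_partition)
```

Either way each partition gets a stable, distinct index, so no file overwrites another.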