How to use Spark to shuffle a big file?
I have a 20 GB text file and I want to shuffle its lines. Due to local memory limitations I want to do it on Spark. Can someone tell me how to do it?
PS: I considered mapping each line to a key-value pair (random.random(), line), so that Spark will sort by random.random() when it saves the file. I don't know whether that works.
Assuming Python, and assuming your text file RDD is called lines, try this:
import random

shuffled_RDD = lines.map(lambda line: (random.random(), line)).sortByKey().map(lambda kv: kv[1])
This is not tested, but the logic should work.
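The sort-by-random-key idea can be sanity-checked locally without Spark: pairing each line with an independent uniform random key and sorting by that key produces a random permutation of the lines. A minimal plain-Python sketch of the same map/sortByKey/map pipeline (the sample lines are made up for illustration):

```python
import random

def shuffle_lines(lines):
    """Shuffle lines by keying each with a random float, sorting, then dropping the key."""
    keyed = [(random.random(), line) for line in lines]  # map: line -> (key, line)
    keyed.sort(key=lambda kv: kv[0])                     # sortByKey() analogue
    return [kv[1] for kv in keyed]                       # map: (key, line) -> line

lines = [f"line-{i}" for i in range(10)]
shuffled = shuffle_lines(lines)
assert sorted(shuffled) == sorted(lines)  # same lines, possibly new order
```

Each input line appears exactly once in the output, which is the property you need before writing the result back to disk.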
A simple solution would be to read the file as a DataFrame and then use orderBy:
import org.apache.spark.sql.functions.rand
val shuffledDF = df.orderBy(rand())
This will randomize the order of the DataFrame rows. After that, simply save it as a text file again.
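Since the question asks about Python, the same DataFrame approach in PySpark looks roughly like this (the session setup and file paths are illustrative assumptions, not part of the original answer; running it requires a Spark installation):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.appName("shuffle-lines").getOrCreate()

# Read each line of the text file as one row in a single-column DataFrame.
df = spark.read.text("/path/to/big_file.txt")  # hypothetical input path

# orderBy(rand()) sorts by a fresh uniform random column, i.e. a full shuffle.
shuffled_df = df.orderBy(rand())

# Write the shuffled lines back out as plain text.
shuffled_df.write.text("/path/to/shuffled_output")  # hypothetical output path
```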