I have a 20 GB text file and I want to shuffle its lines. Because the file does not fit in local memory, I want to do it with Spark. Can someone tell me how?
PS: I considered mapping each line to a key-value pair (random.random(), line) so that Spark sorts by random.random() as it saves the file, but I am not sure whether that works.
Assuming Python, and assuming your text file RDD is called lines, try this:
import random

shuffled_RDD = lines.map(lambda line: (random.random(), line)).sortByKey().map(lambda pair: pair[1])
This is not tested, but the logic should work.
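The same decorate-sort-undecorate idea can be sketched in plain Python (no Spark) to see why it works: attach a random key to each line, sort by the key, then discard the key. The function and variable names below are illustrative, not part of any Spark API:

```python
import random

def shuffle_lines(lines):
    # Pair each line with a random key, sort by the key,
    # then drop the key -- the same logic as the RDD chain above.
    keyed = [(random.random(), line) for line in lines]
    keyed.sort(key=lambda pair: pair[0])
    return [line for _, line in keyed]

data = ["a", "b", "c", "d", "e"]
result = shuffle_lines(data)
# result is a permutation of data: same elements, randomized order.
```

Spark applies the same trick at scale: sortByKey performs a distributed sort, so no single machine ever needs to hold the whole file in memory.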
A simple solution would be to read the file as a DataFrame and then use orderBy:
import org.apache.spark.sql.functions.rand
val shuffledDF = df.orderBy(rand())
This randomizes the order of the DataFrame rows. After that, simply save it as a text file again.