I have a 20 GB text file and I want to shuffle its lines. Because the file does not fit in local memory, I want to do it with Spark. Can someone tell me how?
PS: I considered mapping each line to a key-value pair (random.random(), line) so that Spark sorts by random.random() as it saves the file, but I am not sure whether that works.
Assuming Python, and assuming your text file RDD is called lines, try this:
import random

shuffled_RDD = lines.map(lambda line: (random.random(), line)).sortByKey().map(lambda pair: pair[1])
This is not tested, but the logic should work.
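The same decorate-sort-undecorate idea can be sketched in plain Python (no Spark) to see why it works: attach a random key to each line, sort by the key, then discard the key. The function and variable names below are illustrative, not part of any Spark API:

```python
import random

def shuffle_lines(lines):
    # Pair each line with a random key, sort by the key,
    # then drop the key -- the same logic as the RDD chain above.
    keyed = [(random.random(), line) for line in lines]
    keyed.sort(key=lambda pair: pair[0])
    return [line for _, line in keyed]

data = ["a", "b", "c", "d", "e"]
result = shuffle_lines(data)
# result is a permutation of data: same elements, randomized order.
```

Spark applies the same trick at scale: sortByKey performs a distributed sort, so no single machine ever needs to hold the whole file in memory.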
A simple solution would be to read the file as a DataFrame and then use orderBy:
import org.apache.spark.sql.functions.rand
val shuffledDF = df.orderBy(rand())
This randomizes the order of the DataFrame rows. After that, simply save it as a text file again.