
How to Use Spark to Shuffle big files?

I have a 20G text file and I want to shuffle its lines. Due to the limitation of local memory, I want to do it on Spark. Can someone tell me how to do it?

P.S. I considered using the key-value pair (random.random(), line), so Spark will sort by random.random() as it saves the file. I do not know whether that works.

Assuming Python, and assuming your text file RDD is called lines, try this:

import random
shuffled_RDD = lines.map(lambda line: (random.random(), line)).sortByKey().map(lambda pair: pair[1])

This is not tested, but the logic should work.
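To put that one-liner in context, here is a minimal, untested PySpark sketch of the whole job; the app name and the input and output paths are placeholders:

import random

from pyspark import SparkContext

sc = SparkContext(appName="shuffle-lines")  # hypothetical app name

# Read the file as an RDD of lines; Spark splits it across partitions,
# so no single machine has to hold all 20G in memory.
lines = sc.textFile("hdfs:///path/to/input.txt")  # placeholder path

# Key every line with a random float, sort by that key, then drop the key.
shuffled = (lines
            .map(lambda line: (random.random(), line))
            .sortByKey()
            .map(lambda pair: pair[1]))

# Write the shuffled lines back out as a set of part files.
shuffled.saveAsTextFile("hdfs:///path/to/output")  # placeholder path

Note that sortByKey() triggers a full distributed sort, so the shuffle happens across the cluster rather than in local memory.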

A simple solution would be to read the file as a dataframe and then use orderBy:

import org.apache.spark.sql.functions.rand
val shuffledDF = df.orderBy(rand())

This will randomize the order of the dataframe rows. After this, simply save as a text file again.
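Since the question mentions Python, here is a hedged PySpark equivalent of the same dataframe approach, including the final save step; the app name and paths are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.appName("shuffle-lines").getOrCreate()  # hypothetical app name

# spark.read.text yields a DataFrame with one string column named "value".
df = spark.read.text("hdfs:///path/to/input.txt")  # placeholder path

# orderBy(rand()) assigns each row a random value and sorts by it.
shuffled_df = df.orderBy(rand())

# df.write.text expects a single string column, which "value" satisfies.
shuffled_df.write.text("hdfs:///path/to/output")  # placeholder path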
