
How to use Spark to shuffle big files?

Original: 2017-08-07 06:30:40 4 2 apache-spark / shuffle

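I have a 20 GB text file and I want to shuffle its lines. Because of local memory limits, I want to do this with Spark. Can someone tell me how?

P.S. I considered using key pairs of the form (random.random(), line), so that Spark would sort by random.random() when saving the file. I don't know whether this works.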

2 answers

Assuming you are using Python, and assuming your text-file RDD is called lines, try the following:

import random
shuffled_RDD = lines.map(lambda line: (random.random(), line)).sortByKey().map(lambda kv: kv[1])

This is untested, but the logic should work.
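The random-key idea can be tried on a small in-memory list before running it on a cluster. The sketch below is plain Python, not Spark code, and only mirrors the three steps (attach a random key, sort by it, drop it). The function name `shuffle_by_random_key` and its `seed` parameter are hypothetical names introduced here for illustration.

```python
import random


def shuffle_by_random_key(lines, seed=None):
    """Shuffle lines by tagging each with a random key, sorting, and dropping the key."""
    rng = random.Random(seed)
    # Same role as the map step: pair every line with a random float.
    keyed = [(rng.random(), line) for line in lines]
    # Same role as sortByKey(): order the pairs by their random key.
    keyed.sort(key=lambda kv: kv[0])
    # Keep only the line. Note kv[1] is the line itself, while the slice
    # kv[1:] would be a one-element tuple such as ("some line",).
    return [kv[1] for kv in keyed]


sample = ["a", "b", "c", "d", "e"]
shuffled = shuffle_by_random_key(sample, seed=42)
print(shuffled)
```

The result contains exactly the input lines in a different order. On a real RDD the same three steps run distributed, so the 20 GB file never has to fit in one machine's memory.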

A simple solution is to read the file as a DataFrame and then use orderBy:

import org.apache.spark.sql.functions.rand
val shuffledDF = df.orderBy(rand())

This randomizes the order of the DataFrame's rows. After that, simply save it as a text file again.
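For readers working in Python rather than Scala, the read, orderBy, and save steps could look roughly like the PySpark sketch below. It is untested against a real 20 GB file. The helper name `shuffle_text_file` and both path parameters are hypothetical, and it assumes an existing `SparkSession` is passed in as `spark`.

```python
def shuffle_text_file(spark, input_path, output_path):
    """Read a text file, randomize its row order, and write it back as text."""
    # Imported inside the function so the helper can be defined even where
    # PySpark is not yet importable (an assumption made for this sketch).
    from pyspark.sql.functions import rand

    # spark.read.text yields one row per line, in a single column named "value".
    df = spark.read.text(input_path)
    # Sorting on a random column randomizes the row order.
    shuffled_df = df.orderBy(rand())
    # write.text needs a single string column; output is a directory of part files.
    shuffled_df.write.text(output_path)
```

With a session in hand, a call would look like `shuffle_text_file(spark, "input.txt", "shuffled_output")`, where both paths are placeholders. Spark writes the result as several part files inside the output directory rather than as one file.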

Related questions

Can Spark with External Shuffle Service use saved shuffle files in the event of executor failure?

How to shuffle the rows in a Spark dataframe?

How Spark shuffle operation works?

Is there way how to shuffle collection in Spark

Spark - "too many open files" in shuffle

Spark cache/persist vs shuffle files

Spark worker throws FileNotFoundException on temporary shuffle files

how to load big files ( json or csv ) in spark once

How to use GzipCodec or BZip2Codec for shuffle spill compression with Spark shell

How to use Spark for machine learning workflows with big models
