Caching ordered Spark DataFrame creates unwanted job

Original post: 2017-03-22 12:41:35 · Tags: python, apache-spark, pyspark, apache-spark-sql, pyspark-sql

I want to convert an RDD to a DataFrame and cache the result:

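from pyspark.sql import Row, SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType

# In the pyspark shell `spark` and `sc` already exist; these two lines make the snippet standalone.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])

df = spark.createDataFrame(
    # Uncomment .cache() on the next line to cache the RDD (no job is generated).
    sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4), #.cache(),
    schema=schema,
    verifySchema=False
).orderBy("t") #.cache()  # Uncomment to cache the ordered DataFrame (one job is generated).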

If I do not call cache at all, no job is generated.
If I call cache only after orderBy, one job is generated by the cache call.
If I call cache only after parallelize, no job is generated.

Why does cache generate a job in this case? How can I avoid the job that cache generates (while caching the DataFrame rather than the RDD)?

Edit: I investigated the problem further and found that without orderBy("t") no job is generated. Why?

1 Answer

I filed a bug ticket, and it was closed with the following reason:

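Caching requires the backing RDD. That requires that we also know the backing partitions, and this is somewhat special for a global order: it triggers a job (a scan) because we need to determine the partition boundaries.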
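
To make the "partition boundaries" point concrete, here is a minimal pure-Python sketch of range partitioning for a global sort. It is not Spark's actual implementation; the helper names `range_boundaries` and `partition_for`, the sample size, and the seed are all assumptions chosen for illustration. The idea it shows is that split points can only be chosen after looking at the data, which is why a pass over the data happens before the partitions of an ordered result are known.

```python
import bisect
import random


def range_boundaries(keys, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 split points from a random sample of the keys.

    Hypothetical helper: reading `keys` here stands in for the scan that
    the ticket response mentions.
    """
    rng = random.Random(seed)
    keys = list(keys)  # the data must be read before any boundary exists
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Take evenly spaced quantiles of the sorted sample as split points.
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Return the index of the range partition that `key` falls into."""
    return bisect.bisect_right(boundaries, key)


# Same keys as the `t` column in the question, split into 4 ranges.
keys = [i / 10 for i in range(1000)]
boundaries = range_boundaries(keys, 4)

partitions = [[] for _ in range(4)]
for k in keys:
    partitions[partition_for(k, boundaries)].append(k)
```

Once every key sits in its range, sorting each partition locally yields a globally ordered result. Without orderBy no such boundaries are needed, which is consistent with the observation that caching the unordered DataFrame generates no job.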

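Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the address of the original post. For any questions, contact: yoyou2525@163.com.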
