Is it possible to subclass DataFrame in PySpark?

Question

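The PySpark documentation shows DataFrames being constructed from sqlContext, sqlContext.read(), and a variety of other methods.

(See https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html)

Is it possible to subclass DataFrame and instantiate it independently? I would like to add methods and functionality to the base DataFrame class.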

Answer 1

It really depends on your goals.

Technically, it is possible. pyspark.sql.DataFrame is just a plain Python class. You can extend it or monkey-patch it if you need to.

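    from pyspark.sql import DataFrame

    class DataFrameWithZipWithIndex(DataFrame):
        def __init__(self, df):
            # Reuse the JVM DataFrame and SQL context of an existing DataFrame.
            # The class is named explicitly: super(self.__class__, self)
            # recurses forever if this class is subclassed again.
            super(DataFrameWithZipWithIndex, self).__init__(df._jdf, df.sql_ctx)

        def zipWithIndex(self):
            # Prepend each row's index as a new first column
            return (self.rdd
                .zipWithIndex()
                .map(lambda row: (row[1], ) + row[0])
                .toDF(["_idx"] + self.columns))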

Example usage:

    df = sc.parallelize([("a", 1)]).toDF(["foo", "bar"])
    with_zipwithindex = DataFrameWithZipWithIndex(df)

    isinstance(with_zipwithindex, DataFrame)

    True

    with_zipwithindex.zipWithIndex().show()

    +----+---+---+
    |_idx|foo|bar|
    +----+---+---+
    |   0|  a|  1|
    +----+---+---+

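In practice, you won't be able to do much here. DataFrame is a thin wrapper around a JVM object. It does little beyond providing docstrings, converting arguments to the form required natively, calling JVM methods, and wrapping the results with Python adapters where necessary.
With plain Python code you cannot even get near the DataFrame / Dataset internals or modify its core behavior. If you are looking for a standalone, Python-only Spark DataFrame implementation, that is not possible.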
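The thin-wrapper point can be illustrated with a pure-Python sketch. None of the classes below are Spark APIs: FakeJvmFrame is a hypothetical stand-in for the JVM-side object, and ThinFrame mimics, under that assumption, how a wrapper only forwards calls to it.

```python
class FakeJvmFrame:
    """Hypothetical stand-in for the JVM-side object; not a Spark API."""

    def __init__(self, rows):
        self._rows = list(rows)

    def limit(self, n):
        # The real work happens on this side of the boundary.
        return FakeJvmFrame(self._rows[:n])

    def count(self):
        return len(self._rows)


class ThinFrame:
    """Hypothetical wrapper that only forwards calls to the backing object."""

    def __init__(self, backend):
        self._backend = backend

    def limit(self, n):
        # Convert the argument, call the backend, wrap the result.
        return ThinFrame(self._backend.limit(int(n)))

    def count(self):
        return self._backend.count()


class ThinFrameWithHead(ThinFrame):
    """A subclass can add helpers, but only by composing existing calls."""

    def head_count(self, n):
        return self.limit(n).count()


frame = ThinFrameWithHead(FakeJvmFrame(range(10)))
print(frame.head_count(3))
print(type(frame.limit(2)).__name__)
```

The second print shows a practical consequence of this design: inherited methods hand back the base wrapper, not the subclass, so the extra method is lost after the first chained call. PySpark's DataFrame behaves the same way, since its methods construct plain DataFrame objects for their results.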

Answer 1
8 Accepted 2017-01-11 18:54:05