
Is it possible to subclass DataFrame in Pyspark?

The documentation for Pyspark shows DataFrames being constructed from sqlContext, sqlContext.read(), and a variety of other methods.

(See https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html )
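For instance, the usual construction paths look roughly like this (a minimal sketch assuming Spark 1.6; the JSON path is just a placeholder):

     from pyspark import SparkContext
     from pyspark.sql import SQLContext

     sc = SparkContext()
     sqlContext = SQLContext(sc)

     # Built directly from local data through the SQLContext ...
     df1 = sqlContext.createDataFrame([("a", 1)], ["foo", "bar"])

     # ... or through the reader interface
     df2 = sqlContext.read.json("people.json")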

Is it possible to subclass DataFrame and instantiate it independently? I would like to add methods and functionality to the base DataFrame class.

It really depends on your goals.

  • Technically speaking it is possible. pyspark.sql.DataFrame is just a plain Python class. You can extend it or monkey-patch it if you need to (see the monkey-patching sketch at the end of this answer).

     from pyspark.sql import DataFrame

     class DataFrameWithZipWithIndex(DataFrame):
         def __init__(self, df):
             super(self.__class__, self).__init__(df._jdf, df.sql_ctx)

         def zipWithIndex(self):
             return (self.rdd
                 .zipWithIndex()
                 .map(lambda row: (row[1], ) + row[0])
                 .toDF(["_idx"] + self.columns))

    Example usage:

     df = sc.parallelize([("a", 1)]).toDF(["foo", "bar"])
     with_zipwithindex = DataFrameWithZipWithIndex(df)

     isinstance(with_zipwithindex, DataFrame)

     True

     with_zipwithindex.zipWithIndex().show()

     +----+---+---+
     |_idx|foo|bar|
     +----+---+---+
     |   0|  a|  1|
     +----+---+---+
  • Practically speaking you won't be able to do much here. DataFrame is a thin wrapper around a JVM object and doesn't do much beyond providing docstrings, converting arguments to the form required natively, calling JVM methods, and wrapping the results in Python adapters where necessary.

    With plain Python code you won't even be able to get near DataFrame / Dataset internals, let alone modify their core behavior. If you're looking for a standalone, Python-only Spark DataFrame implementation, it is not possible.
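For reference, the monkey-patching route mentioned in the first point might look roughly like this; a minimal sketch that attaches the same zipWithIndex helper directly to the DataFrame class instead of subclassing (the method name is just illustrative, not part of the PySpark API):

     from pyspark.sql import DataFrame

     def zipWithIndex(self):
         # Same logic as above: prepend a positional index column
         return (self.rdd
             .zipWithIndex()
             .map(lambda row: (row[1], ) + row[0])
             .toDF(["_idx"] + self.columns))

     # Attach the helper to the class; every DataFrame in this session picks it up
     DataFrame.zipWithIndex = zipWithIndex

     sc.parallelize([("a", 1)]).toDF(["foo", "bar"]).zipWithIndex().show()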
