Spark - is map function available for Dataframe or just RDD?
I just realized that I can do the following in Scala:
val df = spark.read.csv("test.csv")
val df1=df.map(x=>x(0).asInstanceOf[String].toLowerCase)
But in Python, if I try to call map on a DataFrame, it throws an error:
df = spark.read.csv("Downloads/test1.csv")
df.map(lambda x: x[1].lower())
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/2.4.3/libexec/python/pyspark/sql/dataframe.py", line 1300, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'map'
In Python I need to explicitly convert the DataFrame to an RDD first.
My question is: why do I need to do this in Python? Is this a difference in how the Spark API is implemented, or does Scala implicitly convert the DataFrame to an RDD and then back to a DataFrame?
The Python DataFrame API doesn't have a map function because of how the Python API works.
In Python, every time you convert a DataFrame to an RDD or use a UDF, you create a Python call during execution.
What does that mean? It means that during Spark execution, instead of all the data being processed inside the JVM by generated Scala code (the DataFrame API), the JVM has to call into Python to apply the logic you wrote. By default, this adds significant overhead during execution.
So the solution for Python is to provide an API that avoids running arbitrary Python code and only uses Scala-generated code through the DataFrame pipeline.
This article will help you understand how UDFs work in PySpark, which is essentially the same mechanism as RDD map with Python: https://medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9