Spark - is map function available for Dataframe or just RDD?
I just realized that I can do the following in Scala:
val df = spark.read.csv("test.csv")
val df1=df.map(x=>x(0).asInstanceOf[String].toLowerCase)
But in Python, if I try to call map on a DataFrame, it throws an error:
df = spark.read.csv("Downloads/test1.csv")
df.map(lambda x: x[1].lower())
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/2.4.3/libexec/python/pyspark/sql/dataframe.py", line 1300, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'map'
In Python I need to explicitly convert the DataFrame to an RDD first.
My question is: why do I need to do this in Python? Is this a difference in how the Spark API is implemented, or does Scala implicitly convert the DataFrame to an RDD and then back to a DataFrame?
The Python DataFrame API doesn't have a map function because of how the Python API works.
In Python, every time you convert a DataFrame to an RDD or use a UDF, you create a Python call during execution.
What does that mean? It means that during Spark execution, instead of all the data being processed inside the JVM by generated Scala code (the DataFrame API), the JVM has to call into Python to apply the logic you wrote. By default, this adds significant overhead during execution.
So the solution for Python is to provide an API that avoids running arbitrary Python code and only uses Scala-generated code through the DataFrame pipeline.
This article will help you understand how UDFs work in PySpark, which is essentially the same mechanism as RDD map with Python: https://medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9