
Spark - is map function available for Dataframe or just RDD?

I just realized that I can do the following in Scala:

// in spark-shell, spark.implicits._ is already in scope,
// which supplies the encoder this map needs
val df = spark.read.csv("test.csv")
val df1 = df.map(x => x(0).asInstanceOf[String].toLowerCase)

But in Python, if I try to call map on a DataFrame, it throws an error.

df = spark.read.csv("Downloads/test1.csv")
df.map(lambda x: x[1].lower())

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.4.3/libexec/python/pyspark/sql/dataframe.py", line 1300, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'map'

In Python, I need to explicitly convert the DataFrame to an RDD first.
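
For example, a minimal sketch of that conversion, reusing the df from above:

# drop down to the RDD API: df.rdd gives an RDD of Row objects,
# which does support map
df1 = df.rdd.map(lambda x: x[1].lower())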

My question is: why do I need to do this in the Python case?

Is this a difference in the Spark API implementation, or does Scala implicitly convert the DataFrame to an RDD and then back to a DataFrame?

The Python DataFrame API doesn't have a map function because of how the Python API works.

In Python, every time you convert to an RDD or use a UDF through the Python API, you create a Python call during execution.

What does that mean? It means that during Spark execution, instead of all the data being processed inside the JVM with generated Scala code (the DataFrame API), the JVM has to call out to Python code to apply the logic you wrote. By default, this adds a huge overhead during execution.
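
A Python UDF makes the same round trip as an RDD map. As a rough sketch (assuming the default _c0 column name that spark.read.csv assigns when there is no header):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# every row is serialized, shipped to a Python worker process,
# run through this lambda, and shipped back to the JVM
lower_udf = udf(lambda s: s.lower(), StringType())
df2 = df.withColumn("lowered", lower_udf(df["_c0"]))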

So the solution for Python is to provide an API that blocks the use of Python code and uses only generated Scala code via the DataFrame pipeline.
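
For the lowercase example above, that means reaching for a built-in column function instead of a Python lambda. A sketch, again assuming the default _c0 column name:

from pyspark.sql.functions import lower

# lower() is planned and executed entirely inside the JVM;
# no Python worker is involved at runtime
df3 = df.withColumn("lowered", lower(df["_c0"]))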

This article helps explain how Python UDFs work, which is basically very close to how RDD maps work with Python: https://medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9

