
Spark - is the map function available for DataFrames or just RDDs?

I just realized that I can do the following in Scala:

val df = spark.read.csv("test.csv")
val df1 = df.map(x => x(0).asInstanceOf[String].toLowerCase)

However, in Python, if I try to call the map function on a DataFrame, it throws an error.

df = spark.read.csv("Downloads/test1.csv")
df.map(lambda x: x[1].lower())

Error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.4.3/libexec/python/pyspark/sql/dataframe.py", line 1300, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'map'

In Python I need to explicitly convert the DataFrame to an RDD.
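For reference, a minimal sketch of that explicit conversion, assuming test1.csv has at least two columns as in the snippet above:

# Convert the DataFrame to an RDD of Row objects, then map over it.
df = spark.read.csv("Downloads/test1.csv")
lowered = df.rdd.map(lambda x: x[1].lower())  # x is a Row; x[1] is the second column
print(lowered.take(5))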

My question is: why do I need to do this in Python?

Is this a difference in the Spark API implementations, or does Scala implicitly convert the DataFrame to an RDD and then back to a DataFrame?

The Python DataFrame API doesn't have a map function because of how the Python API works.

In Python, every time you convert to an RDD or use a UDF with the Python API, you introduce a call out to a Python process during execution.

What does that mean? It means that during execution, instead of all the data being processed inside the JVM with generated Scala code (the DataFrame API), the JVM has to call out to Python code to apply the logic you created. By default, that creates a huge overhead during execution.
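A sketch of what triggers that round trip; the column name "_c0" is an assumption (Spark's default for a headerless CSV):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Each row's value is serialized out of the JVM, handed to a Python
# worker that runs this lambda, and shipped back -- the overhead
# described above.
to_lower = udf(lambda s: s.lower() if s is not None else None, StringType())
df.select(to_lower(df["_c0"])).show()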

So the solution for Python is to provide an API that avoids running Python code and uses only the Scala-generated code of the DataFrame pipeline.
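For example, the same lowercase transformation can be expressed with a built-in DataFrame function, which compiles down to JVM code so no Python worker is involved (again assuming the default "_c0" column name):

from pyspark.sql.functions import lower

# lower() is evaluated entirely inside the JVM via generated code.
df.select(lower(df["_c0"])).show()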

This article will help you understand how UDFs work with Python, which is basically very close to how RDD maps work with Python: https://medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9

