I just realized that I can do the following in Scala:

val df = spark.read.csv("test.csv")
val df1 = df.map(x => x(0).asInstanceOf[String].toLowerCase)
However, in Python, if I try to call the map function on a DataFrame, it throws an error:

df = spark.read.csv("Downloads/test1.csv")
df.map(lambda x: x[1].lower())
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/2.4.3/libexec/python/pyspark/sql/dataframe.py", line 1300, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'map'
In Python I need to explicitly convert the DataFrame to an RDD first.

My question is: why do I need to do this in Python? Is this a difference in the Spark API implementations, or does Scala implicitly convert the DataFrame to an RDD and back to a DataFrame?
The Python DataFrame API doesn't have a map function because of how the Python API works.
In Python, every time you convert to an RDD or use a UDF with the Python API, you create a Python call during execution. What does that mean? It means that during Spark execution, instead of all the data being processed inside the JVM with generated Scala code (the DataFrame API), the JVM has to call Python code to apply the logic you wrote: each row is serialized out of the JVM, shipped to a Python worker process, transformed, and serialized back. By default that creates a huge overhead during execution.
So the design choice for Python was to build an API that blocks arbitrary Python code and runs only JVM-generated code through the DataFrame pipeline, which is why DataFrame exposes no map and you must drop down to the RDD explicitly.
This post will help you understand how UDFs with Python work, which is basically very close to how RDD maps work with Python: https://medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9