
PySpark: UDF is not executing on a dataframe

I am using PySpark in Jupyter on Azure. I am trying to test a UDF on a dataframe; however, the UDF is not executing.

My dataframe is created by:

users = sqlContext.sql("SELECT DISTINCT userid FROM FoodDiaryData")

I have confirmed this dataframe is populated with 100 rows. In the next cell I try to execute a simple UDF.

def iterateMeals(user):
    print(user)

users.foreach(iterateMeals)

This produces no output. I would have expected each entry in the dataframe to be printed. However, if I simply call iterateMeals('test') directly, it fires and prints 'test'. I also tried using pyspark.sql.functions:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def iterateMeals(user):
    print(user)

f_iterateMeals = udf(iterateMeals, LongType())

users.foreach(f_iterateMeals)

When I try this, I receive the following error:

Py4JError: An error occurred while calling o461.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist

Can someone explain where I have gone wrong? I will need to execute UDFs inside the .foreach of dataframes for this application.

  1. You won't see any output because print is executed on the worker nodes and goes to their respective standard output, not to the driver. See Why does foreach not bring anything to the driver program? for a complete explanation.

  2. foreach operates on an RDD, not a DataFrame. UDFs are not valid in this context; they are column expressions, not functions to pass to foreach.
