Error when trying to execute User Defined Functions in Pyspark over a Dataframe

I am writing a small PySpark program in which I want to create a User Defined Function (UDF) that calls 'method1' from a lambda function inside 'method0'.

I simplified the code to make it easier to understand, but the core functionality is: for each instance (row) in a dataframe, 'method0' applies 'method1' (through a lambda function) to return a value that depends on the values of the instance being inspected. If the first condition of 'method1' is met (ATR1 is not '-'), the value for that instance should be its ATR1 value; otherwise it should be 'other'.

With those operations, the idea is to get a column from that UDF and attach it to the dataframe in 'method0'. Here is the simplified code:

def method1(atr_list, instance, ident):

    if(instance.ATR1 != '-'):
        return instance.ATR1
    else:
        # Other operations ...
        return 'other'

def method0(df, atr_example_list, ident):

    udf_func = udf(lambda instance: method1(atr_example_list, instance, ident), returnType=StringType())
    new_column = udf_func(df)
    df = df.withColumnRenamed("New_Column", new_column)
    return df

result = method0(df, list, "1111")

But when I execute this code, I get the following error and I don't really know why:

Py4JError: An error occurred while calling o298.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Here is an example of input and the output I expect:

Dataframe 'df':

+-------+-------+-------+
| ATR1  |  ATR2 | ATRN  |
+-------+-------+-------+
| '-'   |   1   |  'a'  |
| '-'   |   1   |  'a'  |
| '-'   |   2   |  'b'  | 
| '++'  |   1   |  'a'  |
+-------+-------+-------+
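
To make the example reproducible, the table above could be built roughly as follows. This is only a sketch: the names `EXAMPLE_ROWS`, `EXAMPLE_COLUMNS` and `make_example_df` are made up for illustration, and `spark` is assumed to be an existing SparkSession.

```python
# Rows copied from the table above; ATR1 and ATRN are strings, ATR2 is an integer.
EXAMPLE_ROWS = [
    ("-", 1, "a"),
    ("-", 1, "a"),
    ("-", 2, "b"),
    ("++", 1, "a"),
]
EXAMPLE_COLUMNS = ["ATR1", "ATR2", "ATRN"]


def make_example_df(spark):
    # `spark` is assumed to be an already created SparkSession.
    # createDataFrame infers the column types from the Python values.
    return spark.createDataFrame(EXAMPLE_ROWS, EXAMPLE_COLUMNS)
```

With a session obtained through, for example, `SparkSession.builder.getOrCreate()`, calling `df = make_example_df(spark)` should give a dataframe equivalent to 'df'.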

Passing the dataframe 'df' as a parameter to 'method0' (the parameters 'atr_example_list' and 'ident' can be ignored for this simplified example), I want the calls to 'method1' to produce a column like this:

+------------+
| new_column |
+------------+
|   'other'  |
|   'other'  |
|   'other'  |
|    '++'    |
+------------+

So in 'method0', the new dataframe would be:

+-------+-------+-------+------------+
| ATR1  |  ATR2 | ATRN  | new_column |
+-------+-------+-------+------------+
| '-'   |   1   |  'a'  |   'other'  |
| '-'   |   1   |  'a'  |   'other'  |
| '-'   |   2   |  'b'  |   'other'  | 
| '++'  |   1   |  'a'  |    '++'    |
+-------+-------+-------+------------+

Could anyone help me?

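Can't you simplify this and use a single UDF, like this? ('method1' can take more than one column if necessary.)

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def method1(x):
  # x is the ATR1 value of a single row
  if x != "-":
    return x
  else:
    return 'other'

# Wrap the plain Python function as a UDF that returns a string
u_method1 = udf(method1, StringType())

# Apply the UDF to a column (not to the dataframe) and append the result
result = df.withColumn("new_column", u_method1("ATR1"))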
