I am writing a small PySpark program in which I want to create a User Defined Function (UDF) that calls 'method1' from a lambda function inside 'method0'.
I simplified the code to make it easier to understand, but the core functionality is this: for each row (instance) in a dataframe, 'method0' applies 'method1' (through a lambda function) to return a value based on the values of that row. If ATR1 of the row is not '-', the value for that row should be ATR1 itself; otherwise it should be 'other'.
The idea is to get a column from that UDF and attach it to the dataframe in 'method0'. Here is the simplified code:
    def method1(atr_list, instance, ident):
        if(instance.ATR1 != '-'):
            return instance.ATR1
        else:
            # Other operations ...
            return 'other'

    def method0(df, atr_example_list, ident):
        udf_func = udf(lambda instance: method1(atr_example_list, instance, ident), returnType=StringType())
        new_column = udf_func(df)
        df = df.withColumnRenamed("New_Column", new_column)
        return df

    result = method0(df, list, "1111")
But when I execute this code, I get the following error, and I don't really know why:
    Py4JError: An error occurred while calling o298.__getnewargs__. Trace:
    py4j.Py4JException: Method __getnewargs__([]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
        at py4j.Gateway.invoke(Gateway.java:272)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
Here is an example of input and the output I expect:
Dataframe 'df':
+-------+-------+-------+
| ATR1 | ATR2 | ATRN |
+-------+-------+-------+
| '-' | 1 | 'a' |
| '-' | 1 | 'a' |
| '-' | 2 | 'b' |
| '++' | 1 | 'a' |
+-------+-------+-------+
Passing the dataframe 'df' to 'method0' (the parameters 'atr_example_list' and 'ident' can be ignored for this simplified example), I want the calls to 'method1' to produce a column like this:
+------------+
| new_column |
+------------+
| 'other' |
| 'other' |
| 'other' |
| '++' |
+------------+
So in 'method0', the new dataframe would be:
+-------+-------+-------+------------+
| ATR1 | ATR2 | ATRN | new_column |
+-------+-------+-------+------------+
| '-' | 1 | 'a' | 'other' |
| '-' | 1 | 'a' | 'other' |
| '-' | 2 | 'b' | 'other' |
| '++' | 1 | 'a' | '++' |
+-------+-------+-------+------------+
Could anyone help me?
Can't you simplify this and use a single UDF like the following? ('method1' can take more than one column if necessary.)
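    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def method1(x):
        # Keep the value unless it is the '-' placeholder
        if x != "-":
            return x
        else:
            return 'other'

    # Wrap the plain Python function as a UDF that returns a string
    u_method1 = udf(method1, StringType())
    result = df.withColumn("new_column", u_method1("ATR1"))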
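If 'method1' really needs the whole row, as in the question, one possible approach is to pack the columns into a single struct column and pass that to the UDF instead of the dataframe itself. The following is only a sketch under some assumptions: 'atr_example_list' and 'ident' are plain Python values (a list and a string) that can be pickled, and 'FakeRow' is a hypothetical stand-in for a Spark Row that I added just to check the row logic without a cluster.

```python
from collections import namedtuple


def method1(atr_list, instance, ident):
    # `instance` is one row; fields are read by attribute, as in the question.
    if instance.ATR1 != '-':
        return instance.ATR1
    # Other operations using atr_list and ident would go here ...
    return 'other'


def method0(df, atr_example_list, ident):
    # Imported here so method1 stays testable without a Spark installation.
    from pyspark.sql.functions import struct, udf
    from pyspark.sql.types import StringType

    udf_func = udf(
        lambda instance: method1(atr_example_list, instance, ident),
        StringType(),
    )
    # struct(...) packs all columns into one column; the UDF receives it as a Row.
    return df.withColumn("new_column", udf_func(struct(*df.columns)))


# Quick check of the row logic with a stand-in for a Spark Row (no Spark needed).
# FakeRow is hypothetical and only mimics attribute access on a Row.
FakeRow = namedtuple("FakeRow", ["ATR1", "ATR2", "ATRN"])
print(method1([], FakeRow("-", 1, "a"), "1111"))
print(method1([], FakeRow("++", 1, "a"), "1111"))
```

With a real dataframe you would call it as result = method0(df, some_list, "1111"). Note that this uses withColumn to add the column; withColumnRenamed only renames an existing column and expects two column names as strings.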