Error when trying to execute User Defined Functions in PySpark over a DataFrame
I am writing a small program in PySpark in which I want to create a User Defined Function (UDF) that calls 'method1' from a lambda function inside 'method0'.
I simplified the code to make it easier to understand, but the core functionality is this: for each row of a dataframe, 'method0' applies 'method1' (with the help of a lambda function) to return a value that depends on the values of the row being inspected. If the first condition of 'method1' is met (ATR1 is not '-'), the value for that row should be the row's ATR1 value; otherwise it should be 'other'.
With those operations, the idea is to get a column from that UDF and attach it to the dataframe in 'method0'. Here is the simplified code:
def method1(atr_list, instance, ident):
    if(instance.ATR1 != '-'):
        return instance.ATR1
    else:
        # Other operations ...
        return 'other'

def method0(df, atr_example_list, ident):
    udf_func = udf(lambda instance: method1(atr_example_list, instance, ident), returnType=StringType())
    new_column = udf_func(df)
    df = df.withColumnRenamed("New_Column", new_column)
    return df

result = method0(df, list, "1111")
But when I execute this code, I get the following error and I don't really know why:
Py4JError: An error occurred while calling o298.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Here is an example of the input and the output I expect:
Dataframe 'df':
+-------+-------+-------+
| ATR1 | ATR2 | ATRN |
+-------+-------+-------+
| '-' | 1 | 'a' |
| '-' | 1 | 'a' |
| '-' | 2 | 'b' |
| '++' | 1 | 'a' |
+-------+-------+-------+
Passing the dataframe 'df' as a parameter to 'method0' (the parameters 'atr_example_list' and 'ident' can be ignored in this simplified example), I want the call to 'method1' to produce a column like this:
+------------+
| new_column |
+------------+
| 'other' |
| 'other' |
| 'other' |
| '++' |
+------------+
So in 'method0', the new dataframe would be:
+-------+-------+-------+------------+
| ATR1 | ATR2 | ATRN | new_column |
+-------+-------+-------+------------+
| '-' | 1 | 'a' | 'other' |
| '-' | 1 | 'a' | 'other' |
| '-' | 2 | 'b' | 'other' |
| '++' | 1 | 'a' | '++' |
+-------+-------+-------+------------+
Could anyone help me?
Can't you simplify this and use a single UDF like the following? (method1 can take more than one column if necessary.)
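from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def method1(x):
    if x != "-":
        return x
    else:
        return 'other'

# Wrap the plain Python function as a UDF that returns a string.
u_method1 = udf(method1, StringType())
# Pass the column name, not the dataframe, to the UDF.
result = df.withColumn("new_column", u_method1("ATR1"))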
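If 'method1' really needs the whole row, as in the question, one possible approach is to keep the original 'method0'/'method1' shape and pass a struct of all columns to the UDF instead of the dataframe itself. The sketch below rests on a few assumptions: that the Py4JError comes from calling the UDF with the DataFrame object rather than with columns (this seems likely but is not confirmed by the trace alone), that 'atr_example_list' and 'ident' are plain Python values that can be pickled, and that a struct column reaches a Python UDF as a Row whose fields can be read by attribute. It also uses withColumn, since withColumnRenamed expects two column names rather than a Column.

```python
def method1(atr_list, instance, ident):
    # Same logic as in the question: keep ATR1 unless it is '-'.
    if instance.ATR1 != '-':
        return instance.ATR1
    # Other operations ...
    return 'other'


def method0(df, atr_example_list, ident):
    # pyspark is imported here so that method1 can be exercised on its
    # own, without a Spark installation.
    from pyspark.sql.functions import struct, udf
    from pyspark.sql.types import StringType

    udf_func = udf(
        lambda instance: method1(atr_example_list, instance, ident),
        StringType(),
    )
    # Give the UDF a column, not the dataframe: struct() packs every
    # column of the current row into a single struct value.
    return df.withColumn("new_column", udf_func(struct(*df.columns)))
```

A call such as result = method0(df, atr_example_list, "1111") would then replace the original one. If only ATR1 is ever read, the single-column UDF above is simpler and moves less data through Python.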