
Error when trying to execute User Defined Functions in PySpark over a DataFrame

I am writing a small PySpark program in which I want to create a User Defined Function (UDF) that calls 'method1' through a lambda function inside 'method0'.

I simplified the code to make it easier to understand, but the core functionality is this: for each instance (row) in a dataframe, 'method0' applies 'method1' (with the help of a lambda function) to return a value that depends on the values of the instance being inspected. If the first condition of 'method1' is met (ATR1 is not '-'), the value for that instance should be its ATR1 value; otherwise, it should be 'other'.

With those operations, the idea is to get a column from that UDF and attach it to the dataframe in 'method0'. Here is the simplified code:

def method1(atr_list, instance, ident):

    if(instance.ATR1 != '-'):
        return instance.ATR1
    else:
        # Other operations ...
        return 'other'

def method0(df, atr_example_list, ident):

    udf_func = udf(lambda instance: method1(atr_example_list, instance, ident), returnType=StringType())
    new_column = udf_func(df)
    df = df.withColumnRenamed("New_Column", new_column)
    return df

result = method0(df, list, "1111")

But when I execute this code, I get the following error, and I don't really know why:

Py4JError: An error occurred while calling o298.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Here is an example of the input and the output I expect:

Dataframe 'df':

+-------+-------+-------+
| ATR1  |  ATR2 | ATRN  |
+-------+-------+-------+
| '-'   |   1   |  'a'  |
| '-'   |   1   |  'a'  |
| '-'   |   2   |  'b'  | 
| '++'  |   1   |  'a'  |
+-------+-------+-------+

Passing the dataframe 'df' as a parameter to 'method0' (the parameters 'atr_example_list' and 'ident' can be ignored for this simplified example), I want the call to 'method1' to produce a column like this:

+------------+
| new_column |
+------------+
|   'other'  |
|   'other'  |
|   'other'  |
|    '++'    |
+------------+

So in 'method0', the new dataframe would be:

+-------+-------+-------+------------+
| ATR1  |  ATR2 | ATRN  | new_column |
+-------+-------+-------+------------+
| '-'   |   1   |  'a'  |   'other'  |
| '-'   |   1   |  'a'  |   'other'  |
| '-'   |   2   |  'b'  |   'other'  | 
| '++'  |   1   |  'a'  |    '++'    |
+-------+-------+-------+------------+
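
As a sanity check on the expected output, the per-row rule can be sketched in plain Python, with no Spark involved. The `rows` list is copied from the example table; the helper name `expected_value` is only illustrative and is not part of the original code.

```python
# Sample rows copied from the example dataframe: (ATR1, ATR2, ATRN).
rows = [("-", 1, "a"), ("-", 1, "a"), ("-", 2, "b"), ("++", 1, "a")]


def expected_value(atr1):
    """Keep ATR1 unless it is the '-' placeholder, which maps to 'other'."""
    return atr1 if atr1 != "-" else "other"


# One output value per input row, in the same order as the table.
new_column = [expected_value(atr1) for atr1, _atr2, _atrn in rows]
print(new_column)  # ['other', 'other', 'other', '++']
```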

Could anyone help me?

Can't you simplify this and use a single UDF like this ('method1' can take more than one column if necessary)?

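from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def method1(x):
  # Keep the ATR1 value unless it is the '-' placeholder.
  if x != "-":
    return x
  else:
    return 'other'

# Wrap the plain Python function as a UDF that returns a string.
u_method1 = udf(method1, StringType())

# Apply the UDF to the ATR1 column and append the result as new_column.
result = df.withColumn("new_column", u_method1("ATR1"))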
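
If 'method1' really needs the whole row plus the extra 'atr_example_list' and 'ident' arguments, one possible variant keeps the two-function layout from the question. This is only a sketch under a few assumptions: packing all columns with `struct` is a technique swapped in here rather than something the answer proposes, the output column name `new_column` is taken from the expected output, and the lambda must capture only plain Python values (not the dataframe or the Spark session).

```python
def method1(atr_list, instance, ident):
    # `instance` arrives as a Row built from the struct column below,
    # so attribute access such as instance.ATR1 works.
    if instance.ATR1 != '-':
        return instance.ATR1
    # Other operations using atr_list / ident would go here.
    return 'other'


def method0(df, atr_example_list, ident):
    # Imported here so method1 stays usable (and testable) without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # The lambda captures only a list and a string, never the dataframe.
    udf_func = F.udf(
        lambda instance: method1(atr_example_list, instance, ident),
        StringType(),
    )
    # Pack every column into one struct column so the UDF sees the whole row.
    whole_row = F.struct(*[F.col(c) for c in df.columns])
    # withColumn(name, Column) appends the UDF result as a new column.
    return df.withColumn("new_column", udf_func(whole_row))
```

Compared with the question's code, the UDF is called on a column expression instead of on the dataframe itself, and the result is attached with `withColumn`, since `withColumnRenamed` only renames an existing column and expects two strings.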
