When do I need to use lambda (and when not) while creating a PySpark UDF?
I do not completely understand when I need to use a lambda function in the definition of a UDF.
My prior understanding was that I needed lambda in order for the DataFrame to recognize that it has to iterate over each row, but I have seen many applications of UDFs without a lambda expression.
For example:
I have a silly function that works well like this without using lambda:
@udf("string")
def unknown_city(s, city):
    if s in ('KS', 'MI'):
        return 'Unknown'
    else:
        return city

display(
    df2.withColumn("new_city", unknown_city(col('geo.state'), col('geo.city')))
)
How can I make it work with lambda? Is it necessary?
A Python lambda is just another way to write a function. See the example code below and you will see they behave exactly the same; the only difference is that a lambda is limited to a single expression.
With a lambda function
from pyspark.sql import functions as F
from pyspark.sql import types as T
df.withColumn('num+1', F.udf(lambda num: num + 1, T.IntegerType())('num')).show()
# +---+-----+
# |num|num+1|
# +---+-----+
# | 10| 11|
# | 20| 21|
# +---+-----+
With a normal (named) function
from pyspark.sql import functions as F
from pyspark.sql import types as T
def numplus2(num):
    return num + 2

df.withColumn('num+2', F.udf(numplus2, T.IntegerType())('num')).show()
# +---+-----+
# |num|num+2|
# +---+-----+
# | 10| 12|
# | 20| 22|
# +---+-----+
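To answer the "how can I make it work with lambda" part for the original unknown_city example: the same logic can be written as a one-line conditional expression and passed directly to F.udf, with no decorator. A minimal sketch below shows the plain-Python equivalence; the Spark registration is left in comments because it needs an active SparkSession, and the DataFrame/column names (df2, geo.state, geo.city) are assumed from the question.

```python
# The decorator form and the lambda form wrap the same plain Python callable.
unknown_city = lambda s, city: 'Unknown' if s in ('KS', 'MI') else city

# With a SparkSession available, registering it as a UDF would look like:
# from pyspark.sql import functions as F, types as T
# udf_unknown_city = F.udf(unknown_city, T.StringType())
# df2.withColumn("new_city", udf_unknown_city(F.col('geo.state'), F.col('geo.city')))

# The wrapped callable behaves the same either way:
print(unknown_city('KS', 'Wichita'))  # Unknown
print(unknown_city('CA', 'Fresno'))   # Fresno
```

So no, a lambda is never necessary: F.udf accepts any Python callable, and the decorator form, the lambda form, and a named function all produce the same UDF.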