How to add a column based on a function to a Pandas-on-Spark DataFrame?
I would like to run a UDF on a pandas-on-Spark DataFrame. I thought it would be easy, but I'm having a tough time figuring it out.
For example, consider my psdf (pandas-on-Spark DataFrame):
name p1 p2
0 AAA 1.0 1.0
1 BBB 1.0 1.0
I have a simple function:
import math

def f(a: float, b: float) -> float:
    return math.sqrt(a**2 + b**2)
I expect the psdf below:
name p1 p2 R
0 AAA 1.0 1.0 1.4
1 BBB 1.0 1.0 1.4
The function is quite dynamic and I showed only a sample here. For example, there is another function that takes 3 arguments.
I tried the code below, but I get an error saying the compute.ops_on_diff_frames option is not set, and the documentation says enabling it is expensive. Hence, I want to avoid it.
psdf["R"] = psdf[["p1","p2"]].apply(lambda x: f(*x), axis=1)
Note: I saw that one can convert to a normal Spark DataFrame and use withColumn, but I'm not sure whether that carries a performance penalty.
Any suggestions?
You can convert to a Spark DataFrame and apply a pandas_udf. Converting from/to Koalas has minor overhead compared to applying a Python UDF. Also, you should look at using a pandas_udf, which is more efficient than row-based UDFs.
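A minimal sketch of that approach, assuming pyspark >= 3.2 (where Koalas is available as pyspark.pandas); the name hypot_udf is mine, everything else comes from the question:

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf

sdf = psdf.to_spark()  # pandas-on-Spark -> plain Spark DataFrame

@pandas_udf("double")
def hypot_udf(a: pd.Series, b: pd.Series) -> pd.Series:
    # Vectorized: runs on whole column batches instead of row by row.
    return np.sqrt(a**2 + b**2)

sdf = sdf.withColumn("R", hypot_udf("p1", "p2"))
psdf = sdf.pandas_api()  # back to pandas-on-Spark

Note that the UDF body operates on pandas Series, so math.sqrt from the question is replaced with the vectorized np.sqrt.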
As a developer, I always go for the simplest available way to solve a problem; otherwise you create problems for yourself now and in the future. pandas UDFs are useful for getting at pandas functionality that is missing in PySpark, but in this case everything you need is already available in PySpark. Remember that apply with a lambda is an anti-pattern in pandas. I suggest you use higher-order functions instead. Logic and code below:
from pyspark.sql.functions import array, expr

new = (df.withColumn('x', array(*[c for c in df.columns if c != 'name']))  # collect every column except name into an array
         .withColumn('y', expr("sqrt(aggregate(x, cast(0 as double), (acc, v) -> acc + v * v))"))  # higher-order function: accumulate the sum of squares, then take the square root
      )
new.show()
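If the result should match the expected psdf exactly, a small follow-up: drop the helper array column and rename y to R (drop and withColumnRenamed are plain DataFrame methods):

result = new.drop('x').withColumnRenamed('y', 'R')
result.show()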