How to add a column based on a function to a Pandas-on-Spark DataFrame?
I would like to run a UDF on a pandas-on-Spark DataFrame. I thought it would be easy, but I'm having a tough time figuring it out.
For example, consider my psdf (pandas-on-Spark DataFrame):
name p1 p2
0 AAA 1.0 1.0
1 BBB 1.0 1.0
I have a simple function:
import math

def f(a: float, b: float) -> float:
    return math.sqrt(a**2 + b**2)
I expect the psdf below:
name p1 p2 R
0 AAA 1.0 1.0 1.4
1 BBB 1.0 1.0 1.4
The function is quite dynamic and I showed only a sample here. For example, there is another function that takes 3 arguments.
I tried the code below, but I get an error saying the compute.ops_on_diff_frames option is not set, and the documentation says enabling it is expensive. Hence, I want to avoid it.
psdf["R"] = psdf[["p1","p2"]].apply(lambda x: f(*x), axis=1)
Note: I saw that one can convert to a normal Spark DataFrame and use withColumn, but I'm not sure whether that carries a performance penalty.
Any suggestions?
You can convert to a Spark DataFrame and apply a pandas_udf. Converting from/to Koalas has minor overhead compared to applying a Python UDF. Also, you should look at using a pandas_udf, which is more efficient than row-based UDFs.
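A minimal sketch of that approach, assuming pyspark >= 3.2 (where Koalas is available as pyspark.pandas); the name hypot_udf is mine, everything else comes from the question:

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf

sdf = psdf.to_spark()  # pandas-on-Spark -> plain Spark DataFrame

@pandas_udf("double")
def hypot_udf(a: pd.Series, b: pd.Series) -> pd.Series:
    # Vectorized: runs on whole column batches instead of row by row.
    return np.sqrt(a**2 + b**2)

sdf = sdf.withColumn("R", hypot_udf("p1", "p2"))
psdf = sdf.pandas_api()  # back to pandas-on-Spark

Note that the UDF body operates on pandas Series, so math.sqrt from the question is replaced with the vectorized np.sqrt.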
As a developer, I always go for the simplest available way to solve a problem; otherwise you create problems for yourself now and in the future. pandas UDFs are useful for getting at pandas functionality that is missing in PySpark, but in this case everything you need is already available in PySpark. Remember that apply with a lambda is an anti-pattern in pandas. I suggest you use higher-order functions instead. Logic and code below:
from pyspark.sql.functions import array, expr

new = (df.withColumn('x', array(*[c for c in df.columns if c != 'name']))  # collect every column except name into an array
         .withColumn('y', expr("sqrt(aggregate(x, cast(0 as double), (acc, v) -> acc + v * v))"))  # higher-order function: accumulate the sum of squares, then take the square root
      )
new.show()
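If the result should match the expected psdf exactly, a small follow-up: drop the helper array column and rename y to R (drop and withColumnRenamed are plain DataFrame methods):

result = new.drop('x').withColumnRenamed('y', 'R')
result.show()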