
How to add a column based on a function to Pandas on Spark DataFrame?

I would like to run a udf on a Pandas on Spark dataframe. I thought it would be easy, but I'm having a tough time figuring it out.

For example, consider my psdf (Pandas on Spark DataFrame):

     name     p1     p2
0     AAA    1.0    1.0
1     BBB    1.0    1.0

I have a simple function:

import math

def f(a: float, b: float) -> float:
    return math.sqrt(a**2 + b**2)

I expect the psdf below:

     name     p1     p2    R
0     AAA    1.0    1.0  1.4
1     BBB    1.0    1.0  1.4

The function is quite dynamic and I have shown only a sample here. For example, there is another function with 3 arguments.

I tried the code below but got an error about the compute.ops_on_diff_frames parameter not being set, and the documentation says enabling it is expensive. Hence, I want to avoid it.

psdf["R"] = psdf[["p1","p2"]].apply(lambda x: f(*x), axis=1)

Note: I saw that one can convert to a normal Spark dataframe and use withColumn, but I'm not sure whether that carries a performance penalty. A sketch of what I mean is below.
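For reference, a minimal sketch of that conversion, assuming the psdf above (the sqrt expression here just mirrors my function f):

from pyspark.sql import functions as F

sdf = psdf.to_spark()  # pandas-on-Spark -> plain Spark DataFrame
sdf = sdf.withColumn("R", F.sqrt(F.col("p1")**2 + F.col("p2")**2))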

Any suggestions?

You can convert to a Spark df and apply a pandas_udf. Converting from/to koalas has minor overhead compared to applying a Python udf. Also, you should look at using a pandas_udf, which is more efficient than row-based udfs.
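A minimal sketch of that route, assuming the psdf from the question (the function name hypot_udf is illustrative):

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def hypot_udf(a: pd.Series, b: pd.Series) -> pd.Series:
    # a pandas_udf operates on whole batches of rows at once, unlike a row-based udf
    return (a**2 + b**2) ** 0.5

sdf = psdf.to_spark()                             # pandas-on-Spark -> Spark DataFrame
sdf = sdf.withColumn("R", hypot_udf("p1", "p2"))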

As a developer, I always go for the simplest available way to solve a problem; otherwise you create problems for yourself now and in the future. pandas udfs are useful for getting at pandas functionality that is missing in pyspark, but in this case everything you need is already available in pyspark. Remember that apply with a lambda is an anti-pattern in pandas. I suggest you use higher-order functions. Logic and code below (assuming df is the plain Spark DataFrame, e.g. from psdf.to_spark()):

from pyspark.sql.functions import array, expr

new = (df.withColumn('x', array(*[c for c in df.columns if c != 'name']))  # collect every column except name into an array
       .withColumn('R', expr("sqrt(aggregate(x, cast(0 as double), (acc, v) -> acc + v * v))"))  # sum of squares, then square root ('reduce' is an alias for 'aggregate' in newer Spark versions)
      )
new.show()
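If you need the result back as a pandas-on-Spark frame afterwards, a sketch assuming Spark 3.2+ (where DataFrame.pandas_api() is available):

psdf_result = new.drop('x').pandas_api()  # drop the helper array column and return to pandas-on-Spark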
