
Calculate a new column with PySpark or pandas_on_spark

I have this piece of code (below) that works perfectly with Pandas, but it is computationally too expensive to convert a large DataFrame to Pandas just for this operation. I'm looking for an alternative way to do the same thing in pandas-on-spark.

new_value = sum(df[col1]*df[col2])/sum(df[col2])

With pandas-on-spark I got the following error:

> PandasNotImplementedError: The method pd.Series.__iter__() is not implemented.
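A note on the error: the builtin sum() iterates over the Series element by element, which pandas-on-spark does not support; the Series .sum() method keeps the aggregation on Spark. A minimal sketch, assuming a pandas-on-spark DataFrame with columns col1 and col2 (the sample data below is hypothetical):

import pyspark.pandas as ps

# hypothetical sample data; column names match the question
df = ps.DataFrame({"col1": [1.0, 1.0, 2.0], "col2": [10, 10, 11]})

# method-based .sum() stays distributed, unlike the builtin sum()
new_value = (df["col1"] * df["col2"]).sum() / df["col2"].sum()
print(new_value)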

You can do it with a Spark DataFrame:

>>> from pyspark.sql.functions import col, sum as _sum
>>> df = df.withColumn("multiply", col("col1") * col("col2"))
>>> df.show(5)
+----+-----+--------+
|col1| col2|multiply|
+----+-----+--------+
| 001|   10|    10.0|
| 001|   10|    10.0|
| 002|   11|    22.0|
| 002|11878| 23756.0|
| 002|  117|   234.0|
+----+-----+--------+
only showing top 5 rows

>>> df.select(_sum('multiply')/_sum('col2')).show()
+---------------------------+
|(sum(multiply) / sum(col2))|
+---------------------------+
|          2.994753471275567|
+---------------------------+
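If you need the result back as a plain Python number rather than a one-row DataFrame, you can collect it. A minimal sketch, reusing the same df and folding the two steps above into a single aggregation (the alias new_value is just illustrative):

from pyspark.sql.functions import col, sum as _sum

# weighted average as a single aggregation, returned as a Python scalar
new_value = (
    df.select((_sum(col("col1") * col("col2")) / _sum("col2")).alias("new_value"))
      .first()["new_value"]
)
print(new_value)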
