
Calculate a new column with PySpark or pandas_on_spark

I have this piece of code (below) that works perfectly with Pandas, but it is computationally too expensive to convert a large DataFrame to Pandas just for this operation. I'm looking for an alternative way to do the same thing in pandas-on-spark.

new_value = sum(df[col1]*df[col2])/sum(df[col2])

With pandas-on-spark I got the following error:

> PandasNotImplementedError: The method pd.Series.__iter__() is not implemented.
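A note on the error: the builtin sum() iterates over the Series element by element, which pandas-on-spark does not support; the Series .sum() method keeps the aggregation on Spark. A minimal sketch, assuming a pandas-on-spark DataFrame with columns col1 and col2 (the sample data below is hypothetical):

import pyspark.pandas as ps

# hypothetical sample data; column names match the question
df = ps.DataFrame({"col1": [1.0, 1.0, 2.0], "col2": [10, 10, 11]})

# method-based .sum() stays distributed, unlike the builtin sum()
new_value = (df["col1"] * df["col2"]).sum() / df["col2"].sum()
print(new_value)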

You can do it with a Spark DataFrame:

>>> from pyspark.sql.functions import col, sum as _sum
>>> df = df.withColumn("multiply", col("col1") * col("col2"))
>>> df.show(5)
+----+-----+--------+
|col1| col2|multiply|
+----+-----+--------+
| 001|   10|    10.0|
| 001|   10|    10.0|
| 002|   11|    22.0|
| 002|11878| 23756.0|
| 002|  117|   234.0|
+----+-----+--------+
only showing top 5 rows

>>> df.select(_sum('multiply')/_sum('col2')).show()
+---------------------------+
|(sum(multiply) / sum(col2))|
+---------------------------+
|          2.994753471275567|
+---------------------------+
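If you need the result back as a plain Python number rather than a one-row DataFrame, you can collect it. A minimal sketch, reusing the same df and folding the two steps above into a single aggregation (the alias new_value is just illustrative):

from pyspark.sql.functions import col, sum as _sum

# weighted average as a single aggregation, returned as a Python scalar
new_value = (
    df.select((_sum(col("col1") * col("col2")) / _sum("col2")).alias("new_value"))
      .first()["new_value"]
)
print(new_value)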
