
pyspark equivalent of `df.loc`?

I am looking for the pyspark equivalent of a pandas DataFrame operation. In particular, I want to do the following on a pyspark DataFrame:

# in pandas dataframe, I can do the following operation
# assuming df = pandas dataframe
index = df['column_A'] > 0.0
amount = (sum(df.loc[index, 'column_B'] * df.loc[index, 'column_C'])
          / sum(df.loc[index, 'column_C']))

What is the pyspark equivalent of doing this on a pyspark DataFrame?

Spark DataFrames don't have a strict order, so indexing is not meaningful. Instead we use an SQL-like DSL. Here you'd use where (filter) and select. If the data looked like this:

import pandas as pd
import numpy as np
from pyspark.sql.functions import col, sum as sum_

np.random.seed(1)

df = pd.DataFrame({
   c: np.random.randn(1000) for c in ["column_A", "column_B", "column_C"]
})
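
For reference, amount here is what you get by applying the pandas expression from the question to this frame:

index = df['column_A'] > 0.0
amount = (sum(df.loc[index, 'column_B'] * df.loc[index, 'column_C'])
          / sum(df.loc[index, 'column_C']))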

amount would be

amount
# 0.9334143225687774

and the Spark equivalent is:

sdf = spark.createDataFrame(df)

(amount_, ) = (sdf
    .where(sdf.column_A > 0.0)
    .select(sum_(sdf.column_B * sdf.column_C) / sum_(sdf.column_C))
    .first())

and the results are numerically equivalent:

abs(amount - amount_)
# 1.1102230246251565e-16

You could also use conditionals:

from pyspark.sql.functions import when

pred = col("column_A") > 0.0

amount_expr = sum_(
  when(pred, col("column_B")) * when(pred, col("column_C"))
) / sum_(when(pred, col("column_C")))

sdf.select(amount_expr).first()[0]
# 0.9334143225687773

which looks more pandas-like, but is more verbose.
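
A slightly more compact variant of the same idea is to apply when once to the product, since sum ignores the nulls produced for rows where the predicate is false. A minimal sketch using the same pyspark functions:

from pyspark.sql.functions import when, col, sum as sum_

pred = col("column_A") > 0.0

# sum over only the rows where pred holds; non-matching rows become null and are skipped
amount_expr = (sum_(when(pred, col("column_B") * col("column_C")))
               / sum_(when(pred, col("column_C"))))

sdf.select(amount_expr).first()[0]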

This is simple enough to do with the RDD (I'm not as familiar with spark.sql.DataFrame):

x, y = (df.rdd
        .filter(lambda x: x.column_A > 0.0)
        .map(lambda x: (x.column_B*x.column_C, x.column_C))
        .reduce(lambda x, y: (x[0]+y[0], x[1]+y[1])))
amount = x / y

Or filter the DataFrame and then jump into the RDD:

x, y = (df
        .filter(df.column_A > 0.0)
        .rdd
        .map(lambda x: (x.column_B*x.column_C, x.column_C))
        .reduce(lambda x, y: (x[0]+y[0], x[1]+y[1])))
amount = x / y

After a little digging, I'm not sure this is the most efficient way to do it, but it avoids stepping into the RDD:

x, y = (df
        .filter(df.column_A > 0.0)
        .select((df.column_B * df.column_C).alias("product"), df.column_C)
        .agg({'product': 'sum', 'column_C':'sum'})).first()
amount = x / y
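
A variation of the same aggregation (a minimal sketch) uses explicit sum expressions with aliases, so the two values in the resulting Row can be read by name instead of relying on the order of the dict keys:

from pyspark.sql.functions import sum as sum_

# aggregate numerator and denominator in one pass over the filtered frame
row = (df
       .filter(df.column_A > 0.0)
       .agg(sum_(df.column_B * df.column_C).alias("num"),
            sum_(df.column_C).alias("den"))
       .first())
amount = row["num"] / row["den"]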

A more pyspark-style answer, which is fast:

import pyspark.sql.functions as f

# keep B*C and C only for rows where column_A > 0, zero elsewhere
sdf = (sdf
       .withColumn('sump', f.when(f.col('column_A') > 0, f.col('column_B') * f.col('column_C')).otherwise(0))
       .withColumn('sumc', f.when(f.col('column_A') > 0, f.col('column_C')).otherwise(0)))
z = sdf.select(f.sum(f.col('sump')) / f.sum(f.col('sumc'))).collect()
print(z[0][0])
