
pyspark equivalent of `df.loc`?

I am looking for the pyspark equivalent of a pandas DataFrame operation. In particular, I want to do the following on a pyspark DataFrame:

# in pandas dataframe, I can do the following operation
# assuming df = pandas dataframe
index = df['column_A'] > 0.0
amount = (sum(df.loc[index, 'column_B'] * df.loc[index, 'column_C'])
          / sum(df.loc[index, 'column_C']))

What is the pyspark equivalent of doing this on a pyspark DataFrame?

Spark DataFrames don't have a strict order, so indexing is not meaningful. Instead we use an SQL-like DSL. Here you'd use where (filter) and select. If the data looked like this:

import pandas as pd
import numpy as np
from pyspark.sql.functions import col, sum as sum_

np.random.seed(1)

df = pd.DataFrame({
   c: np.random.randn(1000) for c in ["column_A", "column_B", "column_C"]
})
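
For reference, amount here is what you get by applying the pandas expression from the question to this frame:

index = df['column_A'] > 0.0
amount = (sum(df.loc[index, 'column_B'] * df.loc[index, 'column_C'])
          / sum(df.loc[index, 'column_C']))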

amount would be

amount
# 0.9334143225687774

and the Spark equivalent is:

sdf = spark.createDataFrame(df)

(amount_, ) = (sdf
    .where(sdf.column_A > 0.0)
    .select(sum_(sdf.column_B * sdf.column_C) / sum_(sdf.column_C))
    .first())

and the results are numerically equivalent:

abs(amount - amount_)
# 1.1102230246251565e-16

You could also use conditionals:

from pyspark.sql.functions import when

pred = col("column_A") > 0.0

amount_expr = sum_(
  when(pred, col("column_B")) * when(pred, col("column_C"))
) / sum_(when(pred, col("column_C")))

sdf.select(amount_expr).first()[0]
# 0.9334143225687773

which looks more pandas-like, but is more verbose.
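
A slightly more compact variant of the same idea is to apply when once to the product, since sum ignores the nulls produced for rows where the predicate is false. A minimal sketch using the same pyspark functions:

from pyspark.sql.functions import when, col, sum as sum_

pred = col("column_A") > 0.0

# sum over only the rows where pred holds; non-matching rows become null and are skipped
amount_expr = (sum_(when(pred, col("column_B") * col("column_C")))
               / sum_(when(pred, col("column_C"))))

sdf.select(amount_expr).first()[0]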

This is simple enough to do with the RDD (I'm not as familiar with spark.sql.DataFrame):

x, y = (df.rdd
        .filter(lambda x: x.column_A > 0.0)
        .map(lambda x: (x.column_B*x.column_C, x.column_C))
        .reduce(lambda x, y: (x[0]+y[0], x[1]+y[1])))
amount = x / y

Or filter the DataFrame and then jump into the RDD:

x, y = (df
        .filter(df.column_A > 0.0)
        .rdd
        .map(lambda x: (x.column_B*x.column_C, x.column_C))
        .reduce(lambda x, y: (x[0]+y[0], x[1]+y[1])))
amount = x / y

After a little digging, I'm not sure this is the most efficient way to do it, but it avoids stepping into the RDD:

x, y = (df
        .filter(df.column_A > 0.0)
        .select((df.column_B * df.column_C).alias("product"), df.column_C)
        .agg({'product': 'sum', 'column_C':'sum'})).first()
amount = x / y
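
A variation of the same aggregation (a minimal sketch) uses explicit sum expressions with aliases, so the two values in the resulting Row can be read by name instead of relying on the order of the dict keys:

from pyspark.sql.functions import sum as sum_

# aggregate numerator and denominator in one pass over the filtered frame
row = (df
       .filter(df.column_A > 0.0)
       .agg(sum_(df.column_B * df.column_C).alias("num"),
            sum_(df.column_C).alias("den"))
       .first())
amount = row["num"] / row["den"]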

A more pyspark-style answer, which is fast:

import pyspark.sql.functions as f

# keep B*C and C only for rows where column_A > 0, zero elsewhere
sdf = (sdf
       .withColumn('sump', f.when(f.col('column_A') > 0, f.col('column_B') * f.col('column_C')).otherwise(0))
       .withColumn('sumc', f.when(f.col('column_A') > 0, f.col('column_C')).otherwise(0)))
z = sdf.select(f.sum(f.col('sump')) / f.sum(f.col('sumc'))).collect()
print(z[0][0])
