
How to apply a function to a set of columns of a PySpark dataframe by rows?

Given a dataframe like:

   A0  A1  A2  A3
0   9   1   2   8
1   9   7   6   9
2   1   7   4   6
3   0   8   4   8
4   0   1   6   0
5   7   1   4   3
6   6   3   5   9
7   3   3   2   8
8   6   3   0   8
9   3   2   7   1

I need to apply a function to a set of columns, row by row, to create a new column with the results of this function.

An example in Pandas is:

import numpy as np
import pandas as pd

# build a 10x4 frame of random integers in [0, 10)
df = pd.DataFrame(data=None, columns=['A0', 'A1', 'A2', 'A3'])
df['A0'] = np.random.randint(0, 10, 10)
df['A1'] = np.random.randint(0, 10, 10)
df['A2'] = np.random.randint(0, 10, 10)
df['A3'] = np.random.randint(0, 10, 10)

df['mean'] = df.mean(axis=1)                                    # row-wise mean of A0..A3
df['std'] = df.iloc[:, :-1].std(axis=1)                         # row-wise sample std, excluding 'mean'
df['any'] = df.iloc[:, :-2].apply(lambda x: np.sum(x), axis=1)  # row-wise sum, excluding 'mean' and 'std'

And the result is:

   A0  A1  A2  A3  mean       std  any
0   9   1   2   8  5.00  4.082483   20
1   9   7   6   9  7.75  1.500000   31
2   1   7   4   6  4.50  2.645751   18
3   0   8   4   8  5.00  3.829708   20
4   0   1   6   0  1.75  2.872281    7
5   7   1   4   3  3.75  2.500000   15
6   6   3   5   9  5.75  2.500000   23
7   3   3   2   8  4.00  2.708013   16
8   6   3   0   8  4.25  3.500000   17
9   3   2   7   1  3.25  2.629956   13

How can I do something similar in PySpark?

For Spark 2.4+, you can use the aggregate function. First, create an array column named values from all the dataframe columns. Then calculate the std, mean and any columns like this:

  • any : aggregate the array to sum its elements
  • mean : divide the any column by the size of the array values
  • std : aggregate to sum (x - mean) ** 2, divide by size(values) - 1, then take the square root

Here is the associated code:

from pyspark.sql.functions import expr, sqrt, size, col, array

data = [
    (9, 1, 2, 8), (9, 7, 6, 9), (1, 7, 4, 6),
    (0, 8, 4, 8), (0, 1, 6, 0), (7, 1, 4, 3),
    (6, 3, 5, 9), (3, 3, 2, 8), (6, 3, 0, 8),
    (3, 2, 7, 1)
]
df = spark.createDataFrame(data, ['A0', 'A1', 'A2', 'A3'])

cols = df.columns

df.withColumn("values", array(*cols)) \
  .withColumn("any", expr("aggregate(values, 0D, (acc, x) -> acc + x)")) \
  .withColumn("mean", col("any") / size(col("values"))) \
  .withColumn("std", sqrt(expr("""aggregate(values, 0D, 
                                           (acc, x) -> acc + power(x - mean, 2), 
                                           acc -> acc / (size(values) -1))"""
                              )
                         )) \
  .drop("values") \
  .show(truncate=False)

#+---+---+---+---+----+----+------------------+
#|A0 |A1 |A2 |A3 |any |mean|std               |
#+---+---+---+---+----+----+------------------+
#|9  |1  |2  |8  |20.0|5.0 |4.08248290463863  |
#|9  |7  |6  |9  |31.0|7.75|1.5               |
#|1  |7  |4  |6  |18.0|4.5 |2.6457513110645907|
#|0  |8  |4  |8  |20.0|5.0 |3.8297084310253524|
#|0  |1  |6  |0  |7.0 |1.75|2.8722813232690143|
#|7  |1  |4  |3  |15.0|3.75|2.5               |
#|6  |3  |5  |9  |23.0|5.75|2.5               |
#|3  |3  |2  |8  |16.0|4.0 |2.70801280154532  |
#|6  |3  |0  |8  |17.0|4.25|3.5               |
#|3  |2  |7  |1  |13.0|3.25|2.6299556396765835|
#+---+---+---+---+----+----+------------------+

Spark < 2.4:

You can use functools.reduce and operator.add to sum the columns. The logic remains the same as above:

from functools import reduce
from operator import add

# df, cols, col and sqrt are reused from the snippet above
df.withColumn("any", reduce(add, [col(c) for c in cols])) \
  .withColumn("mean", col("any") / len(cols)) \
  .withColumn("std", sqrt(reduce(add, [(col(c) - col("mean")) ** 2 for c in cols]) / (len(cols) - 1))) \
  .show(truncate=False)

The above answer is great; however, the OP is using Python/PySpark, and if you don't understand Spark SQL expressions the logic above is not 100% clear.

I would suggest using a Pandas UDAF; unlike plain UDFs, these are vectorized and very efficient. They were added to the Spark API to lower the learning curve needed to migrate from pandas to Spark. It also means your code is more maintainable if most of your colleagues, like mine, are more familiar with Pandas/Python. See the sketch after the list below.

These are the types of Pandas UDAFs available and their Pandas equivalents (format: SparkUdafType → Pandas equivalent, works on → returns):

  • SCALAR → df.transform(...), mapping: Series → Series
  • GROUPED_MAP → df.apply(...), group & map: DataFrame → DataFrame
  • GROUPED_AGG → df.aggregate(...), reduce: Series → scalar
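
For this row-wise use case the SCALAR type is the relevant one. Below is a minimal sketch, assuming Spark 2.4+ with PyArrow installed and reusing the df created above; the helper names row_sum, row_mean and row_std are just illustrative, not part of any API. (In Spark 3.x the PandasUDFType decorator style is superseded by type hints, but the idea is the same.)

import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

# SCALAR pandas UDFs receive each input column as a pandas Series and
# must return a Series of the same length.
@pandas_udf("double", PandasUDFType.SCALAR)
def row_sum(*cols):
    return pd.concat(cols, axis=1).sum(axis=1).astype("float64")

@pandas_udf("double", PandasUDFType.SCALAR)
def row_mean(*cols):
    return pd.concat(cols, axis=1).mean(axis=1)

@pandas_udf("double", PandasUDFType.SCALAR)
def row_std(*cols):
    return pd.concat(cols, axis=1).std(axis=1)  # sample std (ddof=1), as in Pandas

in_cols = [col(c) for c in ['A0', 'A1', 'A2', 'A3']]

df.withColumn("any", row_sum(*in_cols)) \
  .withColumn("mean", row_mean(*in_cols)) \
  .withColumn("std", row_std(*in_cols)) \
  .show(truncate=False)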
