How to apply a function to a set of columns of a PySpark dataframe by rows?
Given a dataframe like:
A0 A1 A2 A3
0 9 1 2 8
1 9 7 6 9
2 1 7 4 6
3 0 8 4 8
4 0 1 6 0
5 7 1 4 3
6 6 3 5 9
7 3 3 2 8
8 6 3 0 8
9 3 2 7 1
I need to apply a function row by row to a set of the columns, creating a new column that holds the results of this function.
An example in Pandas is:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=None, columns=['A0', 'A1', 'A2', 'A3'])
df['A0'] = np.random.randint(0, 10, 10)
df['A1'] = np.random.randint(0, 10, 10)
df['A2'] = np.random.randint(0, 10, 10)
df['A3'] = np.random.randint(0, 10, 10)
df['mean'] = df.mean(axis=1)
df['std'] = df.iloc[:, :-1].std(axis=1)
df['any'] = df.iloc[:, :-2].apply(lambda x: np.sum(x), axis=1)
And the result is:
A0 A1 A2 A3 mean std any
0 9 1 2 8 5.00 4.082483 20
1 9 7 6 9 7.75 1.500000 31
2 1 7 4 6 4.50 2.645751 18
3 0 8 4 8 5.00 3.829708 20
4 0 1 6 0 1.75 2.872281 7
5 7 1 4 3 3.75 2.500000 15
6 6 3 5 9 5.75 2.500000 23
7 3 3 2 8 4.00 2.708013 16
8 6 3 0 8 4.25 3.500000 17
9 3 2 7 1 3.25 2.629956 13
How can I do something similar in PySpark?
For Spark 2.4+, you can use the aggregate function. First, create an array column values from all the dataframe columns. Then calculate the any, mean and std columns like this:
- any: aggregate to sum the array elements
- mean: divide the any column by the size of the array values
- std: aggregate to sum (x - mean) ** 2, then divide by the length - 1 of the array
Here is the associated code:
from pyspark.sql.functions import expr, sqrt, size, col, array
data = [
(9, 1, 2, 8), (9, 7, 6, 9), (1, 7, 4, 6),
(0, 8, 4, 8), (0, 1, 6, 0), (7, 1, 4, 3),
(6, 3, 5, 9), (3, 3, 2, 8), (6, 3, 0, 8),
(3, 2, 7, 1)
]
df = spark.createDataFrame(data, ['A0', 'A1', 'A2', 'A3'])
cols = df.columns
# collect all columns into one array column, then aggregate over it
df.withColumn("values", array(*cols)) \
  .withColumn("any", expr("aggregate(values, 0D, (acc, x) -> acc + x)")) \
  .withColumn("mean", col("any") / size(col("values"))) \
  .withColumn("std", sqrt(expr("""aggregate(values, 0D,
                    (acc, x) -> acc + power(x - mean, 2),
                    acc -> acc / (size(values) - 1))"""
                 )
            )) \
  .drop("values") \
  .show(truncate=False)
#+---+---+---+---+----+----+------------------+
#|A0 |A1 |A2 |A3 |any |mean|std |
#+---+---+---+---+----+----+------------------+
#|9 |1 |2 |8 |20.0|5.0 |4.08248290463863 |
#|9 |7 |6 |9 |31.0|7.75|1.5 |
#|1 |7 |4 |6 |18.0|4.5 |2.6457513110645907|
#|0 |8 |4 |8 |20.0|5.0 |3.8297084310253524|
#|0 |1 |6 |0 |7.0 |1.75|2.8722813232690143|
#|7 |1 |4 |3 |15.0|3.75|2.5 |
#|6 |3 |5 |9 |23.0|5.75|2.5 |
#|3 |3 |2 |8 |16.0|4.0 |2.70801280154532 |
#|6 |3 |0 |8 |17.0|4.25|3.5 |
#|3 |2 |7 |1 |13.0|3.25|2.6299556396765835|
#+---+---+---+---+----+----+------------------+
Spark < 2.4:
You can use functools.reduce and operator.add to sum the columns; reduce(add, [col(c) for c in cols]) simply expands to col('A0') + col('A1') + col('A2') + col('A3'), an ordinary Column expression, so no UDF is involved. The logic remains the same as above:
from functools import reduce
from operator import add
df.withColumn("any", reduce(add, [col(c) for c in cols])) \
  .withColumn("mean", col("any") / len(cols)) \
  .withColumn("std", sqrt(reduce(add, [(col(c) - col("mean")) ** 2 for c in cols]) / (len(cols) - 1))) \
  .show(truncate=False)
The above answer is great; however, since the OP is using Python/PySpark, the logic above is not 100% clear if you don't understand Spark SQL expressions.
I would suggest using a Pandas UDAF; unlike plain UDFs, these are vectorized and very efficient. They were added to the Spark API to lower the learning curve needed to migrate from pandas to Spark. It also means your code is more maintainable if most of your colleagues, like mine, are more familiar with Pandas/Python.
These are the types of Pandas UDAFs available and their Pandas equivalents (format: SparkUdafType → df.pandasEquivalent(...), works on → returns):

SCALAR → df.transform(...), Mapping: Series → Series
GROUPED_MAP → df.apply(...), Group & Map: DataFrame → DataFrame
GROUPED_AGG → df.aggregate(...), Reduce: Series → Scalar
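As a minimal sketch of the SCALAR flavor (this example is not from the original answer; the helper name row_mean is my own, and it assumes the df created earlier): each input column arrives in the UDF as a pandas Series, so the row-wise mean can be written exactly as in pandas:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# SCALAR Pandas UDF (illustrative sketch): receives one pandas Series per
# input column and must return a Series of the same length -- here the
# row-wise mean, computed with plain pandas.
@pandas_udf('double', PandasUDFType.SCALAR)
def row_mean(*cols):
    return pd.concat(list(cols), axis=1).mean(axis=1)

df.withColumn('mean', row_mean(*[df[c] for c in df.columns])).show()

Because a whole batch of rows is handed to pandas at once, this avoids the per-row Python overhead of a plain UDF.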