How to filter rows for a specific aggregate with Spark SQL?
Normally, all rows in a group are passed to an aggregate function. I would like to filter rows using a condition so that only some rows within a group are passed to a given aggregate function. Such an operation is possible with PostgreSQL. I would like to do the same thing with a Spark SQL DataFrame (Spark 2.0.0).
The code could probably look like this:
val df = ... // some data frame
df.groupBy("A").agg(
max("B").where("B").less(10), // there is no such method as `where` :(
max("C").where("C").less(5)
)
So for a data frame like this:
|  A|  B|  C|
|  1| 14|  4|
|  1|  9|  3|
|  2|  5|  6|
The result would be:
|  A|max(B)|max(C)|
|  1|     9|     4|
|  2|     5|  null|
Is this possible with Spark SQL?
Note that, in general, any aggregate function other than max could be used, and there could be multiple aggregates over the same column with arbitrary filtering conditions.
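// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import org.apache.spark.sql.functions.{max, when}
import spark.implicits._ // enables toDF and the $"col" syntax

val df = Seq(
  (1, 14, 4),
  (1, 9, 3),
  (2, 5, 6)
).toDF("a", "b", "c")

// when(...) without otherwise(...) yields null for non-matching rows,
// and max ignores nulls, so only rows passing the condition are aggregated.
val aggregatedDF = df.groupBy("a")
  .agg(
    max(when($"b" < 10, $"b")).as("MaxB"),
    max(when($"c" < 5, $"c")).as("MaxC")
  )

aggregatedDF.show()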
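To make the semantics of the answer above concrete, here is a minimal plain-Python model of what a conditional expression inside an aggregate does. It is only a sketch and does not use Spark: `conditional_agg` and `specs` are hypothetical names introduced for illustration. Values that fail the condition become None, None values are dropped, and the aggregate runs on what is left. It uses the data frame from the question.

```python
from collections import defaultdict


def conditional_agg(rows, key, specs):
    """Group rows by `key` and apply each (column, predicate, aggregate) spec.

    Values failing the predicate are masked to None and then skipped,
    which is how null-ignoring aggregates such as max behave.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for group_key, members in groups.items():
        out = []
        for column, predicate, aggregate in specs:
            masked = [r[column] if predicate(r[column]) else None for r in members]
            kept = [v for v in masked if v is not None]
            # A group with no surviving values yields None here; this matches
            # max/min/sum/avg in SQL, whereas count would give 0 instead.
            out.append(aggregate(kept) if kept else None)
        result[group_key] = tuple(out)
    return result


rows = [
    {"a": 1, "b": 14, "c": 4},
    {"a": 1, "b": 9, "c": 3},
    {"a": 2, "b": 5, "c": 6},
]
specs = [
    ("b", lambda v: v < 10, max),  # max of B where B < 10
    ("c", lambda v: v < 5, max),   # max of C where C < 5
]
result = conditional_agg(rows, "a", specs)
print(result)  # {1: (9, 4), 2: (5, None)}
```

Because each spec carries its own predicate and aggregate, the same column can appear several times with different conditions, which is the general case the question asks about.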
>>> from pyspark.sql import Row
>>> df = sc.parallelize([[1,14,1],[1,9,3],[2,5,6]]).map(lambda t: Row(a=int(t[0]),b=int(t[1]),c=int(t[2]))).toDF()
>>> df.registerTempTable('t')
>>> res = sqlContext.sql("select a,max(case when b<10 then b else null end) mb,max(case when c<5 then c else null end) mc from t group by a")
>>> res.show()
+---+---+----+
| a| mb| mc|
+---+---+----+
| 1| 9| 3|
| 2| 5|null|
+---+---+----+
You can use SQL (I believe you do the same thing in Postgres?)
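If you want to check the logic of the CASE WHEN query without a Spark cluster, a rough sketch is to run the same SQL against Python's built-in sqlite3. This is SQLite standing in for Spark SQL, not Spark itself. The extra columns `sum_b_gt_5` and `n_b_gt_5` are hypothetical additions to show other aggregates and a second condition on the same column; `ELSE NULL` is omitted because a CASE without ELSE already yields NULL.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Spark temp table `t`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 14, 1), (1, 9, 3), (2, 5, 6)],  # same rows as in the answer above
)

query = """
SELECT a,
       MAX(CASE WHEN b < 10 THEN b END) AS mb,
       MAX(CASE WHEN c < 5 THEN c END)  AS mc,
       SUM(CASE WHEN b > 5 THEN b END)  AS sum_b_gt_5,
       COUNT(CASE WHEN b > 5 THEN 1 END) AS n_b_gt_5
FROM t
GROUP BY a
ORDER BY a
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (1, 9, 3, 23, 2)
# (2, 5, None, None, 0)
```

Note that MAX and SUM return NULL when no row in the group satisfies the condition, while COUNT returns 0.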
df.groupBy("name","age","id").agg(functions.max("age").$less(20),functions.max("id").$less("30")).show();
Sample Data:
name age id
abc 23 1001
cde 24 1002
efg 22 1003
ghi 21 1004
ijk 20 1005
klm 19 1006
mno 18 1007
pqr 18 1008
rst 26 1009
tuv 27 1010
pqr 18 1012
rst 28 1013
tuv 29 1011
abc 24 1015
Output:
+----+---+----+---------------+--------------+
|name|age| id|(max(age) < 20)|(max(id) < 30)|
+----+---+----+---------------+--------------+
| rst| 26|1009| false| true|
| abc| 23|1001| false| true|
| ijk| 20|1005| false| true|
| tuv| 29|1011| false| true|
| efg| 22|1003| false| true|
| mno| 18|1007| true| true|
| tuv| 27|1010| false| true|
| klm| 19|1006| true| true|
| cde| 24|1002| false| true|
| pqr| 18|1008| true| true|
| abc| 24|1015| false| true|
| ghi| 21|1004| false| true|
| rst| 28|1013| false| true|
| pqr| 18|1012| true| true|
+----+---+----+---------------+--------------+