
How do I do a sum over partition in Spark SQL?

Spark SQL is just different enough from the engines I normally use that it breaks all my code.

This statement:

case when sum(flag = 'Y') over (partition by id) > 0
     then 'Y' else 'N' end as flag

is supposed to return 'Y' if any flag value for a given id is 'Y', but it doesn't work because the sum function in Spark can only take numeric types. Is there a workaround?

Your code is not valid standard SQL; it happens to work in MySQL but not in most databases.
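
For illustration, the rewrites below can be tested against a small temporary view. The table and column names here are invented for this answer, not taken from the question:

-- hypothetical sample data: only id 1 has a 'Y' row
CREATE OR REPLACE TEMPORARY VIEW t AS
SELECT * FROM VALUES (1, 'Y'), (1, 'N'), (2, 'N'), (2, 'N') AS t(id, flag);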

The standard SQL approach, using a CASE expression, will work:

(case when sum(case when flag = 'Y' then 1 else 0 end) over (partition by id) > 0
     then 'Y' else 'N'
 end) as flag
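
Run against the hypothetical view above, a complete query would look like this (the alias any_y is assumed); rows with id 1 get 'Y' and rows with id 2 get 'N':

SELECT id, flag,
       (case when sum(case when flag = 'Y' then 1 else 0 end) over (partition by id) > 0
             then 'Y' else 'N'
        end) as any_y
FROM t;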

Or, assuming that flag only takes on the values 'Y' and 'N', you can simplify the logic to:

max(flag) over (partition by id) as flag
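
This works because 'N' sorts before 'Y', so max(flag) over the partition is 'Y' exactly when at least one row in the partition has flag = 'Y'. Against the hypothetical view above:

SELECT id, flag, max(flag) over (partition by id) as any_y FROM t;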

You can also cast the Boolean flag = 'Y' to an integer in order to sum it:

case when sum(int(flag = 'Y')) over (partition by id) > 0
     then 'Y' else 'N' end as flag
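
If you prefer the explicit form, the same cast can be spelled with cast(... as int), which Spark SQL also accepts (query shape assumed, reusing the hypothetical view):

SELECT id, flag,
       case when sum(cast(flag = 'Y' as int)) over (partition by id) > 0
            then 'Y' else 'N' end as any_y
FROM t;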
