Selecting rows with maximum value, combining WHERE, MAX, and CAST, in spark.sql
Spark.sql Filter rows by MAX
Below is part of the source file; you can imagine the real one is much larger:
date,code1,postcode,cityname,total
2020-03-27,2011,X700,Curepipe,44
2020-03-29,2011,X700,Curepipe,44
2020-03-26,2011,X700,Curepipe,22
2020-03-27,2035,X920,vacoas,3
2020-03-25,2011,X920,vacoas,1
2020-03-24,2122,X760,souillac,22
2020-03-23,2122,X760,souillac,11
2020-03-22,2257,X760,souillac,10
2020-03-27,2480,X510,rosehill,21
2020-03-22,2035,X510,rosehill,7
2020-03-20,2035,X510,rosehill,3
After the following code:
#Load data
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("source").getOrCreate()
dfcases = spark.read.format("csv").option("header", "true").load("sourcefile.csv")
dfcases.createOrReplaceTempView("tablecases")
spark.sql(XXXXXXXXXXXXX).show()  # SQL query to insert
I would like to get this result:
Curepipe,X700,2020-03-27,44
Curepipe,X700,2020-03-29,44
souillac,X760,2020-03-24,22
rosehill,X510,2020-03-27,21
vacoas,X920,2020-03-27,3
The aim is:
Thanks!
The following query produces the output you want:
SELECT cityname, postcode, date, CAST(total AS INT) AS total
FROM tablecases t
WHERE CAST(total AS INT) = (SELECT MAX(CAST(total AS INT)) FROM tablecases WHERE cityname = t.cityname)
ORDER BY total DESC, date, cityname
Demo on db<>fiddle
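A note on why CAST matters here: the question loads the CSV without a schema and without inferSchema, so every column, including total, is read as a string, and MAX on a string column compares text rather than numbers. A minimal plain-Python sketch of the difference, using rosehill's three totals from the sample (the variable names are only illustrative):

```python
# rosehill's totals as they arrive from a CSV read without a schema: strings
totals = ["21", "7", "3"]

# Comparing strings is lexicographic: "7" sorts after "21" because "7" > "2"
string_max = max(totals)

# Converting to integers first gives the numeric maximum
numeric_max = max(int(t) for t in totals)

print(string_max)   # 7
print(numeric_max)  # 21
```

Without the cast, rosehill's best row would appear to be the one with total 7 instead of 21.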
You can get the result by using a SQL window function in your query, as follows:
SELECT
cityname,
postcode,
date,
total
FROM
(SELECT
cityname,
postcode,
date,
CAST(total AS INT) AS total,
MAX(CAST(total AS INT)) OVER (PARTITION BY cityname ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_total
FROM tablecases)
WHERE max_total = total
ORDER BY max_total DESC, date, cityname
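As a cross-check of what a "rows with the per-city maximum" query is meant to return, the same logic can be sketched in plain Python over the sample rows from the question. This is only an illustration and not Spark code; `rows` and `rows_with_city_max` are hypothetical names, and totals are converted to integers before comparing.

```python
# Sample rows from the question: (date, code1, postcode, cityname, total)
rows = [
    ("2020-03-27", "2011", "X700", "Curepipe", "44"),
    ("2020-03-29", "2011", "X700", "Curepipe", "44"),
    ("2020-03-26", "2011", "X700", "Curepipe", "22"),
    ("2020-03-27", "2035", "X920", "vacoas", "3"),
    ("2020-03-25", "2011", "X920", "vacoas", "1"),
    ("2020-03-24", "2122", "X760", "souillac", "22"),
    ("2020-03-23", "2122", "X760", "souillac", "11"),
    ("2020-03-22", "2257", "X760", "souillac", "10"),
    ("2020-03-27", "2480", "X510", "rosehill", "21"),
    ("2020-03-22", "2035", "X510", "rosehill", "7"),
    ("2020-03-20", "2035", "X510", "rosehill", "3"),
]


def rows_with_city_max(rows):
    """Keep the rows whose numeric total equals the maximum total of their city."""
    # First pass: numeric maximum per city (the role of MAX ... PARTITION BY cityname)
    city_max = {}
    for _date, _code, _postcode, city, total in rows:
        value = int(total)
        city_max[city] = max(city_max.get(city, value), value)

    # Second pass: keep only rows that reach their city's maximum
    kept = [
        (city, postcode, date, int(total))
        for date, _code, postcode, city, total in rows
        if int(total) == city_max[city]
    ]

    # Same ordering as the SQL: total descending, then date, then cityname
    return sorted(kept, key=lambda r: (-r[3], r[2], r[0]))


result = rows_with_city_max(rows)
for row in result:
    print(",".join(str(field) for field in row))
```

On the sample data this prints the five rows listed in the question's expected result, with both Curepipe rows kept because they tie at 44.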