I have the following query in PySpark:

from pyspark.sql.functions import count, col, lit

df = (
    spark.sql("""
        select id, track_id, data_source
        from db.races
        where dt_date = 20201010
    """)
    .groupBy("id", "track_id", "data_source")
    .agg(count("*").alias("num_races"))
    .withColumn("last_num_id", col("id").substr(-1, 1))
    .withColumn("last_num_track_id", col("track_id").substr(-1, 1))
    .withColumn("status_date", lit(previous_date))
)
And I want to convert it to an Impala query.
My attempt so far:
select id, track_id, data_source
from db.races
group by id, track_id, data_source
...
I can follow the query up to the group by part, but after that I don't see exactly how these PySpark functions translate to SQL.
Not familiar with Impala, but here's my attempt at writing an SQL query:
select
    t.*,
    substr(t.id, -1, 1) as last_num_id,
    substr(t.track_id, -1, 1) as last_num_track_id,
    '(put the previous_date here)' as status_date
from (
    select id, track_id, data_source, count(*) as num_races
    from db.races
    where dt_date = 20201010
    group by id, track_id, data_source
) as t
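To see why this SQL matches the PySpark pipeline, here is a minimal plain-Python sketch of what the query computes, using a few hypothetical sample rows (the row values and the `previous_date` value are made up for illustration). The key point is that substr(x, -1, 1) takes the last character of the string, and the grouped count(*) becomes num_races:

```python
from collections import Counter

# Hypothetical sample of (id, track_id, data_source) rows for dt_date = 20201010.
rows = [
    ("101", "T55", "web"),
    ("101", "T55", "web"),
    ("202", "T13", "api"),
]

# Group by (id, track_id, data_source) and count, like GROUP BY ... count(*).
counts = Counter(rows)

previous_date = "20201009"  # placeholder for the status_date literal

result = [
    {
        "id": id_,
        "track_id": track_id,
        "data_source": data_source,
        "num_races": n,
        "last_num_id": id_[-1],             # substr(id, -1, 1)
        "last_num_track_id": track_id[-1],  # substr(track_id, -1, 1)
        "status_date": previous_date,       # lit(previous_date)
    }
    for (id_, track_id, data_source), n in counts.items()
]
```

For these sample rows, the ("101", "T55", "web") group yields num_races = 2, last_num_id = "1", and last_num_track_id = "5", which is exactly what both the PySpark pipeline and the Impala-style SQL produce.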