Convert SQL statement to PySpark
I have written a SQL query that I want to convert into PySpark code. It works except for one thing: I am not sure how best to apply the sum function.
SELECT r_date, abc_code,
       SUM(CASE WHEN kpi_id = 1234 THEN value ELSE NULL END) AS XXX,
       SUM(CASE WHEN kpi_id = 5678 THEN value ELSE NULL END) AS YYY
FROM rate
WHERE abc_code = 'AS55' AND org_id = '12-3'
GROUP BY r_date, abc_code
ORDER BY r_date DESC, abc_code;
PySpark code:
from pyspark.sql import functions as F
from pyspark.sql.functions import expr

rate_df = rate_df.select(
    'org_id',
    'abc_code',
    'value',
    'r_date',
    expr("case when kpi_id = 1234 then value else null end").alias('XXX'),
    expr("case when kpi_id = 5678 then value else null end").alias('YYY')
).where(
    (F.col('abc_code') == 'AS55') &
    (F.col('org_id') == '12-3')
)
How can I apply the sum function in PySpark so that the values end up in one row?
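In Spark, as in SQL, these conditional sums are only defined per group, so they belong in a groupBy/agg rather than a plain select. As a plain-Python sketch (not Spark; the sample rows are hypothetical) of what the two sums compute per (r_date, abc_code) group:

```python
from collections import defaultdict

# Hypothetical sample rows shaped like the rate table in the query above.
rows = [
    {"r_date": "2020-12-02", "abc_code": "AS55", "kpi_id": 1234, "value": 1.0},
    {"r_date": "2020-12-02", "abc_code": "AS55", "kpi_id": 5678, "value": 2.0},
    {"r_date": "2020-11-02", "abc_code": "AS55", "kpi_id": 1234, "value": 1.0},
]

# One accumulator per GROUP BY key; note SQL's SUM over all-null input yields
# null, while this sketch yields 0.0 for a kpi with no matching rows.
sums = defaultdict(lambda: {"XXX": 0.0, "YYY": 0.0})
for r in rows:
    key = (r["r_date"], r["abc_code"])   # GROUP BY r_date, abc_code
    if r["kpi_id"] == 1234:              # CASE WHEN kpi_id = 1234 THEN value
        sums[key]["XXX"] += r["value"]
    elif r["kpi_id"] == 5678:            # CASE WHEN kpi_id = 5678 THEN value
        sums[key]["YYY"] += r["value"]

print(sums[("2020-12-02", "AS55")])  # {'XXX': 1.0, 'YYY': 2.0}
```

Both conditional sums land in the same output row because they share one group key, which is exactly what the groupBy/agg form gives you in Spark.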
This is the code I have implemented:
rate_df = rate_df.select(
    F.col('creation_date').alias('r_date'),
    'organisation_id',
    'b_employee',
    'abc_code',
    'kpi_date',
    'kpi_id',
    'value'
)
nh_rate_df = rate_df.where(
    (F.col('abc_code') == 'AS55') &
    (F.col('organisation_id') == '12-3')
).groupBy(
    'organisation_id', 'r_date', 'b_employee', 'kpi_date', 'abc_code'
).agg(
    F.sum(F.when(F.col('kpi_id') == 1234, F.col('value'))).alias('XXX'),
    F.sum(F.when(F.col('kpi_id') == 5678, F.col('value'))).alias('YYY')
).orderBy(
    F.desc('kpi_date'), F.col('abc_code')
)

nh_rate_df = nh_rate_df.join(s_function_df, 'abc_code', 'left')
nh_rate_df = nh_rate_df.join(hst_df, 'organisation_id', 'left')
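One thing to keep in mind with the two left joins: each left row is emitted once per matching right row, so if s_function_df or hst_df carries several rows per key, the aggregated rows get duplicated. A plain-Python sketch of that multiplicity (the data is hypothetical):

```python
# One aggregated result row on the left.
left = [{"abc_code": "AS55", "XXX": 1.0}]
# A hypothetical right table with two rows for the same key.
right = [
    {"abc_code": "AS55", "s_function": "f1"},
    {"abc_code": "AS55", "s_function": "f2"},
]

joined = []
for l in left:
    matches = [r for r in right if r["abc_code"] == l["abc_code"]]
    if matches:
        # Inner part of the left join: one output row per match.
        joined.extend({**l, **r} for r in matches)
    else:
        # Left-join part: unmatched left rows survive with a null right side.
        joined.append({**l, "s_function": None})

print(len(joined))  # 2 -- one aggregated row became two
```

If either join key is not unique on the right side, deduplicating the right DataFrame before joining (or joining after a final re-aggregation) keeps the sums in one row per key.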
The result I get does not sum up all of the matching rows:
r_date | kpi_date | organisation_id | abc_code | b_empl | XXX | YYY |
---|---|---|---|---|---|---|
2020-12-02 | 2020-11-01 00:00:00 | 12-3 | AS55 | A | 1.0000 | null |
2020-12-02 | 2020-11-01 00:00:00 | 12-3 | AS55 | null | null | 1.0000 |
2020-11-02 | 2020-10-01 00:00:00 | 12-3 | AS55 | A | null | 1.0000 |
2020-11-02 | 2020-10-01 00:00:00 | 12-3 | AS55 | null | 1.0000 | null |
2020-10-02 | 2020-09-01 00:00:00 | 12-3 | AS55 | A | 2.0000 | null |
2020-10-02 | 2020-09-01 00:00:00 | 12-3 | AS55 | null | null | 1.0000 |
2020-09-22 | 2020-08-01 00:00:00 | 12-3 | AS55 | null | 1.0000 | 1.0000 |
2020-09-22 | 2020-08-01 00:00:00 | 12-3 | AS55 | A | null | null |
even though the dates are the same. Could it be due to the join?
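The join is one candidate, but note that the split rows above differ only in b_empl (A vs. null), and grouping by a nullable key such as b_employee places null and non-null rows in separate groups. A plain-Python sketch of that splitting (the sample rows are hypothetical):

```python
from collections import defaultdict

# Two rows for the same r_date, differing only in the nullable b_employee.
rows = [
    {"r_date": "2020-12-02", "b_employee": "A",  "value": 1.0},
    {"r_date": "2020-12-02", "b_employee": None, "value": 2.0},
]

groups = defaultdict(float)
for r in rows:
    # Grouping key includes the nullable column, as in the groupBy above.
    groups[(r["r_date"], r["b_employee"])] += r["value"]

print(len(groups))  # 2 -- one r_date, but two groups because b_employee differs
```

The SQL query groups only by r_date and abc_code, so matching the PySpark groupBy keys to the SQL's GROUP BY (dropping b_employee and kpi_date, or coalescing nulls first) would collapse these rows the same way the SQL does.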
This is the result I get when I run the SQL query:
r_date | kpi_date | organisation_id | abc_code | b_empl | XXX | YYY |
---|---|---|---|---|---|---|
2020-12-02 | 2020-11-01 00:00:00 | 12-3 | AS55 | A | 1.0000 | 2.0000 |
2020-11-02 | 2020-10-01 00:00:00 | 12-3 | AS55 | A | 1.0000 | 2.0000 |
2020-10-02 | 2020-09-01 00:00:00 | 12-3 | AS55 | A | 2.0000 | 1.0000 |
2020-09-22 | 2020-08-01 00:00:00 | 12-3 | AS55 | null | 2.0000 | 1.0000 |