
Convert SQL statement to PySpark

I have written a SQL query that I want to convert into PySpark code. It works except for one thing: how can I best translate the sum function?

SELECT r_date, abc_code, sum(case when kpi_id=1234 then value else null end) as XXX, 
       sum(case when kpi_id=5678 then value else null end) as YYY from rate 
WHERE abc_code = 'AS55' AND org_id = '12-3' 
GROUP BY r_date, abc_code 
ORDER BY r_date DESC, abc_code;
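The query is a conditional aggregation (a "manual pivot"): one output row per (r_date, abc_code), where XXX sums `value` only over rows with kpi_id 1234 and YYY only over rows with kpi_id 5678. As a sanity check, the same logic in plain Python (the sample rows are made up; only the shape of the computation matters):

```python
from collections import defaultdict

# Hypothetical input rows: (r_date, abc_code, org_id, kpi_id, value)
rows = [
    ("2020-12-02", "AS55", "12-3", 1234, 1.0),
    ("2020-12-02", "AS55", "12-3", 5678, 1.0),
    ("2020-12-02", "AS55", "12-3", 5678, 1.0),
    ("2020-12-02", "AS55", "99-9", 1234, 5.0),  # dropped by the WHERE clause
]

def conditional_sums(rows):
    acc = defaultdict(lambda: {"XXX": None, "YYY": None})
    for r_date, abc_code, org_id, kpi_id, value in rows:
        if abc_code != "AS55" or org_id != "12-3":
            continue  # WHERE abc_code = 'AS55' AND org_id = '12-3'
        key = (r_date, abc_code)  # GROUP BY r_date, abc_code
        if kpi_id == 1234:
            acc[key]["XXX"] = (acc[key]["XXX"] or 0.0) + value
        elif kpi_id == 5678:
            acc[key]["YYY"] = (acc[key]["YYY"] or 0.0) + value
    return dict(acc)

result = conditional_sums(rows)
print(result)  # one row per (r_date, abc_code), sums split by kpi_id
```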

PySpark code:

from pyspark.sql import functions as F
from pyspark.sql.functions import expr

rate_df = rate_df.select(
    'org_id',
    'abc_code',
    'value',
    'r_date',
    # use null (not ' ') in the else branch, matching the SQL
    expr("case when kpi_id = '1234' then value else null end").alias('XXX'),
    expr("case when kpi_id = '5678' then value else null end").alias('YYY')
    ) \
    .where((F.col('abc_code') == 'AS55') &
           (F.col('org_id') == '12-3'))

How can I use the sum function in PySpark so that the values end up in one row?

This is the code I have implemented:

from pyspark.sql import functions as F

rate_df = rate_df.select(
    F.col('creation_date').alias('r_date'),
    'organisation_id',
    'b_employee',
    'abc_code',
    'kpi_date',
    'kpi_id',
    'value'
    )
nh_rate_df = rate_df.where(
        (F.col('abc_code') == 'AS55') &
        (F.col('organisation_id') == '12-3')
     ).groupBy(
        'organisation_id', 'r_date', 'b_employee', 'kpi_date', 'abc_code'
    ).agg(
        F.sum(F.when(F.col('kpi_id') == 1234, F.col('value'))).alias('XXX'),
        F.sum(F.when(F.col('kpi_id') == 5678, F.col('value'))).alias('YYY')
    ).orderBy(
        F.desc('kpi_date'), F.col('abc_code')
    )
nh_rate_df = nh_rate_df.join(s_function_df, 'abc_code', 'left')
nh_rate_df = nh_rate_df.join(hst_df, 'organisation_id', 'left')
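As an aside, a left join multiplies rows whenever the right-hand table has more than one row per join key, which is worth ruling out here. A plain-Python sketch of that effect (table contents are hypothetical):

```python
# One left row, two right rows sharing the same key: the join emits two rows.
left = [{"abc_code": "AS55", "XXX": 1.0}]
right = [
    {"abc_code": "AS55", "s_name": "f1"},
    {"abc_code": "AS55", "s_name": "f2"},
]

joined = []
for l in left:
    matches = [r for r in right if r["abc_code"] == l["abc_code"]]
    if matches:
        for r in matches:
            joined.append({**l, **r})
    else:
        # left-join semantics: keep unmatched left rows, null the right columns
        joined.append({**l, "s_name": None})

print(len(joined))  # two output rows from a single input row
```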

I get a result that does not sum up all of the matching rows:

r_date      kpi_date             organisation_id  abc_code  b_empl  XXX     YYY
2020-12-02  2020-11-01 00:00:00  12-3             AS55      A       1.0000  null
2020-12-02  2020-11-01 00:00:00  12-3             AS55      null    null    1.0000
2020-11-02  2020-10-01 00:00:00  12-3             AS55      A       null    1.0000
2020-11-02  2020-10-01 00:00:00  12-3             AS55      null    1.0000  null
2020-10-02  2020-09-01 00:00:00  12-3             AS55      A       2.0000  null
2020-10-02  2020-09-01 00:00:00  12-3             AS55      null    null    1.0000
2020-09-22  2020-08-01 00:00:00  12-3             AS55      null    1.0000  1.0000
2020-09-22  2020-08-01 00:00:00  12-3             AS55      A       null    null

although the dates are the same. Could it be due to the join?
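One likely cause, sketched here as a guess rather than a confirmed diagnosis: the PySpark version groups by b_employee (and kpi_date) in addition to the columns the SQL groups by. A grouping key that is null on some rows splits what the SQL treats as a single group, so XXX and YYY land on separate output rows. In plain Python, with made-up rows:

```python
from collections import Counter

# Hypothetical rows: (r_date, b_employee, kpi_id); b_employee is None (null)
# on the rows carrying kpi 5678, as in the output above.
rows = [
    ("2020-12-02", "A",  1234),
    ("2020-12-02", None, 5678),
    ("2020-12-02", None, 5678),
]

# Grouping only by r_date (like the SQL) yields one group ...
sql_groups = Counter(r[0] for r in rows)
# ... while also grouping by b_employee (like the PySpark code) yields two,
# so the two conditional sums cannot end up on the same row.
spark_groups = Counter((r[0], r[1]) for r in rows)

print(len(sql_groups), len(spark_groups))
```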

This is the result I get when I run the SQL code:

r_date      kpi_date             organisation_id  abc_code  b_empl  XXX     YYY
2020-12-02  2020-11-01 00:00:00  12-3             AS55      A       1.0000  2.0000
2020-11-02  2020-10-01 00:00:00  12-3             AS55      A       1.0000  2.0000
2020-10-02  2020-09-01 00:00:00  12-3             AS55      A       2.0000  1.0000
2020-09-22  2020-08-01 00:00:00  12-3             AS55      null    2.0000  1.0000
