PySpark SQL: CASE WHEN statement

Question

I have a DataFrame that looks like this:

>>> df_w_cluster.select('high_income', 'aml_cluster_id').show(10)
+-----------+--------------+
|high_income|aml_cluster_id|
+-----------+--------------+
|          0|             0|
|          0|             0|
|          0|             1|
|          0|             1|
|          0|             0|
|          0|             0|
|          0|             1|
|          1|             1|
|          1|             0|
|          1|             0|
+-----------+--------------+
only showing top 10 rows

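The high_income column is binary and holds either 0 or 1. aml_cluster_id holds values from 0 to 3. I want to create a new column whose value depends on the values of high_income and aml_cluster_id in that particular row. I am trying to achieve this using SQL.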

df_w_cluster.createTempView('event_rate_holder')

For this, I wrote the following query:

q = """select * , case 
 when "aml_cluster_id" = 0 and  "high_income" = 1 then "high_income_encoded" = 0.162 else 
 when "aml_cluster_id" = 0 and  "high_income" = 0 then "high_income_encoded" = 0.337 else 
 when "aml_cluster_id" = 1 and  "high_income" = 1 then "high_income_encoded" = 0.049 else 
 when "aml_cluster_id" = 1 and  "high_income" = 0 then "high_income_encoded" = 0.402 else 
 when "aml_cluster_id" = 2 and  "high_income" = 1 then "high_income_encoded" = 0.005 else 
 when "aml_cluster_id" = 2 and  "high_income" = 0 then "high_income_encoded" = 0.0 else 
 when "aml_cluster_id" = 3 and  "high_income" = 1 then "high_income_encoded" = 0.023 else 
 when "aml_cluster_id" = 3 and  "high_income" = 0 then "high_income_encoded" = 0.022 else 
 from event_rate_holder"""

When I run it with Spark using

spark.sql(q)

I get the following error:

mismatched input 'aml_cluster_id' expecting <EOF>(line 1, pos 22)

Any idea how to get past this?

Edit:

I edited the query based on the suggestion in the comments below:

q = """select * , case 
when aml_cluster_id = 0 and  high_income = 1 then high_income_encoded = 0.162 else 
when aml_cluster_id = 0 and  high_income = 0 then high_income_encoded = 0.337 else 
when aml_cluster_id = 1 and  high_income = 1 then high_income_encoded = 0.049 else 
when aml_cluster_id = 1 and  high_income = 0 then high_income_encoded = 0.402 else 
when aml_cluster_id = 2 and  high_income = 1 then high_income_encoded = 0.005 else 
when aml_cluster_id = 2 and  high_income = 0 then high_income_encoded = 0.0 else 
when aml_cluster_id = 3 and  high_income = 1 then high_income_encoded = 0.023 else 
when aml_cluster_id = 3 and  high_income = 0 then high_income_encoded = 0.022 end
from event_rate_holder"""

But I still get an error:

== SQL ==
select * , case 
when aml_cluster_id = 0 and  high_income = 1 then high_income_encoded = 0.162 else 
-----^^^

followed by

pyspark.sql.utils.ParseException: "\nmismatched input 'aml_cluster_id' expecting <EOF>(line 2, pos 5)\n\n== SQL ==\nselect * ,

Answer 1

The correct syntax for the CASE variant you are using is

CASE  
   WHEN e1 THEN e2 [ ...n ]   
   [ ELSE else_result_expression ]   
END

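So:

THEN should be followed by an expression. There is no place for name = something there.
ELSE is allowed once per CASE, not once per WHEN.
Your original code is missing the closing END.
Finally, the columns should not be wrapped in double quotes; by default Spark SQL reads "aml_cluster_id" as a string literal, not as a column reference.

You probably meant: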

CASE 
  WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
  WHEN aml_cluster_id = 0 and  high_income = 0 THEN  0.337
  ...
END AS high_income_encoded
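
To make the elided branches concrete, here is a minimal sketch that assembles the full query from the eight values given in the question. The ENCODED dictionary and the build_query helper are names made up for this illustration; only the column names, the view name, and the numbers come from the question.

```python
# Encoded values copied from the question, keyed by (aml_cluster_id, high_income).
# ENCODED and build_query are hypothetical names used only for this sketch.
ENCODED = {
    (0, 1): 0.162,
    (0, 0): 0.337,
    (1, 1): 0.049,
    (1, 0): 0.402,
    (2, 1): 0.005,
    (2, 0): 0.0,
    (3, 1): 0.023,
    (3, 0): 0.022,
}


def build_query(view_name="event_rate_holder"):
    """Return a SELECT with one flat CASE: many WHEN branches, a single END."""
    branches = "\n".join(
        f"  WHEN aml_cluster_id = {cluster} AND high_income = {income} THEN {value}"
        for (cluster, income), value in ENCODED.items()
    )
    return (
        "SELECT *,\n"
        "CASE\n"
        f"{branches}\n"
        "END AS high_income_encoded\n"
        f"FROM {view_name}"
    )


q = build_query()
print(q)
# With the temp view from the question registered, the query would be run as:
# df_encoded = spark.sql(q)
```

Because there is no ELSE branch, any row that matches none of the eight conditions gets NULL in high_income_encoded.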

Answer 2

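If you keep an ELSE after each condition, every ELSE has to open its own nested CASE, and each CASE needs its own END. If you quote the column names, use backticks (`aml_cluster_id` and `high_income`) rather than double quotes, and put the `high_income_encoded` column name at the end as an alias. So a correct query is as follows: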

q = """select * ,
case when `aml_cluster_id` = 0 and  `high_income` = 1 then 0.162 else
  case when `aml_cluster_id` = 0 and  `high_income` = 0 then 0.337 else
    case when `aml_cluster_id` = 1 and  `high_income` = 1 then 0.049 else
      case when `aml_cluster_id` = 1 and  `high_income` = 0 then 0.402 else
        case when `aml_cluster_id` = 2 and  `high_income` = 1 then 0.005 else
          case when `aml_cluster_id` = 2 and  `high_income` = 0 then 0.0 else
            case when `aml_cluster_id` = 3 and  `high_income` = 1 then 0.023 else
              case when `aml_cluster_id` = 3 and  `high_income` = 0 then 0.022
              end
            end
          end
        end
      end
    end
  end
end as `high_income_encoded`
from event_rate_holder"""
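
As an alternative to writing the SQL string by hand (this is not what either answer proposes), the same mapping can probably be expressed with the DataFrame API by chaining when calls. The sketch below assumes pyspark is installed and that df has the two columns from the question; RULES and add_high_income_encoded are hypothetical names.

```python
# (aml_cluster_id, high_income, encoded value) triples copied from the question.
# RULES and add_high_income_encoded are hypothetical names for this sketch.
RULES = [
    (0, 1, 0.162),
    (0, 0, 0.337),
    (1, 1, 0.049),
    (1, 0, 0.402),
    (2, 1, 0.005),
    (2, 0, 0.0),
    (3, 1, 0.023),
    (3, 0, 0.022),
]


def add_high_income_encoded(df):
    """Add high_income_encoded using chained when() calls instead of a SQL CASE."""
    # Imported lazily so this sketch can be loaded without a Spark install.
    from pyspark.sql import functions as F

    case_expr = None
    for cluster, income, value in RULES:
        cond = (F.col("aml_cluster_id") == cluster) & (F.col("high_income") == income)
        # The first rule starts the chain; later rules extend it.
        case_expr = F.when(cond, value) if case_expr is None else case_expr.when(cond, value)
    # No otherwise(): unmatched rows get null, like a CASE without ELSE.
    return df.withColumn("high_income_encoded", case_expr)


# Usage, assuming the DataFrame from the question:
# df_encoded = add_high_income_encoded(df_w_cluster)
```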
