Pyspark SQL: using case when statements
I have a dataframe that looks like this:
>>> df_w_cluster.select('high_income', 'aml_cluster_id').show(10)
+-----------+--------------+
|high_income|aml_cluster_id|
+-----------+--------------+
| 0| 0|
| 0| 0|
| 0| 1|
| 0| 1|
| 0| 0|
| 0| 0|
| 0| 1|
| 1| 1|
| 1| 0|
| 1| 0|
+-----------+--------------+
only showing top 10 rows
The high_income column is binary, holding 0 or 1. aml_cluster_id holds values from 0 to 3. I want to create a new column whose value depends on the values of high_income and aml_cluster_id in that particular row. I am trying to achieve this using SQL.
df_w_cluster.createTempView('event_rate_holder')
To do this, I registered the dataframe as a temporary view and wrote the following query:
q = """select * , case
when "aml_cluster_id" = 0 and "high_income" = 1 then "high_income_encoded" = 0.162 else
when "aml_cluster_id" = 0 and "high_income" = 0 then "high_income_encoded" = 0.337 else
when "aml_cluster_id" = 1 and "high_income" = 1 then "high_income_encoded" = 0.049 else
when "aml_cluster_id" = 1 and "high_income" = 0 then "high_income_encoded" = 0.402 else
when "aml_cluster_id" = 2 and "high_income" = 1 then "high_income_encoded" = 0.005 else
when "aml_cluster_id" = 2 and "high_income" = 0 then "high_income_encoded" = 0.0 else
when "aml_cluster_id" = 3 and "high_income" = 1 then "high_income_encoded" = 0.023 else
when "aml_cluster_id" = 3 and "high_income" = 0 then "high_income_encoded" = 0.022 else
from event_rate_holder"""
When I run this with Spark using spark.sql(q), I get the following error:
mismatched input 'aml_cluster_id' expecting <EOF>(line 1, pos 22)
Any idea how to overcome this?
EDIT:
I edited the query based on the suggestions in the comments below:
q = """select * , case
when aml_cluster_id = 0 and high_income = 1 then high_income_encoded = 0.162 else
when aml_cluster_id = 0 and high_income = 0 then high_income_encoded = 0.337 else
when aml_cluster_id = 1 and high_income = 1 then high_income_encoded = 0.049 else
when aml_cluster_id = 1 and high_income = 0 then high_income_encoded = 0.402 else
when aml_cluster_id = 2 and high_income = 1 then high_income_encoded = 0.005 else
when aml_cluster_id = 2 and high_income = 0 then high_income_encoded = 0.0 else
when aml_cluster_id = 3 and high_income = 1 then high_income_encoded = 0.023 else
when aml_cluster_id = 3 and high_income = 0 then high_income_encoded = 0.022 end
from event_rate_holder"""
but I still get an error:
== SQL ==
select * , case
when aml_cluster_id = 0 and high_income = 1 then high_income_encoded = 0.162 else
-----^^^
followed by
pyspark.sql.utils.ParseException: "\nmismatched input 'aml_cluster_id' expecting <EOF>(line 2, pos 5)\n\n== SQL ==\nselect * ,
The correct syntax for the CASE variant you use is
CASE
WHEN e1 THEN e2 [ ...n ]
[ ELSE else_result_expression ]
END
so there is no place for a `name = something` assignment in there. CASE allows ELSE only once, not after each WHEN, and the expression must be terminated with END. You probably meant:
CASE
WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
WHEN aml_cluster_id = 0 and high_income = 0 THEN 0.337
...
END AS high_income_encoded
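The corrected single-CASE form above is standard SQL, so it can be checked without a Spark cluster; a minimal sketch using Python's built-in sqlite3, with a made-up in-memory table standing in for `event_rate_holder` (only the first four branches are shown):

```python
import sqlite3

# In-memory table standing in for the Spark temp view; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_rate_holder (high_income INTEGER, aml_cluster_id INTEGER)"
)
conn.executemany(
    "INSERT INTO event_rate_holder VALUES (?, ?)",
    [(1, 0), (0, 0), (1, 1), (0, 1)],
)

# Single CASE with one WHEN per condition, closed by one END, aliased at the end.
q = """
SELECT *,
       CASE
           WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
           WHEN aml_cluster_id = 0 AND high_income = 0 THEN 0.337
           WHEN aml_cluster_id = 1 AND high_income = 1 THEN 0.049
           WHEN aml_cluster_id = 1 AND high_income = 0 THEN 0.402
       END AS high_income_encoded
FROM event_rate_holder
"""
rows = conn.execute(q).fetchall()
print(rows)
```

The same query string works unchanged with `spark.sql(q)` once the view exists.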
You will need a `case ... end` for each condition in the query. You will also need backticks around the column names, and the `high_income_encoded` column name should be aliased at the end. So the correct query is as follows:
q = """select * ,
case when `aml_cluster_id` = 0 and `high_income` = 1 then 0.162 else
case when `aml_cluster_id` = 0 and `high_income` = 0 then 0.337 else
case when `aml_cluster_id` = 1 and `high_income` = 1 then 0.049 else
case when `aml_cluster_id` = 1 and `high_income` = 0 then 0.402 else
case when `aml_cluster_id` = 2 and `high_income` = 1 then 0.005 else
case when `aml_cluster_id` = 2 and `high_income` = 0 then 0.0 else
case when `aml_cluster_id` = 3 and `high_income` = 1 then 0.023 else
case when `aml_cluster_id` = 3 and `high_income` = 0 then 0.022
end
end
end
end
end
end
end
end as `high_income_encoded`
from event_rate_holder"""
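The eight WHEN branches amount to a lookup on the `(aml_cluster_id, high_income)` pair, so the encoding can also be kept as a plain Python dictionary and the query generated from it; a hypothetical sketch, not part of the original answer:

```python
# Event-rate values copied from the query above, keyed by
# (aml_cluster_id, high_income).
encoding = {
    (0, 1): 0.162, (0, 0): 0.337,
    (1, 1): 0.049, (1, 0): 0.402,
    (2, 1): 0.005, (2, 0): 0.0,
    (3, 1): 0.023, (3, 0): 0.022,
}

def high_income_encoded(aml_cluster_id, high_income):
    """Mirror of the CASE expression: map the pair to its event rate."""
    return encoding[(aml_cluster_id, high_income)]

# Generate one WHEN clause per dictionary entry instead of writing them by hand.
whens = "\n".join(
    f"when `aml_cluster_id` = {c} and `high_income` = {h} then {v}"
    for (c, h), v in encoding.items()
)
q = f"select *, case\n{whens}\nend as `high_income_encoded`\nfrom event_rate_holder"
```

Generating the clauses this way keeps the encoding table in one place if the cluster event rates change.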