Pyspark SQL: using case when statements
I have a data frame which looks like this:
>>> df_w_cluster.select('high_income', 'aml_cluster_id').show(10)
+-----------+--------------+
|high_income|aml_cluster_id|
+-----------+--------------+
|          0|             0|
|          0|             0|
|          0|             1|
|          0|             1|
|          0|             0|
|          0|             0|
|          0|             1|
|          1|             1|
|          1|             0|
|          1|             0|
+-----------+--------------+
only showing top 10 rows
The high_income column is a binary column and holds either 0 or 1. The aml_cluster_id column holds values from 0 up to 3. I want to create a new column whose values depend on the values of high_income and aml_cluster_id in that particular row. I am trying to achieve this using SQL.
df_w_cluster.createTempView('event_rate_holder')
To accomplish this, I have written a query like so:
q = """select * , case
when "aml_cluster_id" = 0 and "high_income" = 1 then "high_income_encoded" = 0.162 else
when "aml_cluster_id" = 0 and "high_income" = 0 then "high_income_encoded" = 0.337 else
when "aml_cluster_id" = 1 and "high_income" = 1 then "high_income_encoded" = 0.049 else
when "aml_cluster_id" = 1 and "high_income" = 0 then "high_income_encoded" = 0.402 else
when "aml_cluster_id" = 2 and "high_income" = 1 then "high_income_encoded" = 0.005 else
when "aml_cluster_id" = 2 and "high_income" = 0 then "high_income_encoded" = 0.0 else
when "aml_cluster_id" = 3 and "high_income" = 1 then "high_income_encoded" = 0.023 else
when "aml_cluster_id" = 3 and "high_income" = 0 then "high_income_encoded" = 0.022 else
from event_rate_holder"""
When I run it in Spark using
spark.sql(q)
I get the following error:
mismatched input 'aml_cluster_id' expecting <EOF>(line 1, pos 22)
Any idea how to overcome this?
EDIT:
I edited the query according to the suggestion in the comments, to the following:
q = """select * , case
when aml_cluster_id = 0 and high_income = 1 then high_income_encoded = 0.162 else
when aml_cluster_id = 0 and high_income = 0 then high_income_encoded = 0.337 else
when aml_cluster_id = 1 and high_income = 1 then high_income_encoded = 0.049 else
when aml_cluster_id = 1 and high_income = 0 then high_income_encoded = 0.402 else
when aml_cluster_id = 2 and high_income = 1 then high_income_encoded = 0.005 else
when aml_cluster_id = 2 and high_income = 0 then high_income_encoded = 0.0 else
when aml_cluster_id = 3 and high_income = 1 then high_income_encoded = 0.023 else
when aml_cluster_id = 3 and high_income = 0 then high_income_encoded = 0.022 end
from event_rate_holder"""
but I am still getting errors:
== SQL ==
select * , case
when aml_cluster_id = 0 and high_income = 1 then high_income_encoded = 0.162 else
-----^^^
followed by
pyspark.sql.utils.ParseException: "\nmismatched input 'aml_cluster_id' expecting <EOF>(line 2, pos 5)\n\n== SQL ==\nselect * ,
The correct syntax for the CASE variant you use is
CASE
WHEN e1 THEN e2 [ ...n ]
[ ELSE else_result_expression ]
END
So:
There is no place for name = something there. Each THEN takes a result expression, not an assignment.
ELSE is allowed once per CASE, not after each WHEN.
Your original code is missing the closing END.
You probably meant
CASE
WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
WHEN aml_cluster_id = 0 and high_income = 0 THEN 0.337
...
END AS high_income_encoded
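For reference, the elided branches above can be filled in from the values in the question. The following is a minimal sketch that builds the complete query string in plain Python; RATES and build_case_query are hypothetical names introduced here for illustration, and the table name is the temp view registered in the question.

```python
# Rates copied from the question, keyed by (aml_cluster_id, high_income).
# RATES and build_case_query are illustrative names, not part of any API.
RATES = {
    (0, 1): 0.162,
    (0, 0): 0.337,
    (1, 1): 0.049,
    (1, 0): 0.402,
    (2, 1): 0.005,
    (2, 0): 0.0,
    (3, 1): 0.023,
    (3, 0): 0.022,
}


def build_case_query(rates, table="event_rate_holder", alias="high_income_encoded"):
    """Return a SELECT with one flat CASE ... END covering every key in `rates`."""
    branches = "\n".join(
        f"        WHEN aml_cluster_id = {cluster} AND high_income = {income} THEN {value}"
        for (cluster, income), value in rates.items()
    )
    # One CASE, no ELSE, a single END, and the alias after END.
    return f"SELECT *,\n    CASE\n{branches}\n    END AS {alias}\nFROM {table}"


q = build_case_query(RATES)
print(q)
# With the temp view from the question registered, this would be run as:
# spark.sql(q).show(10)
```

Because there is no ELSE, any row that matches none of the branches would get NULL in high_income_encoded.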
If you chain the conditions with else, each when needs its own case ... end pair. Column names should be quoted with backticks (`) rather than double quotes, because Spark SQL by default reads double-quoted text as a string literal (for plain names like these the backticks are optional), and the high_income_encoded alias belongs at the end, after the last end. So the correct query is as follows:
q = """select * ,
case when `aml_cluster_id` = 0 and `high_income` = 1 then 0.162 else
case when `aml_cluster_id` = 0 and `high_income` = 0 then 0.337 else
case when `aml_cluster_id` = 1 and `high_income` = 1 then 0.049 else
case when `aml_cluster_id` = 1 and `high_income` = 0 then 0.402 else
case when `aml_cluster_id` = 2 and `high_income` = 1 then 0.005 else
case when `aml_cluster_id` = 2 and `high_income` = 0 then 0.0 else
case when `aml_cluster_id` = 3 and `high_income` = 1 then 0.023 else
case when `aml_cluster_id` = 3 and `high_income` = 0 then 0.022
end
end
end
end
end
end
end
end as `high_income_encoded`
from event_rate_holder"""
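The nested form above and a single flat CASE with several WHEN branches should give the same values. Here is a rough sketch that checks this on the ten sample rows from the question. It assumes Python's built-in sqlite3 as a stand-in engine, chosen only so the check runs without a Spark session. The CASE expression is standard SQL, but this is not Spark itself, and the names RATES, SAMPLE_ROWS and encoded are made up for the example.

```python
import sqlite3

# (aml_cluster_id, high_income) -> rate, copied from the question.
RATES = {
    (0, 1): 0.162, (0, 0): 0.337,
    (1, 1): 0.049, (1, 0): 0.402,
    (2, 1): 0.005, (2, 0): 0.0,
    (3, 1): 0.023, (3, 0): 0.022,
}

# The ten (high_income, aml_cluster_id) rows shown in the question.
SAMPLE_ROWS = [(0, 0), (0, 0), (0, 1), (0, 1), (0, 0),
               (0, 0), (0, 1), (1, 1), (1, 0), (1, 0)]

conds = [(f"aml_cluster_id = {c} and high_income = {h}", v)
         for (c, h), v in RATES.items()]

# Flat form: one CASE, many WHEN branches, a single END.
flat = "case " + " ".join(f"when {cond} then {v}" for cond, v in conds) + " end"

# Nested form: each ELSE opens a new CASE, so every CASE needs its own END.
nested = (" else ".join(f"case when {cond} then {v}" for cond, v in conds)
          + " end" * len(conds))

# sqlite3 stands in for Spark here (an assumption made for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table event_rate_holder (high_income integer, aml_cluster_id integer)"
)
conn.executemany("insert into event_rate_holder values (?, ?)", SAMPLE_ROWS)


def encoded(case_expr):
    """Evaluate a CASE expression for every sample row, in insertion order."""
    sql = (f"select {case_expr} as high_income_encoded "
           "from event_rate_holder order by rowid")
    return [row[0] for row in conn.execute(sql)]


flat_values = encoded(flat)
nested_values = encoded(nested)
print(flat_values == nested_values)
```

The flat form is usually easier to read and maintain, since adding a branch does not require adding a matching end.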