PySpark: multiple conditions in when clause
I would like to modify the cell values of a dataframe column (Age) where it is currently blank, and I would only do it if another column (Survived) has the value 0 for the corresponding row. If Survived is 1 but Age is blank, then I will keep it as null.
I tried to use the && operator but it didn't work. Here is my code:
tdata.withColumn("Age", when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()
Any suggestions how to handle that? Thanks.
Error message:
File "<ipython-input-33-3e691784411c>", line 1
    tdata.withColumn("Age", when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()
                                                  ^
SyntaxError: invalid syntax
You get a SyntaxError exception because Python has no && operator. It has and and &, where the latter is the correct choice to create boolean expressions on Column (| for a logical disjunction and ~ for logical negation).
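The reason & works on Columns while the and keyword does not can be illustrated without Spark at all: classes can overload the bitwise operators (&, |, ~) to build expression trees, but and/or cannot be overloaded, since Python evaluates the operands' truthiness instead. The toy class below is purely illustrative (the name Expr is hypothetical, not part of PySpark), but it mimics what Column does under the hood:

```python
# Illustrative sketch (plain Python, no Spark needed): PySpark's Column
# overloads the bitwise operators to build SQL expression trees. The
# `Expr` class here is a hypothetical stand-in for Column.
class Expr:
    def __init__(self, text):
        self.text = text

    def __and__(self, other):          # invoked by `&`
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):           # invoked by `|`
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):              # invoked by `~`
        return Expr(f"(NOT {self.text})")


a = Expr("Age = ''")
b = Expr("Survived = 0")

print((a & b).text)  # (Age = '' AND Survived = 0)
# `a and b` would NOT build an expression tree: Python calls bool() on
# the operands, which is exactly why PySpark rejects `and` on Columns.
```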
The condition you created is also invalid because it doesn't consider operator precedence: & in Python has a higher precedence than ==, so the expression has to be parenthesized.
(col("Age") == "") & (col("Survived") == "0")
## Column<b'((Age = ) AND (Survived = 0))'>
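The precedence trap can be demonstrated with plain Python integers, no Spark required: without parentheses, the interpreter parses the expression as a chained comparison around &, not as a conjunction of two comparisons.

```python
# `&` binds tighter than `==`, so an unparenthesized expression is
# parsed as a chained comparison, not as the intended conjunction.
unparenthesized = 1 == 1 & 2 == 2      # parsed as 1 == (1 & 2) == 2
parenthesized = (1 == 1) & (2 == 2)    # the intended conjunction

print(unparenthesized)  # False, since 1 & 2 == 0 and 1 == 0 fails
print(parenthesized)    # True
```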
On a side note, the when function is equivalent to a CASE expression, not a WHEN clause. Still, the same rules apply. Conjunction:
df.where((col("foo") > 0) & (col("bar") < 0))
Disjunction:
df.where((col("foo") > 0) | (col("bar") < 0))
You can of course define the conditions separately to avoid brackets:
cond1 = col("Age") == ""
cond2 = col("Survived") == "0"
cond1 & cond2
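Tying this back to the original question, the fill logic itself (substitute a precomputed mean when Age is blank and Survived is "0", otherwise keep Age) can be sketched in plain Python over a list of rows. This is only an illustration of the condition's semantics, not PySpark code, and the value of mean_age_0 below is a stand-in assumed to have been computed elsewhere:

```python
# Plain-Python sketch of the questioner's fill logic (not PySpark).
mean_age_0 = 30.0  # assumed precomputed mean; stand-in value

rows = [
    {"Age": "", "Survived": "0"},    # blank + did not survive -> filled
    {"Age": "", "Survived": "1"},    # blank + survived -> kept blank
    {"Age": "22", "Survived": "0"},  # already populated -> kept
]

def fill_age(row):
    # Mirrors when((Age == "") & (Survived == "0"), mean_age_0).otherwise(Age)
    if row["Age"] == "" and row["Survived"] == "0":
        return mean_age_0
    return row["Age"]

print([fill_age(r) for r in rows])  # [30.0, '', '22']
```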
This should work at least in PySpark 2.4:
tdata = tdata.withColumn("Age", when((tdata.Age == "") & (tdata.Survived == "0"), "NewValue").otherwise(tdata.Age))
In PySpark, multiple conditions inside when can be built using & (for and) and | (for or).
Note: In PySpark it is important to enclose every expression that combines to form the condition within parentheses ().
%pyspark
from pyspark.sql.functions import when, col

dataDF = spark.createDataFrame([(66, "a", "4"),
                                (67, "a", "0"),
                                (70, "b", "4"),
                                (71, "d", "4")],
                               ("id", "code", "amt"))
dataDF.withColumn("new_column",
                  when((col("code") == "a") | (col("code") == "d"), "A")
                  .when((col("code") == "b") & (col("amt") == "4"), "B")
                  .otherwise("A1")).show()
In Spark Scala code, ( && ) or ( || ) conditions can be used within the when function:
//scala
val dataDF = Seq(
  (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
).toDF("id", "code", "amt")
dataDF.withColumn("new_column",
when(col("code") === "a" || col("code") === "d", "A")
.when(col("code") === "b" && col("amt") === "4", "B")
.otherwise("A1")).show()
Output:
+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66| a| 4| A|
| 67| a| 0| A|
| 70| b| 4| B|
| 71| d| 4| A|
+---+----+---+----------+
This code snippet is copied from sparkbyexamples.com.
It should be:
when(((tdata.Age == "") & (tdata.Survived == "0")), mean_age_0)