Delete rows in PySpark dataframe based on multiple conditions
I have a dataframe with a structure similar to the following:
col1, col2, col3, col4
A,A,A,A
A,B,C,D
B,C,A,D
A,C,A,D
A,F,A,A
A,V,B,A
What I want is to drop the rows where the conditions are met for all columns at the same time. For example, drop rows where
col1 == A
and col2 == C
at the same time. Note that, in this case, the only row that should be dropped would be
"A,C,A,D"
as it's the only one where both conditions are met at the same time. Hence, the dataframe should look like this:
col1, col2, col3, col4
A,A,A,A
A,B,C,D
B,C,A,D
A,F,A,A
A,V,B,A
What I've tried so far is:
# spark library import
import pyspark.sql.functions as F
df = df.filter(
    (F.col("col1") != "A") & (F.col("col2") != "C")
)
This one doesn't filter as I want, because it removes every row where either condition is met, i.e. where
col1 == "A"
or col2 == "C"
, returning:
col1, col2, col3, col4
B,C,A,D
Can anybody please help me out with this?
Thanks
This can be a working solution for you - use a when()
condition to create a helper column based on the conditions you wish for, then filter on it:
from pyspark.sql.functions import when

# Result is True when at least one condition fails, i.e. the row should be kept;
# it stays null only when col1 == 'A' and col2 == 'C' both hold.
df.withColumn('Result', when(df.col1 != 'A', True).when(df.col2 != 'C', True)) \
  .filter('Result == True') \
  .drop('Result') \
  .show()
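To see why the chained when() works, note that Spark evaluates the branches in order and falls through to null when none matches. A minimal plain-Python sketch of that fall-through logic (an illustration of the idea, not Spark code):

```python
def result_flag(row):
    # Mirrors when(col1 != 'A', True).when(col2 != 'C', True):
    # the first matching branch wins; no match yields null (None here).
    col1, col2 = row[0], row[1]
    if col1 != "A":
        return True
    if col2 != "C":
        return True
    return None

rows = [("A", "A", "A", "A"), ("A", "B", "C", "D"), ("B", "C", "A", "D"),
        ("A", "C", "A", "D"), ("A", "F", "A", "A"), ("A", "V", "B", "A")]

# Keeping only rows whose flag is True drops exactly ("A", "C", "A", "D").
kept = [r for r in rows if result_flag(r) is True]
```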
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [("A", "A", "A", "A"), ("A", "B", "C", "D"), ("B", "C", "A", "D"),
     ("A", "C", "A", "D"), ("A", "F", "A", "A"), ("A", "V", "B", "A")],
    ["col1", "col2", "col3", "col4"],
)
df.show(truncate=False)
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|A |A |A |A |
|A |B |C |D |
|B |C |A |D |
|A |C |A |D |
|A |F |A |A |
|A |V |B |A |
+----+----+----+----+
df = df.withColumn(
    "filter_col",
    F.when((F.col("col1") == F.lit("A")) & (F.col("col2") == F.lit("C")), F.lit("1")),
)
df.show()
+----+----+----+----+----------+
|col1|col2|col3|col4|filter_col|
+----+----+----+----+----------+
| A| A| A| A| null|
| A| B| C| D| null|
| B| C| A| D| null|
| A| C| A| D| 1|
| A| F| A| A| null|
| A| V| B| A| null|
+----+----+----+----+----------+
df = df.filter(F.col("filter_col").isNull()).select("col1", "col2", "col3", "col4")
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| A| A| A|
| A| B| C| D|
| B| C| A| D|
| A| F| A| A|
| A| V| B| A|
+----+----+----+----+
Combine both conditions and do a NOT:
cond = (F.col('col1') == 'A') & (F.col('col2') == 'C')
df.filter(~cond)
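The boolean algebra behind this can be checked without a Spark session: by De Morgan's law, negating (col1 == 'A') & (col2 == 'C') is equivalent to keeping rows where col1 != 'A' or col2 != 'C'. A plain-Python sketch over the sample rows (illustration only, not Spark code):

```python
rows = [("A", "A", "A", "A"), ("A", "B", "C", "D"), ("B", "C", "A", "D"),
        ("A", "C", "A", "D"), ("A", "F", "A", "A"), ("A", "V", "B", "A")]

# NOT (both conditions hold) -- the filter(~cond) logic:
kept = [r for r in rows if not (r[0] == "A" and r[1] == "C")]

# De Morgan equivalent: at least one condition fails.
kept_alt = [r for r in rows if r[0] != "A" or r[1] != "C"]

assert kept == kept_alt  # identical results either way
```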