
PySpark dataframe filter on multiple columns

Using Spark 2.1.1

Below is my data frame:

id  Name1     Name2
1   Naveen    Srikanth
2   Naveen    Srikanth123
3   Naveen
4   Srikanth  Naveen

Now I need to filter rows based on two conditions: rows 2 and 3 need to be filtered out, because the name in row 2 contains the digits 123 and row 3 has a null value.

The code below filters only row id 2:

df.select("*").filter(df["Name2"].rlike("[0-9]")).show()

I got stuck trying to include the second condition.

Doing the following should solve your issue:

from pyspark.sql.functions import col

# Keep rows where Name2 contains no digits AND is not null;
# in Python, use ~ (not !) for negation, & (not |) to require both
# conditions, and call isNotNull() as a method.
df.filter((~col("Name2").rlike("[0-9]")) & col("Name2").isNotNull())
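For completeness, here is a minimal self-contained sketch (assuming a local SparkSession; the builder settings are illustrative) that rebuilds the sample frame from the question and verifies the corrected filter, which should leave only rows 1 and 4:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Sample data taken from the question; None stands for the missing Name2.
df = spark.createDataFrame(
    [(1, "Naveen", "Srikanth"),
     (2, "Naveen", "Srikanth123"),
     (3, "Naveen", None),
     (4, "Srikanth", "Naveen")],
    ["id", "Name1", "Name2"],
)

# Row 2 (digits in Name2) and row 3 (null Name2) are dropped.
df.filter((~col("Name2").rlike("[0-9]")) & col("Name2").isNotNull()).show()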

Should be as simple as putting multiple conditions into the filter.

import spark.sqlContext.implicits._

val df = List(
  ("Naveen", "Srikanth"),
  ("Naveen", "Srikanth123"),
  ("Naveen", null),
  ("Srikanth", "Naveen")).toDF("Name1", "Name2")

df.filter(!$"Name2".isNull && !$"Name2".rlike("[0-9]")).show

Or, if you prefer not to use the spark-sql $ syntax:

df.filter(!df("Name2").isNull && !df("Name2").rlike("[0-9]")).show 

Or in Python:

df.filter(df["Name2"].isNotNull() & ~df["Name2"].rlike("[0-9]")).show()
