SparkSQL filtering (selecting with where clause) with multiple conditions
Hi, I have the following issue:
numeric.registerTempTable("numeric")
All the values that I want to filter on are literal 'null' strings, not N/A or NULL values.
I tried these three options:
numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')
numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')
sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")
Unfortunately, numeric_filtered is always empty. I checked, and numeric has data that should be filtered based on these conditions.
Here are some sample values:
Low   High  Normal
3.5   5.0   null
2.0   14.0  null
null  38.0  null
null  null  null
1.0   null  4.0
You are using logical conjunction (AND). It means that all columns have to be different from 'null' for a row to be included. Let's illustrate that using the filter version as an example:
numeric = sqlContext.createDataFrame([
('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
('null', 'null', 'null'), ('1.0', 'null', '4.0')],
('low', 'high', 'normal'))
numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
numeric_filtered_2 = numeric_filtered_1.where(
numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null| 4.0|
## +---+----+------+
numeric_filtered_3 = numeric_filtered_2.where(
numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+
All the remaining methods you've tried follow exactly the same pattern. What you need here is a logical disjunction (OR).
from pyspark.sql.functions import col
numeric_filtered = numeric.where(
(col('LOW') != 'null') |
(col('NORMAL') != 'null') |
(col('HIGH') != 'null'))
numeric_filtered.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
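A note on the parentheses around each comparison above: in Python, the bitwise `|` operator binds tighter than `!=`, so dropping the parentheses changes how the expression parses. The same precedence rule that applies to PySpark Column expressions can be demonstrated with plain ints (the values below are illustrative only):

```python
# With parentheses, each comparison is evaluated first, then OR-ed.
# Without them, "a != 0 | b != 0" parses as the chained comparison
# a != (0 | b) != 0, which means something entirely different.
a, b = 2, 2

with_parens = (a != 0) | (b != 0)   # intended meaning -> True
without_parens = a != 0 | b != 0    # a != (0 | b) != 0 -> False

print(with_parens)     # True
print(without_parens)  # False
```

This is why PySpark conditions combined with `|` or `&` must always be wrapped in parentheses, even when each side looks like a self-contained comparison.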
or with raw SQL:
numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
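The logic behind the fix is De Morgan's law: "not every column equals 'null'" is the same as "at least one column differs from 'null'", which is why the ORed inequalities keep all but the all-'null' row. A plain-Python sketch over the same sample rows, without a Spark runtime, makes the two behaviors concrete:

```python
# Emulate the AND and OR filters on the sample rows from above.
rows = [
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'),
    ('null', '38.0', 'null'), ('null', 'null', 'null'),
    ('1.0', 'null', '4.0'),
]

# Conjunction: every column must differ from 'null' -- no row qualifies.
kept_and = [r for r in rows if all(v != 'null' for v in r)]

# Disjunction: at least one column differs from 'null' -- only the
# all-'null' row is dropped.
kept_or = [r for r in rows if any(v != 'null' for v in r)]

print(len(kept_and))  # 0
print(len(kept_or))   # 4
```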
See also: Pyspark: multiple conditions in when clause
A related snippet for counting the distinct non-null values of a column:
from pyspark.sql.functions import countDistinct
totalrecordcount = df.where("ColumnName is not null").select(countDistinct("ColumnName")).collect()[0][0]