简体   繁体   中英

Filter columns in a Spark Scala Dataframe by a specific value when the columns are unknown

I have a dynamically created Spark Dataframe where I need to filter the Dataframe when any of the columns are "False" and store it in one table and store the row where none of the columns are false in one table. Column names and number of columns will never be known.

For example, if my table is

     Col1  Col2   Col3
Row1    True  False  False
Row2    True  True   True
Row3    False False  True
Row4    False False  False

Output should be Table 1:

    Col1  Col2   Col3
Row1    True  False  False
Row3    False False  True
Row4    False False  False

and Table 2

    Col1  Col2   Col3
Row2    True  True   True

I have tried:

val columns: Array[String] = testDF.columns
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'False'"))
val output: DataFrame = seqDfs.reduceRight(_ union _)

But it returns a lot of duplicate values and even if I clear the duplicate values, it wouldn't help me in creating table 2 as all rows in table 2 need to be true.

Any help will be appreciated. Thanks!

Here's code to create your DataFrame:

val df = Seq(
  (true, false, false), 
  (true, true, true),
  (false, false, true),
  (false, false, false)
).toDF("col1", "col2", "col3")

Let's append an all_true column:

val columns = df.columns
val df2 = df.withColumn("all_true", columns.map(col(_).===(true)).reduceLeft(_.&&(_)))

Here's the DataFrame with all true values:

df2.where($"all_true" === true).drop("all_true").show()

+----+----+----+
|col1|col2|col3|
+----+----+----+
|true|true|true|
+----+----+----+

Here's the DataFrame with values that are not all true:

df2.where($"all_true" === false).drop("all_true").show()

+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| true|false|false|
|false|false| true|
|false|false|false|
+-----+-----+-----+

Great question, welcome to StackOverflow;)

BTW, spark-daria has a multiEquals() function where I grabbed this code, see this file .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM