
How to filter out boolean fields in a Spark DataFrame?

I have three columns in my DataFrame; the second and third are boolean fields. I want to filter the rows where the value is true. I tried nn.filter(col("col3")===true).show, but it fails with "Invalid column name 'true'". What is wrong with my code? Any help, please?

My code:

scala> nn.printSchema
root
 |-- ID: integer (nullable = true)
 |-- col2: boolean (nullable = true)
 |-- col3: boolean (nullable = true)

scala> nn.show
+---+-----+-----+
| ID| col2| col3|
+---+-----+-----+
|  4| true|false|
|  5|false|false|
|  6|false|false|
|  7|false|false|
| 12|false|false|
| 13|false|false|
| 14|false|false|
| 15|false| true|
| 16|false|false|
| 17|false|false|
| 18|false|false|
| 22|false|false|
| 36|false|false|
| 37|false|false|
| 38|false|false|
| 39|false|false|
| 40|false|false|
| 41| true|false|
| 42|false|false|
+---+-----+-----+

scala> nn.filter(col("col3")===true).show
[Stage 14:>                                                         (0 + 1) / 1]19/05/26 22:44:16 ERROR executor.Executor: Exception in task 0.0 in stage 14.0 (TID 14)
com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name 'true'.
        at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:217)
        at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1655)
        at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:440)
        at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:385)
        at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7505)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2445)
        at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:191)
        at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:166)
        at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:297)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:301)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
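
For context, the stack trace shows the failure inside JDBCRDD.compute, which suggests Spark pushed the filter down to SQL Server and the generated WHERE clause ended up referencing 'true' as a column name. A minimal sketch of one possible workaround, assuming nn is a JDBC-backed DataFrame as the trace implies (this approach is illustrative, not from the original post):

import org.apache.spark.sql.functions.col

// Hedged sketch: caching places the filter above an in-memory relation,
// so the boolean predicate is evaluated by Spark itself rather than
// being pushed down over JDBC to SQL Server.
nn.cache().filter(col("col3")).show()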

You can apply the filter on the boolean column directly. Why apply the condition col("col3")===true at all? Your column values are already of boolean type, and a condition inside a filter evaluates to a boolean true/false anyway. When your column is already a boolean, why test it against a boolean again?

scala> val someDf = Seq((1, true, false), (2, true, true)).toDF("col1", "col2", "col3")
someDf: org.apache.spark.sql.DataFrame = [col1: int, col2: boolean ... 1 more field]

We have a DF whose values are:

scala> someDf.show
+----+----+-----+
|col1|col2| col3|
+----+----+-----+
|   1|true|false|
|   2|true| true|
+----+----+-----+

Now apply the filter:

scala> someDf.filter(col("col3")).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|true|true|
+----+----+----+
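
A quick follow-up sketch (not from the original answer): since Column supports the usual boolean operators, the same idea extends to negation and combinations.

import org.apache.spark.sql.functions.col

// Rows where col3 is false: Column defines unary_! for negation.
someDf.filter(!col("col3")).show()

// Rows where both boolean columns are true: && combines two Columns.
someDf.filter(col("col2") && col("col3")).show()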

Thanks.

=== is overridden in Column.scala; see the Spark source code.
In your case, that overridden method is the one that gets called.
To avoid this, either:
1. add a space after the column object, e.g. nn.filter(col("col3") === true) (note the space after col("col3")), or
2. use the approach @Learner suggested, e.g. nn.filter(col("col3")).

import spark.implicits._

val someDf = Seq((1, true, false), (2, true, true)).toDF("col1", "col2", "col3")

someDf.show()

import org.apache.spark.sql.functions._

someDf.filter(col("col3")===true).show()


Original DataFrame:
+----+----+-----+
|col1|col2| col3|
+----+----+-----+
|   1|true|false|
|   2|true| true|
+----+----+-----+

Filtered DataFrame:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|true|true|
+----+----+----+
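
For completeness, a small hedged variant (my addition, not the answerer's): the same comparison can be spelled with an explicit Column literal via lit, which some find more readable.

import org.apache.spark.sql.functions.{col, lit}

// Equivalent spelling: compare against an explicit boolean literal Column.
someDf.filter(col("col3") === lit(true)).show()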

