
Apache Spark filter by Some

I have the following leftOuterJoin operation:

val totalsAndProds = transByProd.leftOuterJoin(products)
println(totalsAndProds.first())

which prints:

(19,([Ljava.lang.String;@261ea657,Some([Ljava.lang.String;@25290bca)))

I then try to apply the following filter operation:

totalsAndProds.filter(x => x._2 == Some).first

but it fails with the following exception:

Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
    at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1380)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
    at com.example.spark.WordCount$.main(WordCount.scala:98)
    at com.example.spark.WordCount.main(WordCount.scala)

What am I doing wrong, and why does the filter operation return an empty collection?

Your predicate is wrong: 您的谓词是错误的:

  1. Your RDD type is (Int, (Array[String], Option[Array[String]])), therefore _._2 is of type (Array[String], Option[Array[String]]), not Option[Array[String]].
  2. You do not check Option types with an equality test against Some: the comparison x._2 == Some matches a tuple against the Some companion object, which is never true, so the filter keeps nothing and first() fails on the empty RDD (see the sketch below).
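
A minimal sketch (plain Scala, no Spark required; names are illustrative) of why comparing against the Some companion object never matches:

// Comparing a value with the companion object Some is always false:
// the left-hand side is an Option value, the right-hand side is the object itself.
val opt: Option[Array[String]] = Some(Array("a"))
println(opt == Some)       // false, regardless of the contents
println(opt.isDefined)     // true, the idiomatic presence check
opt match {                // pattern matching works as well
  case Some(arr) => println(arr.mkString(","))
  case None      => println("empty")
}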

Try

totalsAndProds.filter{ case (_, (_, s)) => s.isDefined }

Example below:

scala> val rdd = sc.parallelize(List((19, (Array("a"), Some(Array("a"))))))
rdd: org.apache.spark.rdd.RDD[(Int, (Array[String], Some[Array[String]]))] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.filter{ case (_, (_, s)) => s.isDefined }
res0: org.apache.spark.rdd.RDD[(Int, (Array[String], Some[Array[String]]))] = MapPartitionsRDD[1] at filter at <console>:27

scala> rdd.filter{ case (_, (_, s)) => s.isDefined }.collect
res1: Array[(Int, (Array[String], Some[Array[String]]))] = Array((19,(Array(a),Some([Ljava.lang.String;@5307fee))))
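
As a follow-up, once the rows holding None have been filtered out it is safe to unwrap the Option. A short sketch under the same RDD shape as above (mapValues is part of Spark's standard pair-RDD API):

// Keep only defined values, then strip the Some wrapper.
val unwrapped = rdd
  .filter { case (_, (_, s)) => s.isDefined }
  .mapValues { case (a, s) => (a, s.get) }
// unwrapped: RDD[(Int, (Array[String], Array[String]))]

// Equivalently, filter and unwrap in a single pass:
val unwrapped2 = rdd.flatMap { case (k, (a, s)) => s.map(v => (k, (a, v))) }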
