
Spark Cassandra Connector: for comprehension error (type mismatch)

Problem

Maybe this is due to my lack of Scala knowledge, but it seems like adding another level to the for comprehension should just work. If the first line of the for comprehension is commented out, the code works. I ultimately want a Set[Int] instead of '1 to 2', but it serves to show the problem. The first two lines of the for should not need a type specifier, but I include them to show that I've tried the obvious.

Tools/Jars

  • IntelliJ 2016.1
  • Java 8
  • Scala 2.10.5
  • Cassandra 3.x
  • spark-assembly-1.6.0-hadoop2.6.0.jar (pre-built)
  • spark-cassandra-connector_2.10-1.6.0-M1-SNAPSHOT.jar (pre-built)
  • spark-cassandra-connector-assembly-1.6.0-M1-SNAPSHOT.jar (I built)

Code

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import com.datastax.spark.connector._

case class NotifHist(intnotifhistid:Int, eventhistids:Seq[Int], yosemiteid:String, initiatorname:String)
case class NotifHistSingle(intnotifhistid:Int, inteventhistid:Int, dataCenter:String, initiatorname:String)

object SparkCassandraConnectorJoins {
  def joinQueryAfterMakingExpandedRdd(sc:SparkContext, orgNodeId:Int) {

  val notifHist:RDD[NotifHistSingle] = for {
    orgNodeId:Int <- 1 to 2   // comment out this line and it works
    notifHist:NotifHist <- sc.cassandraTable[NotifHist](keyspace, "notifhist").where("intorgnodeid = ?", orgNodeId)
    eventHistId <- notifHist.eventhistids
  } yield NotifHistSingle(notifHist.intnotifhistid, eventHistId, notifHist.yosemiteid, notifHist.initiatorname)
  ...etc...
  }
}

Compilation Output

Information:3/29/16 8:52 AM - Compilation completed with 1 error and 0 warnings in 1s 507ms
       /home/jpowell/Projects/SparkCassandraConnector/src/com/mir3/spark/SparkCassandraConnectorJoins.scala
Error:(88, 21) type mismatch;
 found   : scala.collection.immutable.IndexedSeq[Nothing]
 required: org.apache.spark.rdd.RDD[com.mir3.spark.NotifHistSingle]
      orgNodeId:Int <- 1 to 2
                    ^

Later

@slouc Thanks for the comprehensive answer. I was using the for comprehension's syntactic sugar to also keep state from the second statement, to fill elements in the NotifHistSingle constructor, so I don't see how to get the equivalent map/flatMap to work. Therefore, I went with the following solution:

def joinQueryAfterMakingExpandedRdd(sc:SparkContext, orgNodeIds:Set[Int]) {

  def notifHistForOrg(orgNodeId:Int): RDD[NotifHistSingle] = {
    for {
      notifHist <- sc.cassandraTable[NotifHist](keyspace, "notifhist").where("intorgnodeid = ?", orgNodeId)
      eventHistId <- notifHist.eventhistids
    } yield NotifHistSingle(notifHist.intnotifhistid, eventHistId, notifHist.yosemiteid, notifHist.initiatorname)
  }
  val emptyTable:RDD[NotifHistSingle] = sc.emptyRDD[NotifHistSingle]
  val notifHistForAllOrgs:RDD[NotifHistSingle] = orgNodeIds.foldLeft(emptyTable)((accum, oid) => accum ++ notifHistForOrg(oid))
}
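The same foldLeft/++ accumulation can be sketched on plain Scala collections; itemsForOrg below is a made-up stand-in for the Cassandra-backed notifHistForOrg, so this only illustrates the shape of the pattern, not the Spark code itself:

```scala
// itemsForOrg is a hypothetical stand-in for notifHistForOrg above.
def itemsForOrg(orgNodeId: Int): Seq[Int] = Seq(orgNodeId * 10, orgNodeId * 10 + 1)

val orgNodeIds = Set(1, 2)

// Fold each per-org result into a single accumulated sequence.
val all: Seq[Int] = orgNodeIds.toSeq.sorted
  .foldLeft(Seq.empty[Int])((accum, oid) => accum ++ itemsForOrg(oid))
```

With real RDDs, the repeated ++ can also be written as a single sc.union over the per-org RDDs, which avoids chaining many union nodes one at a time.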

A for comprehension is actually syntactic sugar; what's really going on underneath is a series of chained flatMap calls, with a single map at the end which replaces the yield. The Scala compiler translates every for comprehension like this. If you use if conditions in your for comprehension, they are translated into filters, and if you don't yield anything, foreach is used. For more information, see here.
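As a minimal illustration (plain collections, not Spark), the two forms below compile to the same thing:

```scala
// A two-generator for comprehension...
val sugared = for {
  x <- 1 to 2
  y <- Seq("a", "b")
} yield s"$x$y"

// ...desugars into a flatMap with a final map replacing the yield:
val desugared = (1 to 2).flatMap(x => Seq("a", "b").map(y => s"$x$y"))
```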

So, to explain your case - this:

val notifHist:RDD[NotifHistSingle] = for {
  orgNodeId:Int <- 1 to 2   // comment out this line and it works
  notifHist:NotifHist <- sc.cassandraTable[NotifHist](keyspace, "notifhist").where("intorgnodeid = ?", orgNodeId)
  eventHistId <- notifHist.eventhistids
} yield NotifHistSingle(...)

is actually translated by the compiler to this:

val notifHist:RDD[NotifHistSingle] = (1 to 2)
  .flatMap(x => sc.cassandraTable[NotifHist](keyspace, "notifhist").where("intorgnodeid = ?", x))
  .flatMap(x => x.eventhistids)
  .map(x => NotifHistSingle(...))

You are getting the error if you include the 1 to 2 line because that makes your for comprehension operate on a sequence (a Vector, to be more precise). So when invoking flatMap, the compiler expects you to follow up with a function that transforms each element of your vector into a GenTraversableOnce. If you take a closer look at the signature of flatMap (most IDEs will display it just by hovering over it) you can see it for yourself:

def flatMap[B, That](f: A => GenTraversableOnce[B])(implicit bf: CanBuildFrom[Repr, B, That]): That

This is the problem. The compiler doesn't know how to flatMap the vector 1 to 2 using a function that returns a CassandraRDD. It wants a function that returns a GenTraversableOnce. If you remove the 1 to 2 line then you remove this restriction.
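The restriction can be reproduced with plain collections, without Spark at all; flatMap flattens only when the function returns something traversable:

```scala
// Flattening works when each element maps to a GenTraversableOnce:
val ok: Seq[Int] = Seq(1, 2).flatMap(i => Seq(i, i * 10))

// The analogue of the RDD case does not compile, because an arbitrary
// object is not a GenTraversableOnce:
// Seq(1, 2).flatMap(i => new Object)   // error: type mismatch
```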

Bottom line - if you want to use a for comprehension and yield values out of it, you have to obey the type rules. It's impossible to flatten a sequence whose elements are not sequences and cannot be turned into sequences.

You can always map instead of flatMap, since map is less restrictive (it requires A => B instead of A => GenTraversableOnce[B]). This means that instead of getting all results in one giant sequence, you will get a sequence where each element is a group of results (one group for each query). You can also play around with the types, trying to get a GenTraversableOnce from your query result (e.g. invoking sc.cassandraTable().where().toArray or something; I don't really work with Cassandra so I don't know).
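To see the map-vs-flatMap difference concretely, here is a sketch using a made-up fakeQuery function (not the connector API) in place of the Cassandra call:

```scala
// fakeQuery is a hypothetical stand-in for a per-id query.
def fakeQuery(id: Int): Seq[String] = Seq(s"row$id-a", s"row$id-b")

val queries = Seq(1, 2)

// map keeps one group of results per query...
val grouped: Seq[Seq[String]] = queries.map(fakeQuery)

// ...while flatMap concatenates them into one flat sequence.
val flat: Seq[String] = queries.flatMap(fakeQuery)
```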
