RDD of CassandraRow not working with take-command - why?

I am working through some exercises on the DataStax VM.

A Cassandra table is given, and I am supposed to filter it and retrieve the top 5 elements using Spark API functions rather than Cassandra query functions.

This is what I am doing:

import com.datastax.spark.connector._

val cassRdd = sc.cassandraTable("killr_video", "videos_by_year_title")
val cassRdd2 = cassRdd.filter(r => r.getString("title") >= "T")
println("1: " + cassRdd2)
println("2: " + cassRdd2.count)
println("3: " + cassRdd2.take(5))
println("4: " + cassRdd2.take(5).count) // does not compile

This results in:

  • 1: MapPartitionsRDD[185] at filter at :19
  • 2: 2250
  • 3: [Lcom.datastax.spark.connector.CassandraRow;@56fd2e09
  • 4: compile error (missing arguments for method count in trait TraversableOnce)

What I expected:

  • 1: and 2: work as expected.
  • 3: returns only one row? I would expect an RDD of 5 CassandraRows.
  • 4: this is not the RDD count applied to the result of 3:, hence I didn't expect it to work. It looks like some kind of CassandraRow count method that I was not meant to call.

The solution given by DataStax takes the RDD and applies a map transformation to it that keeps only the title. The filtering and the take command are then done on that new RDD of titles.

OK, that works, but I don't understand why take does not work on an RDD of CassandraRow, or what its result is supposed to be.

val cassRdd2 = cassRdd.map(r => r.getString("title")).filter(t => t >= "T")

I thought the take command would always do the same thing on any RDD, regardless of its contents: take the first x elements and produce a new RDD of exactly the same type containing x elements.

rdd.take(n) actually moves n elements to the driver and returns them as an array; see the ScalaDoc. If you want to print them:

println("3: " + cassRdd2.take(5).toList)

or cassRdd2.take(5).foreach(println). The last line of your snippet does not compile because, on arrays, the number of elements is called length (or size), while count on an array expects a predicate argument:

println("4: " + cassRdd2.take(5).length)
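The two points above can be reproduced without Spark or Cassandra, because they come from how Scala arrays behave. The following is a minimal sketch in plain Scala; the object name `TakeResultDemo` and the sample titles are made up for illustration and merely stand in for whatever `take(5)` would return.

```scala
object TakeResultDemo {
  // Hypothetical stand-in for the Array that rdd.take(5) hands back to the driver.
  val taken: Array[String] =
    Array("Taxi Driver", "Titanic", "Toy Story", "Up", "Vertigo")

  def main(args: Array[String]): Unit = {
    // Arrays use Java's default toString, so this prints a type tag plus a
    // hash code (something like [Ljava.lang.String;@...) instead of the elements.
    println("3: " + taken)

    // Converting to a List, or joining the elements, shows the contents.
    println("3: " + taken.toList)
    println("3: " + taken.mkString(", "))
    taken.foreach(println)

    // The number of elements of an array is length (or size).
    println("4: " + taken.length)
    println("4: " + taken.size)

    // count exists on arrays too, but it needs a predicate:
    // it counts the elements for which the predicate holds.
    println("4: " + taken.count(_.startsWith("T")))
  }
}
```

The `[L...;@...` output in result 3 is therefore not a single row: it is the default string form of the whole array.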

I mixed something up:

take is an action, so I shouldn't expect an RDD. (But what is the result then? Some binary? Does it have a name? Some kind of collection? Could it also be a single value such as a String or an Int, if it fits?)

On that result I shouldn't use count as I would on an RDD; instead I should use size, as I would on Java collections. By the way, count is also an action. Calling an action on the result of another action sounds dumb, but it felt so intuitive.
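To make the difference between transformations and actions concrete, here is a small sketch that runs against a local Spark context instead of Cassandra. It assumes spark-core is on the classpath; the object name `ActionsVsTransformations`, the method `summarize`, and the sample titles are hypothetical. In spark-shell the context `sc` already exists and would be passed in directly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ActionsVsTransformations {
  // Returns (count after filtering, first three titles, count after re-parallelizing).
  def summarize(sc: SparkContext): (Long, List[String], Long) = {
    // Made-up sample data standing in for the titles read from Cassandra.
    val titles: RDD[String] =
      sc.parallelize(Seq("Alien", "Taxi Driver", "Titanic", "Up", "Vertigo", "Zodiac"))

    // Transformation: lazy, and the result is another RDD.
    val fromT: RDD[String] = titles.filter(_ >= "T")

    // Actions: each runs a job and returns a plain value to the driver.
    val total: Long = fromT.count()               // a single number
    val firstThree: Array[String] = fromT.take(3) // a local array, not an RDD

    // If the first rows are needed as an RDD again, distribute the array once more.
    val backToRdd: RDD[String] = sc.parallelize(firstThree)

    (total, firstThree.toList, backToRdd.count())
  }

  def main(args: Array[String]): Unit = {
    // Local single-threaded context, for illustration only.
    val conf = new SparkConf().setAppName("take-demo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try println(summarize(sc))
    finally sc.stop()
  }
}
```

So the result of an action is an ordinary driver-side value whose type depends on the action: a `Long` for `count`, an `Array` of the element type for `take`.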
