简体   繁体   中英

RDD of CassandraRow not working with take-command - why?

I am doing some exercises of the DataStax VM.

A CassandraTable is given and I shall do some filtering and retrieving the top 5 elements using Spark API functions rather than cassandra-query-functions.

There I am doing the following:

val cassRdd = sc.cassandraTable("killr_video", "videos_by_year_title")
val cassRdd2 = cassRdd.filter(r=>r.getString("title") >= "T")
println("1" : + cassRdd2)
println("2" : + cassRdd2.count)
println("3" : + cassRdd2.take(5))
println("4" : + cassRdd2.take(5).count)

Results in:

  • 1: MapPartitionsRDD[185] at filter at :19
  • 2: 2250
  • 3: [Lcom.datastax.spark.connector.CassandraRow;@56fd2e09
  • 4: compile Error (missing arguments for method count in trait TraversableOnce

What I have expected:

  • 1: and 2: work as expected
  • 3: returns only one row? I would expect a RDD of 5 cassandra Rows
  • 4: this isn't the rdd count after 3:, hence I didn expect it to work, looks like its some kind of cassandraRow-count-method I was not intended to call

The solution given by Datastax uses the RDD and does a map-transformation on it, to only take the title and on that new title-rdd it does the filtering and the take-command.

Ok, works, but I don't understand, why take does not work on a RDD-of CassandraRow or what the result of that may be.

val cassRdd2 = cassRdd.map(r=>r.getString("title")).filter(t >= "T")

I thought the take-command on any RDD (regardless its contents) would do always the same, taking the first x elements resulting in a new RDD of the exact same type with a size of x elements.

rdd.take(n) actually moves n elements to the driver and returns them as an array, see ScalaDoc . If you want to print them:

println("3" : + cassRdd2.take(5).toList)

or cassRdd2.take(5).foreach(println) . The last line does not work as the method is called length (or size ) for arrays:

println("4" : + cassRdd2.take(5).length)

I mixed something up:

take is an action, I shouldn't expect a RDD (but what is it? some binary? does it have a name? some kind of collection? may also a single Value like String or int if it fits)

On that I shouldn't use count as used to do on RDDs, rather I should use size as used to do on java-collections. By the way, count is also an action, using an action after an action sound like dump but it was so intuitive.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM