
How to convert RDD[CassandraRow] to List[CassandraRow] in Scala without using collect()

I want to convert an RDD[CassandraRow] to a List[CassandraRow] in Scala. With the code below I am running into a memory problem:

val rowKeyRdd: Array[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress").collect()

val clientPartitionKeys = rowKeyRdd.map(x => ClientPartitionKey(
  x.getString("customer_id"), x.getString("uniqueaddress"))).toList

val clientRdd: RDD[CassandraRow] =
sc.parallelize(clientPartitionKeys).joinWithCassandraTable(keyspace, table)
  .where("eventtime >= ?", startDate)
  .where("eventtime <= ?", endDate)
  .map(x => x._2)

clientRdd.cache()

I removed the cache() call, but I am still getting the problem:

 org.jboss.netty.channel.socket.nio.AbstractNioSelector
 WARNING: Unexpected exception in the selector loop.
 java.lang.OutOfMemoryError: Java heap space
at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:80)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR 2016-02-12 07:54:48 akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]

java.lang.OutOfMemoryError: GC overhead limit exceeded

How can I avoid this memory problem? I tried with 8 GB per core, and the table contains millions of records.

In this line, your variable name suggests you have an RDD, but in fact, because you are using collect(), it is not an RDD; as your type declaration shows, it is an Array:

val rowKeyRdd: Array[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress").collect()

This pulls all the data from the workers into the Driver program, so the amount of memory on the workers (8 GB per core) is not the problem; there is not enough memory in the Driver to handle this collect.
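
One way to gauge how much data that collect() would have pulled into the Driver is to run count() on the same selection; count() is an action that returns only a Long, so the rows themselves never land on the Driver. A rough sketch, reusing the keyspace and table values from the question:

val rowCount = sc.cassandraTable(keyspace, table)
  .select("customer_id", "uniqueaddress")
  .count() // only a Long comes back to the driver, not the rows
println(s"rows that collect() would have copied to the driver: $rowCount")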

Since all you do with this data is map it and then re-parallelize it back into an RDD, you should instead map it without ever calling collect(). I haven't tried the code below since I don't have access to your data set, but it should be approximately correct:

val rowKeyRdd: RDD[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress")

val clientPartitionKeysRDD = rowKeyRdd.map(x => ClientPartitionKey(
  x.getString("customer_id"), x.getString("uniqueaddress")))

val clientRdd: RDD[CassandraRow] =
  clientPartitionKeysRDD.joinWithCassandraTable(keyspace, table)
    .where("eventtime >= ?", startDate)
    .where("eventtime <= ?", endDate)
    .map(x => x._2)

clientRdd.cache()
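
If the joined RDD is still large, caching it entirely in memory puts extra pressure on the executors. A minimal sketch of two follow-up options, assuming the clientRdd defined above: persist with a storage level that can spill to disk instead of calling cache(), and, if the rows really must end up on the Driver as a local collection, stream them one partition at a time with toLocalIterator rather than collect().

import org.apache.spark.storage.StorageLevel

// Replaces the clientRdd.cache() call above: partitions that do not fit
// in executor memory are spilled to disk instead of adding memory pressure.
clientRdd.persist(StorageLevel.MEMORY_AND_DISK)

// If a local List is unavoidable, toLocalIterator fetches one partition
// at a time, so driver memory is bounded by the largest partition
// rather than by the whole result set.
val localRows = clientRdd.toLocalIterator
localRows.take(10).foreach(println)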
