
How to convert RDD[CassandraRow] to List[CassandraRow] in Scala without using collect()

I want to convert an RDD[CassandraRow] to a List[CassandraRow] in Scala. With the code below I am running into a memory problem:

val rowKeyRdd: Array[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress").collect()

val clientPartitionKeys = rowKeyRdd.map(x => ClientPartitionKey(
  x.getString("customer_id"), x.getString("uniqueaddress"))).toList

val clientRdd: RDD[CassandraRow] =
sc.parallelize(clientPartitionKeys).joinWithCassandraTable(keyspace, table)
  .where("eventtime >= ?", startDate)
  .where("eventtime <= ?", endDate)
  .map(x => x._2)

clientRdd.cache()

I removed the cache() call, but I am still getting the problem:

 org.jboss.netty.channel.socket.nio.AbstractNioSelector
 WARNING: Unexpected exception in the selector loop.
 java.lang.OutOfMemoryError: Java heap space
at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:80)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR 2016-02-12 07:54:48 akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]

java.lang.OutOfMemoryError: GC overhead limit exceeded

How can I avoid this memory problem? I tried with 8 GB per core, and the table contains millions of records.

In this line, your variable name suggests you have an RDD, but in fact, because you are using collect(), it is not an RDD; as your type declaration shows, it is an Array:

val rowKeyRdd: Array[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress").collect()

This pulls all the data from the workers into the Driver program, so the amount of memory on the workers (8 GB per core) is not the problem; there is not enough memory in the Driver to handle this collect.
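
One way to gauge how much data that collect() would have pulled into the Driver is to run count() on the same selection; count() is an action that returns only a Long, so the rows themselves never land on the Driver. A rough sketch, reusing the keyspace and table values from the question:

val rowCount = sc.cassandraTable(keyspace, table)
  .select("customer_id", "uniqueaddress")
  .count() // only a Long comes back to the driver, not the rows
println(s"rows that collect() would have copied to the driver: $rowCount")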

Since all you do with this data is map it and then re-parallelize it back into an RDD, you should instead map it without ever calling collect(). I haven't tried the code below since I don't have access to your data set, but it should be approximately correct:

val rowKeyRdd: RDD[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress")

val clientPartitionKeysRDD = rowKeyRdd.map(x => ClientPartitionKey(
  x.getString("customer_id"), x.getString("uniqueaddress")))

val clientRdd: RDD[CassandraRow] =
  clientPartitionKeysRDD.joinWithCassandraTable(keyspace, table)
    .where("eventtime >= ?", startDate)
    .where("eventtime <= ?", endDate)
    .map(x => x._2)

clientRdd.cache()
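
If the joined RDD is still large, caching it entirely in memory puts extra pressure on the executors. A minimal sketch of two follow-up options, assuming the clientRdd defined above: persist with a storage level that can spill to disk instead of calling cache(), and, if the rows really must end up on the Driver as a local collection, stream them one partition at a time with toLocalIterator rather than collect().

import org.apache.spark.storage.StorageLevel

// Replaces the clientRdd.cache() call above: partitions that do not fit
// in executor memory are spilled to disk instead of adding memory pressure.
clientRdd.persist(StorageLevel.MEMORY_AND_DISK)

// If a local List is unavoidable, toLocalIterator fetches one partition
// at a time, so driver memory is bounded by the largest partition
// rather than by the whole result set.
val localRows = clientRdd.toLocalIterator
localRows.take(10).foreach(println)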
