How to convert RDD[CassandraRow] to List[CassandraRow] in Scala without using collect()

I want to convert an RDD[CassandraRow] to a List[CassandraRow] in Scala. With the code below I am running into an out-of-memory problem:

val rowKeyRdd: Array[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress").collect()

val clientPartitionKeys = rowKeyRdd.map(x => ClientPartitionKey(
  x.getString("customer_id"), x.getString("uniqueaddress"))).toList

val clientRdd: RDD[CassandraRow] =
  sc.parallelize(clientPartitionKeys).joinWithCassandraTable(keyspace, table)
    .where("eventtime >= ?", startDate)
    .where("eventtime <= ?", endDate)
    .map(x => x._2)

clientRdd.cache()

I removed the cache() call and still hit the problem:

org.jboss.netty.channel.socket.nio.AbstractNioSelector
WARNING: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
    at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
    at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
    at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
    at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
    at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:80)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

ERROR 2016-02-12 07:54:48 akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]

java.lang.OutOfMemoryError: GC overhead limit exceeded

How can I avoid this memory problem? I tried with 8 GB per core, and the table contains millions of records.

In this line, your variable name suggests you have an RDD, but in fact, because you are using collect(), it is not an RDD; as your type declaration shows, it is an Array:

val rowKeyRdd: Array[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress").collect()

This pulls all the data from the workers into the driver program, so the amount of memory on the workers (8 GB per core) is not the problem: there is not enough memory in the driver to hold the result of this collect().
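Raising the driver heap can postpone this error, but with millions of rows it is not a real fix. A minimal sketch for completeness (the app name is a placeholder; in client mode the setting must instead be passed as --driver-memory to spark-submit, because the driver JVM has already started by the time SparkConf is read):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder configuration; spark.driver.memory set here is only effective in cluster mode
val conf = new SparkConf()
  .setAppName("cassandra-client-events")
  .set("spark.driver.memory", "8g")
val sc = new SparkContext(conf)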

Since all you do with this data is map it and then re-parallelize it back to an RDD, you should instead map it without ever calling collect(). I haven't tried the code below since I don't have access to your data set, but it should be approximately correct:

val rowKeyRdd: RDD[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress")

val clientPartitionKeysRDD = rowKeyRdd.map(x => ClientPartitionKey(
  x.getString("customer_id"), x.getString("uniqueaddress")))

val clientRdd: RDD[CassandraRow] =
  clientPartitionKeysRDD.joinWithCassandraTable(keyspace, table)
    .where("eventtime >= ?", startDate)
    .where("eventtime <= ?", endDate)
    .map(x => x._2)

clientRdd.cache()
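If you later need the joined data, keep that work on the cluster too: aggregate it, take a bounded sample, or write the rows back to Cassandra rather than collecting them into a List. A rough sketch under that assumption (client_events_filtered is a placeholder destination table with matching columns):

import com.datastax.spark.connector._

// Bring back only a summary, not the rows themselves
val eventCount = clientRdd.count()

// Inspect a small, bounded sample on the driver
clientRdd.take(20).foreach(println)

// Or persist the filtered rows to another table instead of a driver-side List
clientRdd
  .map(row => (row.getString("customer_id"), row.getString("uniqueaddress")))
  .saveToCassandra(keyspace, "client_events_filtered", SomeColumns("customer_id", "uniqueaddress"))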
