
Spark: PySpark slowness, memory issue when writing to Cassandra

I am using PySpark to aggregate and group a largish CSV on a low-end machine: 4 GB RAM and 2 CPU cores. This is done to check the memory limits of the prototype. After aggregation I need to store the RDD to Cassandra, which is running on another machine.

I am using the DataStax Cassandra Python driver. First I used rdd.toLocalIterator to iterate through the RDD and called the driver's synchronous API, session.execute. I managed to insert about 100,000 records in 5 minutes - very slow. Looking into this, I found, as explained here (python driver cpu bound), that when running the nload network monitor on the Cassandra node, the data put out by the Python driver flows at a very slow rate, causing the slowness.
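
(A minimal sketch of this synchronous approach, assuming the aggregated RDD of key/value pairs is called aggregated_rdd and using a hypothetical keyspace/table demo.agg_results with columns key and value:)

from cassandra.cluster import Cluster

cluster = Cluster(['cassandra-host'])   # Cassandra node address (placeholder)
session = cluster.connect('demo')       # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO agg_results (key, value) VALUES (?, ?)")

# Pull the aggregated RDD back to the driver one partition at a time
# and write each row with a blocking call - one round trip per row.
for key, value in aggregated_rdd.toLocalIterator():
    session.execute(insert, (key, value))

cluster.shutdown()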

So I tried session.execute_async, and I could see the network transfer running at a very high speed; insertion time was also very fast.
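
(The async variant would look roughly like the sketch below; execute_async returns a ResponseFuture immediately, so requests are handed off as fast as they can be serialized. If nothing waits on the futures while the loop runs, a large number of in-flight requests can accumulate in memory:)

futures = []
for key, value in aggregated_rdd.toLocalIterator():
    futures.append(session.execute_async(insert, (key, value)))

# Block at the end so the script does not finish before the writes do.
for f in futures:
    f.result()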

This would have been a happy story, but for the fact that, using session.execute_async, I am now running out of memory while inserting into a few more tables (with different primary keys).

Since rdd.toLocalIterator is said to need memory equal to one partition, I shifted the writes to the Spark workers using rdd.foreachPartition(x) (see the sketch below), but I am still running out of memory.
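
(A sketch of this worker-side write, again with the placeholder names from above. Each partition opens its own connection, since Cluster/Session objects cannot be shipped to the workers:)

def write_partition(rows):
    from cassandra.cluster import Cluster
    cluster = Cluster(['cassandra-host'])   # placeholder address
    session = cluster.connect('demo')       # hypothetical keyspace
    insert = session.prepare(
        "INSERT INTO agg_results (key, value) VALUES (?, ?)")
    futures = [session.execute_async(insert, (k, v)) for k, v in rows]
    for f in futures:                       # wait before closing the session
        f.result()
    cluster.shutdown()

aggregated_rdd.foreachPartition(write_partition)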

I suspect that it is not the RDD iteration that causes this, but rather the fast serialization of execute_async in the Python driver (which uses Cython)?

Of course I could move to a node with more RAM and try again, but it would be sweet to solve this problem on this node. Maybe I will also try multiprocessing next, but if there are better suggestions please reply.

The memory error I am getting is a JVM / OS out-of-memory error:

6/05/27 05:58:45 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 183 bytes
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fdea10cc000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ec2-user/hs_err_pid3208.log

I tried the execution on a machine with more RAM (16 GB). This time I am able to avert the out-of-memory scenario above.

However, this time I changed the insert a bit, to insert into multiple tables.

So even with session.execute_async I am finding that the Python driver is CPU bound (and, I guess due to the GIL, unable to make use of all CPU cores), and what goes out on the network is a trickle.

So I am not able to attain Case 2; planning to change to Scala now.

Case 1: Very little output to the network - Cassandra's write speed is fast, but there is nothing for it to write.


Case 2: The ideal case - inserts are IO bound: Cassandra writes very fast.

