
Spark: PySpark slowness, memory issue when writing to Cassandra

I am using PySpark to aggregate and group a largish CSV on a low-end machine: 4 GB RAM and 2 CPU cores. This is done to check the memory limits of the prototype. After aggregation I need to store the RDD to Cassandra, which is running on another machine.

I am using the DataStax Cassandra Python driver. First I used rdd.toLocalIterator to iterate through the RDD and called the driver's synchronous API, session.execute. I managed to insert about 100,000 records in 5 minutes - very slow. Looking into this, I found, as explained here (python driver cpu bound), that when running the nload network monitor on the Cassandra node, the data put out by the Python driver flows at a very slow rate, causing the slowness.
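
(A minimal sketch of this synchronous approach, assuming the aggregated RDD of key/value pairs is called aggregated_rdd and using a hypothetical keyspace/table demo.agg_results with columns key and value:)

from cassandra.cluster import Cluster

cluster = Cluster(['cassandra-host'])   # Cassandra node address (placeholder)
session = cluster.connect('demo')       # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO agg_results (key, value) VALUES (?, ?)")

# Pull the aggregated RDD back to the driver one partition at a time
# and write each row with a blocking call - one round trip per row.
for key, value in aggregated_rdd.toLocalIterator():
    session.execute(insert, (key, value))

cluster.shutdown()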

So I tried session.execute_async, and I could see the network transfer running at a very high speed; insertion time was also very fast.
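
(The async variant would look roughly like the sketch below; execute_async returns a ResponseFuture immediately, so requests are handed off as fast as they can be serialized. If nothing waits on the futures while the loop runs, a large number of in-flight requests can accumulate in memory:)

futures = []
for key, value in aggregated_rdd.toLocalIterator():
    futures.append(session.execute_async(insert, (key, value)))

# Block at the end so the script does not finish before the writes do.
for f in futures:
    f.result()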

This would have been a happy story, but for the fact that, using session.execute_async, I am now running out of memory while inserting into a few more tables (with different primary keys).

Since rdd.toLocalIterator is said to need memory equal to one partition, I shifted the writes to the Spark workers using rdd.foreachPartition(x) (see the sketch below), but I am still running out of memory.
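
(A sketch of this worker-side write, again with the placeholder names from above. Each partition opens its own connection, since Cluster/Session objects cannot be shipped to the workers:)

def write_partition(rows):
    from cassandra.cluster import Cluster
    cluster = Cluster(['cassandra-host'])   # placeholder address
    session = cluster.connect('demo')       # hypothetical keyspace
    insert = session.prepare(
        "INSERT INTO agg_results (key, value) VALUES (?, ?)")
    futures = [session.execute_async(insert, (k, v)) for k, v in rows]
    for f in futures:                       # wait before closing the session
        f.result()
    cluster.shutdown()

aggregated_rdd.foreachPartition(write_partition)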

I suspect that it is not the RDD iteration that causes this, but rather the fast serialization of execute_async in the Python driver (which uses Cython)?

Of course I could move to a node with more RAM and try again, but it would be sweet to solve this problem on this node. Maybe I will also try multiprocessing next, but if there are better suggestions please reply.

The memory error I am getting is a JVM / OS out-of-memory error:

6/05/27 05:58:45 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 183 bytes
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fdea10cc000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ec2-user/hs_err_pid3208.log

I tried the execution on a machine with more RAM (16 GB). This time I am able to avert the out-of-memory scenario above.

However, this time I changed the insert a bit, to insert into multiple tables.

So even with session.execute_async I am finding that the Python driver is CPU bound (and, I guess due to the GIL, unable to make use of all CPU cores), and what goes out on the network is a trickle.

So I am not able to attain Case 2; planning to change to Scala now.

Case 1: Very little output to the network - Cassandra's write speed is fast, but there is nothing for it to write.


Case 2: The ideal case - inserts are IO bound: Cassandra writes very fast.

