Spark job writing to parquet - has a container with physical memory that keeps increasing

I have a Spark Streaming application that reads from a Kafka topic and writes the data to HDFS in parquet format. I see that over a fairly short period of time the physical memory of the container keeps growing until it reaches the maximum size and the job fails with "Diagnostics: Container [pid=29328,containerID=container_e42_1512395822750_0026_02_000001] is running beyond physical memory limits. Current usage: 1.5 GB of 1.5 GB physical memory used; 2.3 GB of 3.1 GB virtual memory used. Killing container." The container that is being killed is the same one that runs the driver, so the application is killed as well.

When looking for this error I only saw solutions that increase the memory, but I think that will only postpone the problem. I want to understand why the memory keeps increasing if I don't keep anything in memory. I also saw that all the containers grow in memory, but they are just killed after a while (before reaching the maximum). I saw in some post that "Your job is writing out Parquet data, and Parquet buffers data in memory prior to writing it out to disk".
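One thing I considered trying (just a guess on my side based on that quote, I have not verified that it helps) is lowering Parquet's row-group size, since each open parquet writer buffers roughly one row group in memory before flushing it to HDFS:

// parquet.block.size is the standard parquet-hadoop property for the row-group
// size (default 128 MB); a smaller value means less data buffered per open writer.
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024) // 32 MB - placeholder value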

The code we are using (we also tried without the repartition - not sure that is needed):

val repartition = rdd.repartition(6)
val df: DataFrame = sqlContext.read.json(repartition)
df.write.mode(SaveMode.Append).parquet(dbLocation)
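
For completeness, this is roughly how that snippet sits inside the streaming job (a simplified sketch assuming the Kafka 0.10 direct-stream API; the topic name, Kafka parameters, batch interval and output path are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-to-parquet")
val ssc = new StreamingContext(conf, Seconds(30))
val sqlContext = SQLContext.getOrCreate(ssc.sparkContext)
val dbLocation = "hdfs:///data/parquet" // placeholder output path

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "parquet-writer",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // each record value is a JSON string; parse the whole micro-batch and append it as parquet
    val repartition = rdd.map(_.value).repartition(6)
    val df: DataFrame = sqlContext.read.json(repartition)
    df.write.mode(SaveMode.Append).parquet(dbLocation)
  }
}

ssc.start()
ssc.awaitTermination()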

Is there some way to just fix the increasing memory problem?

The created parquet files (screenshot)

The NodeManager logs that show the increase in memory (screenshots)

Assuming your application does nothing other than just writes, I suspect the root cause to be the size of the data being received in batches. It's possible that the data received in one of the batches is beyond the configured thresholds. Assuming the application is killed for this reason, the solution is to enable "back pressure". The solution is detailed in the post below.

Limit Kafka batches size when using Spark Streaming
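
Concretely, enabling back pressure (and, optionally, capping the per-partition ingestion rate) comes down to a couple of Spark configuration entries; a minimal sketch, where the rate value is only a placeholder to tune for your workload:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-to-parquet")
  // let Spark adapt the Kafka ingestion rate to how fast previous batches actually finished
  .set("spark.streaming.backpressure.enabled", "true")
  // optional hard cap: maximum records pulled per Kafka partition per second
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")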
