
Spark Structured Streaming Memory Bound

I am processing a stream with an average load of 100 Mb/s. I have six executors, each with 12 GB of memory allocated. However, due to the data load, I am getting Out of Memory errors (error 52) in the Spark executors within a few minutes. It seems that even though a Spark DataFrame is conceptually unbounded, it is bounded by the total executor memory?

My idea here was to save the DataFrame/stream as Parquet roughly every five minutes. However, Spark does not seem to have a direct mechanism to purge the DataFrame after that?

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._  // enables the 300.seconds syntax

val out = df.
  writeStream.
  format("parquet").
  option("path", "/applications/data/parquet/customer").
  option("checkpointLocation", "/checkpoints/customer/checkpoint").
  trigger(Trigger.ProcessingTime(300.seconds)).  // emit a new batch roughly every five minutes
  outputMode(OutputMode.Append).
  start
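
For completeness, when this runs as a standalone application rather than in spark-shell, the driver is kept alive until the query stops (out here is the StreamingQuery returned by start above):

out.awaitTermination()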

It seems that there is no direct way to do this, as it conflicts with the general Spark model that operations must be rerunnable in case of failure.

However, I share the sentiment of the comment made at 08/Feb/18 13:21 on this issue.
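
As a side note, one way to keep each micro-batch within executor memory is to cap how much data a single trigger reads, rather than trying to purge the DataFrame afterwards. The sketch below is only an illustration and assumes the input is a Kafka source; the broker address, topic name, and record limit are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("customer-stream").getOrCreate()

// maxOffsetsPerTrigger caps the number of records read per micro-batch,
// so each batch stays bounded regardless of the overall stream volume.
val df = spark.readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "broker:9092").
  option("subscribe", "customer-events").
  option("maxOffsetsPerTrigger", "500000").
  load()

With an input cap like this, the five-minute Parquet sink only ever has to hold one bounded micro-batch at a time.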
