[英]Spark Structured Streaming Blue/Green Deployments
We'd like to be able to deploy our Spark jobs such that there isn't any downtime in processing data during deployments (currently there's about a 2-3 minute window). 我们希望能够部署我们的Spark作业,以便在部署期间处理数据时没有任何停机时间(目前大约有2-3分钟的窗口)。 In my mind, the easiest way to do this is to simulate the "blue/green deployment" philosophy, which is to spin up the new version of the Spark job, let it warm up, then shut down the old job.
在我看来,最简单的方法是模拟“蓝/绿部署”理念,即启动新版本的Spark工作,让它热身,然后关闭旧工作。 However, with structured streaming & checkpointing, we cannot do this because the new Spark job sees that the latest checkpoint file already exists (from the old job).
但是,使用结构化流和检查点,我们无法做到这一点,因为新的Spark作业看到最新的检查点文件已经存在(来自旧作业)。 I've attached a sample error below.
我在下面附上了一个示例错误。 Does anyone have any thoughts on a potential workaround?
有没有人对潜在的解决方法有任何想法?
I thought about copying over the existing checkpoint directory to another checkpoint directory for the newly created job - while that should work as a workaround (some data might get reprocessed, but our DB should deduplicate), this seems super hacky and something I'd rather not pursue. 我想过将现有的检查点目录复制到新创建的作业的另一个检查点目录 - 虽然这应该作为一种解决方法(一些数据可能会被重新处理,但我们的数据库应该重复删除),这似乎超级hacky和我宁愿不追求。
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: rename destination /user/checkpoint/job/offsets/3472939 already exists
at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.validateOverwrite(FSDirRenameOp.java:520)
at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.unprotectedRenameTo(FSDirRenameOp.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.renameTo(FSDirRenameOp.java:282)
at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.renameToInt(FSDirRenameOp.java:247)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3677)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rename2(NameNodeRpcServer.java:914)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.rename2(ClientNamenodeProtocolServerSideTranslatorPB.java:587)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:1991)
at org.apache.hadoop.fs.Hdfs.renameInternal(Hdfs.java:335)
at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.rename(HDFSMetadataLog.scala:356)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatch(HDFSMetadataLog.scala:160)
... 20 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException): rename destination /user/checkpoint/job/offsets/3472939 already exists
It is possible, but it will add some complexity to your application. 这是可能的,但它会给您的应用程序增加一些复杂性。 Starting streams is in general fast, so it is fair to assume, that delay is caused by initialization of static objects and dependencies.
启动流通常很快,因此可以假设,延迟是由静态对象和依赖关系的初始化引起的。 In that case you'll need only
SparkContext
/ SparkSession
, and no streaming dependencies so process can be described as: 在这种情况下,您只需要
SparkContext
/ SparkSession
,并且没有流依赖性,因此进程可以描述为:
At the very high level, the happy path could be visualized as: 在非常高的层次上,幸福的道路可以被视为:
Since it is very generic pattern it could be implemented in a different ways, depending on a language and infrastructure: 由于它是非常通用的模式,它可以以不同的方式实现,具体取决于语言和基础结构:
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.