Can we reliably keep an HTTP/S connection open for a long time?
My team maintains an application (written in Java) that processes long-running batch jobs. These jobs need to be run in a defined sequence. Hence, the application starts a socket server on a pre-defined port to accept job execution requests. It keeps the socket open until the job completes (with success or failure). This way the job scheduler knows when a job ends, and upon successful completion it triggers the next job in the pre-defined sequence. If the job fails, the scheduler sends out an alert.
This is a setup we have had for over a decade. We have some jobs that run for a few minutes and others that take a couple of hours (depending on the volume) to complete. The setup has worked without any issues.
Now we need to move this application to a container (RedHat OpenShift Container Platform), and the infra policy in place allows only the default HTTPS port to be exposed. The scheduler sits outside OCP and cannot access any port other than the default HTTPS port.
In theory, we could use HTTPS, set the client timeout to a very large duration, and try to mimic the current setup with the TCP socket. But would this setup be reliable enough, given that the HTTP protocol is designed to serve short-lived requests?
There isn't a reliable way to keep a connection alive for a long period over the internet, because of the nodes (routers, load balancers, proxies, NAT gateways, etc.) that may be sitting between your client and server. They might drop connections midway under load, some of them will happily ignore your HTTP keep-alive request, and others have an internal maximum connection duration that will kill long-running TCP connections. You may find it works for you today, but there is no guarantee it will work for you tomorrow.
So you'll probably need to submit the job as a short-lived request and check the status via other means:
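One such approach is submit-and-poll: the scheduler submits the job, gets an immediate response, and then asks for the status with a series of short requests. Below is a minimal sketch of the polling side only, using the JDK's java.net.http.HttpClient. The class name JobPoller, the GET {baseUrl}/jobs/{jobId} endpoint, and the plain-text status body (RUNNING, SUCCEEDED, FAILED) are all assumptions made for illustration, not something described in the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.function.Supplier;

final class JobPoller {
    /** Illustrative job states; the real application may use different ones. */
    public enum Status { RUNNING, SUCCEEDED, FAILED }

    /**
     * Calls fetchStatus until it reports a terminal state or maxAttempts is used up.
     * Every call is meant to be one short-lived request, so no connection has to
     * stay open for the duration of the job.
     */
    public static Status waitForCompletion(Supplier<Status> fetchStatus,
                                           Duration interval,
                                           int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Status status = fetchStatus.get();
            if (status != Status.RUNNING) {
                return status;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
        throw new IllegalStateException("Job still running after " + maxAttempts + " polls");
    }

    /** Status fetcher backed by GET {baseUrl}/jobs/{jobId} (hypothetical endpoint). */
    public static Supplier<Status> httpStatusFetcher(HttpClient client, String baseUrl, String jobId) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/jobs/" + jobId))
                .timeout(Duration.ofSeconds(30))   // per-request timeout, not per-job
                .GET()
                .build();
        return () -> {
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() != 200) {
                    throw new IllegalStateException("Unexpected HTTP status " + response.statusCode());
                }
                // Assumes the body is exactly one of the Status names.
                return Status.valueOf(response.body().trim());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        };
    }
}
```

The scheduler would call something like JobPoller.waitForCompletion(JobPoller.httpStatusFetcher(client, baseUrl, jobId), Duration.ofSeconds(30), 480), then trigger the next job on SUCCEEDED or raise an alert on FAILED. The interval and attempt count here are placeholders to tune against the longest expected job.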
When you worked with sockets and the TCP protocol, you were in control of how long to keep connections open. With HTTP you are only in control of logical connections and not physical ones. Actual connections are controlled by the OS, and usually IT people can configure all those timeouts. By default, even when you close a logical connection, the real connection is not closed, in anticipation of the next communication. It is closed by the OS and not controlled by your code. However, even if it closes and your next request comes after that, it is reopened transparently to you. So it doesn't really matter whether it closed or not; it should be transparent to your code. So in short, I assume that you can move to HTTP/HTTPS with no problems. But you will have to test and see.
Also, for other options for server-to-client communication, you can look at my answer to this question: How to continues send data from backend to frontend when something changes
IMHO, you should turn your scheduler into a REST API server. WebSocket isn't effective in this scenario, since the connection would be inactive most of the time.
The jobs can be short-lived or long-running. So, when a long-running job fails in the middle, how does the restart of the job happen? Does it start from the beginning again?
In a similar scenario, we had a database to keep track of the progress of the job (number of records successfully processed). So, the jobs can resume after a failure. With such a design, another web service can monitor the status of the job by looking at the database. So, the main process is not impacted by constant polling by the client.
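The checkpointing idea can be sketched roughly as follows. This is only an illustration: the class CheckpointedJob is hypothetical, and a ConcurrentHashMap stands in for the database table so that the example is self-contained. A real implementation would persist the count in the database instead.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class CheckpointedJob {
    // Stand-in for a database table: job id -> number of records processed so far.
    private final Map<String, Integer> checkpoints = new ConcurrentHashMap<>();

    /** Read-only progress lookup; a separate status service can call this without touching the worker. */
    public int processedCount(String jobId) {
        return checkpoints.getOrDefault(jobId, 0);
    }

    /** Processes records starting at the last checkpoint and records progress after each one. */
    public <T> void run(String jobId, List<T> records, Consumer<T> processor) {
        for (int i = processedCount(jobId); i < records.size(); i++) {
            processor.accept(records.get(i)); // if this throws, the checkpoint stays at i
            checkpoints.put(jobId, i + 1);
        }
    }
}
```

Calling run again with the same job id after a failure picks up at the first unprocessed record, and the status service only ever reads processedCount, so polling never reaches the job itself.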
We have had bad experiences with long-standing HTTP/HTTPS connections. We used to schedule short jobs (only a couple of minutes) via HTTP and wait for each to finish and send a response. This worked fine until the jobs got longer (hours) and some network infrastructure closed the inactive connections. We ended up only submitting the request via HTTP, getting an immediate response, and then implementing polling to wait for the result. At the time, the migration was pretty quick for us, but since then we have migrated even further to use "webhooks", e.g. allowing the processor of the job to signal its state back to the server using a known webhook address.
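A webhook notification of this kind might look roughly like the sketch below, again using the JDK's java.net.http.HttpClient. The class JobWebhookNotifier, the JSON field names, and the example URL are assumptions for illustration. The JSON is built with String.format, which assumes the job id and state contain no characters that need escaping.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class JobWebhookNotifier {
    private final HttpClient client = HttpClient.newHttpClient();
    private final URI webhookUri; // known address exposed by the scheduler side (hypothetical)

    public JobWebhookNotifier(URI webhookUri) {
        this.webhookUri = webhookUri;
    }

    /** Builds the short-lived POST that reports a job's state. */
    public HttpRequest buildNotification(String jobId, String state) {
        // Assumes jobId and state need no JSON escaping.
        String json = String.format("{\"jobId\":\"%s\",\"state\":\"%s\"}", jobId, state);
        return HttpRequest.newBuilder(webhookUri)
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    /** Sends the notification; returns true when the receiver answers with a 2xx status. */
    public boolean notifyState(String jobId, String state) throws IOException, InterruptedException {
        HttpResponse<Void> response =
                client.send(buildNotification(jobId, state), HttpResponse.BodyHandlers.discarding());
        return response.statusCode() / 100 == 2;
    }
}
```

The job processor would call notifyState(jobId, "SUCCEEDED") or notifyState(jobId, "FAILED") once when the job ends. Because a single notification can be lost, it seems prudent to retry on failure or to keep a status endpoint as a fallback.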