How to break beautifulsoup process on AWS EC2 before it gets killed?
I want to run a python process that does web scraping with beautifulsoup 24/7 on some .org websites. It runs smoothly on the majority of websites; however, for some rare exceptions there is a spike in network traffic, as you can see from the image below, and the python process (not the instance) gets killed.
When scraping I avoid any .pdf or .jpg... so that the CPU usage never exceeds 15% (so far, no issues).
Ideally, I would need a script that restarts the process when it is killed, but I have been told this is not possible (unless some of you suggest otherwise, perhaps an AWS service that I can configure to restart a process when it is killed).
Is there a way to limit beautifulsoup's network usage and stop scraping that particular link before it reaches a network peak?
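One thing worth noting: BeautifulSoup itself performs no network I/O, it only parses HTML you have already downloaded, so any network cap has to live in the download step. Below is a minimal sketch using only the standard library's urllib; the names `fetch_limited` and `MAX_BYTES` are illustrative, and the 2 MB cap is an arbitrary assumption you would tune for your sites:

```python
from urllib.request import urlopen

MAX_BYTES = 2 * 1024 * 1024  # hypothetical cap: at most 2 MB per page

def fetch_limited(url, max_bytes=MAX_BYTES, timeout=10):
    """Download a page, but bail out before transferring more than max_bytes."""
    with urlopen(url, timeout=timeout) as resp:
        # Skip obviously oversized responses before reading the body at all.
        size = resp.headers.get("Content-Length")
        if size is not None and int(size) > max_bytes:
            return None
        # Read at most max_bytes + 1 bytes; a longer body is discarded.
        data = resp.read(max_bytes + 1)
        if len(data) > max_bytes:
            return None
        return data
```

In use, `html = fetch_limited("https://example.org/")` (placeholder URL) returns `None` for pages that exceed the cap, and otherwise returns bytes that can be handed to `BeautifulSoup(html, "html.parser")` as before, so one huge page can no longer drive the network spike.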
Ideally, I would need a script that makes the process restart when it is killed, but I have been told it is not possible
Not sure if this is true.
Simple example:
#!/bin/bash
# Check whether the scraper is still alive; restart it if not.
# (pgrep -f avoids the classic pitfall of "ps | grep" matching the grep itself.)
if ! pgrep -f "<your-process>" > /dev/null
then
    <start your script again>
else
    echo "running"
fi
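To run a check like this periodically, cron is the simplest option. A sketch of a crontab entry, assuming the script above is saved at the hypothetical path /home/ec2-user/watchdog.sh and made executable:

```
# Hypothetical crontab entry: run the watchdog every minute,
# appending its output to a log file.
* * * * * /home/ec2-user/watchdog.sh >> /home/ec2-user/watchdog.log 2>&1
```

This keeps the gap after a kill to at most one minute, without any always-on supervisor.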
A more sophisticated solution is to set up your process/script as a systemd service so that it automatically restarts your process when it crashes. See for example:
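A minimal unit-file sketch, assuming the scraper is a script at the hypothetical path /home/ec2-user/scraper.py run as the ec2-user account:

```ini
# /etc/systemd/system/scraper.service  (hypothetical paths and user)
[Unit]
Description=24/7 scraping job
After=network.target

[Service]
ExecStart=/usr/bin/python3 /home/ec2-user/scraper.py
Restart=always
RestartSec=10
User=ec2-user

[Install]
WantedBy=multi-user.target
```

After saving the file, `sudo systemctl daemon-reload` followed by `sudo systemctl enable --now scraper.service` starts the job; `Restart=always` also covers the case where the process is killed rather than exiting cleanly.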