简体   繁体   English

如何在 AWS EC2 上的 beautifulsoup 进程被杀死之前破坏它?

[英]How to break beautifulsoup process on AWS EC2 before it gets killed?

I want to run a python process that involves web scraping using beautifulsoup 24/7 on some.org websites.我想在 some.org 网站上使用 beautifulsoup 24/7 运行涉及 web 抓取的 python 进程。 It runs smoothly on the majority of websites, however, for some rare exceptions, there is a spike in Network, as you can see from the image below.它在大多数网站上运行平稳,但是,对于一些罕见的例外,网络中存在峰值,如下图所示。 The python process (not the instance) gets killed. python 进程(不是实例)被杀死。

When scraping I avoid any.pdf or.jpg... so that the CPU usage never exceeds 15% (so far, no issues)抓取时我避免使用任何。pdf 或 .jpg... 这样 CPU 使用率永远不会超过 15%(到目前为止,没有问题)

在此处输入图像描述

Ideally, I would need a script that makes the process restart when is killed, but I have been told is not possible (unless some of you suggest me otherwise, perhaps an AWS service that I can configure to make a process restart when killed).理想情况下,我需要一个脚本来让进程在被杀死时重新启动,但我被告知这是不可能的(除非你们中的一些人建议我否则,也许我可以配置一个 AWS 服务以在被杀死时重新启动进程)。

Is there a way to limit the beautifoulsoup network capability and stop the scraping of that particular link before it reaches a network peak?有没有办法限制 beautifoulsoup 网络能力并在特定链接达到网络峰值之前停止抓取它?

Ideally, I would need a script that makes the process restart when is killed, but I have been told is not possible理想情况下,我需要一个脚本,使进程在被杀死时重新启动,但有人告诉我这是不可能的

Not sure if this is true.不确定这是否属实。

  1. The easiest example that proves that you can do that is to create a bash script that checks if your python script/process is running and if not, it starts it.证明您可以做到这一点的最简单示例是创建一个 bash 脚本,该脚本检查您的 python 脚本/进程是否正在运行,如果没有,它将启动它。 The script can be scheduled using CRON to check every X minutes/seconds whatever you want.可以使用 CRON 安排该脚本,以便每隔 X 分钟/秒检查一次您想要的任何内容。

Simple example:简单的例子:

#!/bin/bash
ps -eaf | grep <your-process>
if [ $? -eq 1 ]
then
<start your script again>
else
echo "running"
fi
  1. The more involved solution would be to setup your process/script as a systemd service so that it automatically restarts your process when it crashes.更复杂的解决方案是将您的进程/脚本设置为systemd服务,以便它在崩溃时自动重新启动您的进程。

See for example:参见例如:

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM