My Python code consumes all RAM/CPU resources and makes the server unreachable

Question

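There is a website, and I want to listen for changes on each of its pages and write the new values to MongoDB. I wrote a Python program that uses the multiprocessing module, but it consumes all of my resources and makes my server unreachable. Please tell me what is wrong with it and whether a better solution exists (I am considering Apache Spark Streaming or Kafka Connect to stream the updates from each link).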

Update: the problem is that I want to listen to 600 web links for changes and update the corresponding values in MongoDB.

My code is as follows:

import multiprocessing
from pymongo import MongoClient
from bs4 import BeautifulSoup
import sys
import requests
import time
import re

def worker(num,company_id):
    while True:
        url_home = "http://www.example.com/lastinfo?i={}".format(company_id)
        while True:
            try:
                b = requests.get(url_home,timeout=2.5)
            except:
                time.sleep(2)
            else:
                if "Server" not in b.text and "The service is unavailable." not in b.text:
                    break
                else:
                    time.sleep(2)

        company_document_count = re.findall(r"docCount=(.*),", b.text)[0].split(',')[0]
        print('Worker:',num)
        print("Company ID: "+company_id)
        print("Company Document Count: "+str(company_document_count)+"\n")
        client = MongoClient(host='x', port=x,username="x",password="x")
        db = client['mydb']
        mycollection = db['mycollection']
        last = mycollection.find_one({"company_id": company_id})["info"][0]["document_count"]
        mycollection.update_one({"company_id": company_id,"info.document_count":last}, {"$set": {"info.$":{"document_count":company_document_count}}})
        client.close()
        time.sleep(0.1)

if __name__ == '__main__':
    try:
        ids = []
        jobs = []
        url = "http://www.example.com/allcompanyIds.aspx"
        while True:
            try:
                r = requests.get(url,timeout=2.5)
            except:
                time.sleep(2)
            else:
                break
        ids = set(re.findall(r"\d{15,20}", r.text))

        for index,i in enumerate(ids):
            p = multiprocessing.Process(target=worker, args=(index,i,))
            jobs.append(p)
            p.start()
    except KeyboardInterrupt:
        print('\nExiting by user request.\n')
        sys.exit(0)

Answer 1

I suspect the problem is here:

while True:
    try:
        b = requests.get(url_home,timeout=2.5)
    except:
        time.sleep(2)
    else:
        if "Server" not in b.text and "The service is unavailable." not in b.text:
            break
        else:
            time.sleep(2)

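If requests.get() works properly, the worker sends this request over and over with almost no pause (only the 0.1-second sleep at the end of the loop). To stop it from eating so many resources, include a sleep (which also stops you from effectively DDoSing whatever URL you are accessing), as shown below: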

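while True:
    try:
        b = requests.get(url_home,timeout=2.5)
    except requests.RequestException:
        # Request failed or timed out: wait, then retry.
        time.sleep(2)
    else:
        if "Server" not in b.text and "The service is unavailable." not in b.text:
            break
        else:
            time.sleep(2)
# Pause before moving on; placed after the retry loop so it also runs when the request succeeds.
time.sleep(10)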

Of course, depending on how often you need this information, you may want to change the sleep duration.
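
To make the point about tuning the sleep concrete, here is a minimal sketch of a single worker whose polling interval is a parameter. The names watch, fetch, on_change, interval and max_polls are hypothetical and do not appear in the question; fetch stands in for the requests.get() plus regex step, and on_change stands in for the MongoDB update. It also assumes you only want to write to the database when the value has actually changed.

```python
import time

def watch(company_id, fetch, on_change, interval=10.0, max_polls=None, sleep=time.sleep):
    """Poll one company and call on_change only when the value differs from the last one seen.

    fetch and on_change are hypothetical callables supplied by the caller.
    max_polls=None keeps polling until interrupted; an integer stops after that many polls.
    """
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        value = fetch(company_id)
        if value != last:
            # Only touch the database when something actually changed.
            on_change(company_id, value)
            last = value
        polls += 1
        # Pause after every poll so the loop never spins at full speed.
        sleep(interval)
```

With interval=10.0 each link is requested roughly six times per minute; raising the value lowers the load on both your server and the site being polled.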

PS. In my experience, while True should generally be avoided: it often causes programs to get stuck, and I think it makes the code harder to read.
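
As one possible alternative to while True, the retry loop could be bounded so that it gives up after a fixed number of attempts. This is only a sketch under assumptions: fetch_with_retries, get, max_attempts and delay are hypothetical names, and get is assumed to be a callable that returns the response body as text (for example, a small wrapper around requests.get(url, timeout=2.5).text).

```python
import time

def fetch_with_retries(get, url, max_attempts=5, delay=2.0, sleep=time.sleep):
    """Call get(url) up to max_attempts times; return the body text, or None if every attempt fails."""
    # Same error markers the question's code checks for in the response body.
    error_markers = ("Server", "The service is unavailable.")
    for attempt in range(max_attempts):
        try:
            text = get(url)
        except Exception:
            # Broad catch for the sketch; with requests, requests.RequestException would be narrower.
            text = None
        if text is not None and not any(marker in text for marker in error_markers):
            return text
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(delay)
    return None
```

The caller then has to handle a None result explicitly, which makes the failure case visible instead of leaving the worker stuck in the loop.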
