The Python script is continuously running

I am trying to build a web crawler to extract all the links on a webpage. I have created two Python files (class: scanner.py and object: vulnerability-scanner.py). When I run the script, it keeps running without stopping. I am unable to find the error. Please help me solve this.

Here is my source code:

scanner.py

import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama

class Scanner:

    colorama.init()

    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def is_valid(self, url):
        parsed = urlparse(url)
        return bool(parsed.netloc) and bool(parsed.scheme)

    def get_all_website_links(self, url):

        GREEN = colorama.Fore.GREEN
        WHITE = colorama.Fore.WHITE
        RESET = colorama.Fore.RESET

        urls = set()
        internal_urls = set()
        external_urls = set()
        domain_name = urlparse(url).netloc
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        for a_tag in soup.findAll("a"):
            href = a_tag.attrs.get("href")
            if href == "" or href is None:
                continue
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path

            if not self.is_valid(href):
                continue
            if href in internal_urls:
                continue
            if domain_name not in href:
                if href not in external_urls:
                    print(f"{WHITE}[*] External link: {href}{RESET}")
                    external_urls.add(href)
                continue
            print(f"{GREEN}[*] Internal link: {href}{RESET}")
            urls.add(href)
            internal_urls.add(href)
        return urls

    def crawl(self, url):
        href_links = self.get_all_website_links(url)
        for link in href_links:
            print(link)
            self.crawl(link)

vulnerability-scanner.py

import argu

target_url = "https://hack.me/"
vul_scanner = argu.Scanner(target_url)
vul_scanner.crawl(target_url)

The following part is (almost) an infinite recursion:

for link in href_links:
    print(link)
    self.crawl(link)

I believe you added this with the idea of crawling the links found on each page, but you didn't add a stopping condition. (Currently, the only stopping condition is reaching a crawled page that yields no links at all.)
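To illustrate why the recursion does not end, here is a minimal sketch that uses no network access. `FAKE_SITE`, `naive_crawl`, and `recursion_blows_up` are hypothetical names invented for this example; the dictionary stands in for two pages that link to each other, which is the typical situation on a real site (most pages link back to the home page).

```python
# Hypothetical two-page site: each page links to the other.
FAKE_SITE = {
    "/home": ["/about"],
    "/about": ["/home"],
}

def naive_crawl(url, links=FAKE_SITE):
    # Same shape as the crawl method in the question:
    # follow every link, with no record of what was already visited.
    for link in links.get(url, []):
        naive_crawl(link, links)

def recursion_blows_up():
    # Returns True if the naive crawl exhausts Python's recursion limit.
    try:
        naive_crawl("/home")
    except RecursionError:
        return True
    return False
```

In this sketch the cycle is hit almost instantly, so Python raises `RecursionError` right away. In the real crawler every level of recursion waits for an HTTP request first, so the script appears to run forever long before that limit is reached.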

One stopping condition might be to set a predefined maximum number of levels to crawl.

Something like this in your __init__ function:

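def __init__(self, url):
    self.target_url = url
    self.target_links = []
    self.max_parse_levels = 5  # could also be passed in as a constructor argument
    self.cur_parse_levels = 0  # depth of the page currently being crawled
.
.
.

def crawl(self, url):
    # Stop descending once the maximum depth has been exceeded.
    if self.cur_parse_levels > self.max_parse_levels:
        return
    href_links = self.get_all_website_links(url)
    self.cur_parse_levels += 1  # entering the next level
    for link in href_links:
        print(link)
        self.crawl(link)
    self.cur_parse_levels -= 1  # back to the caller's level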
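A depth limit alone still lets the crawler fetch the same page many times. As a complementary sketch (a swapped-in technique, not part of the answer above), the crawl can also remember visited URLs and use a queue instead of recursion. The standalone `crawl` function and its `get_links` parameter are assumptions made for this example; `get_links` is any callable that returns the links of a page, such as `Scanner.get_all_website_links` from the question.

```python
from collections import deque

def crawl(start_url, get_links, max_depth=5):
    """Breadth-first crawl that visits each URL at most once.

    get_links is a hypothetical callable: url -> iterable of URLs.
    Returns the URLs in the order they were visited.
    """
    visited = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)  # mark before queuing to avoid duplicates
                queue.append((link, depth + 1))
    return order
```

With the class from the question, this could be called as `crawl(target_url, vul_scanner.get_all_website_links)`. Because every URL enters the queue at most once, the loop ends as soon as all reachable pages within `max_depth` have been processed.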
