How do I make this Web Crawler infinite?

This is the code I'm trying to write: a web crawler that loops through a list of links. The first link is the original one; the links found on each page are appended to the list, and the for loop keeps going through the list. For some reason the script keeps stopping once around 150 links have been appended and printed.

import requests
from bs4 import BeautifulSoup
import urllib.request

links = ['http://example.com']
def spider(max_pages):
    page = 1
    number = 1
    while page <= max_pages:
        try:
            for LINK in links:
                url = LINK
                source_code = requests.get(url)
                plain_text = source_code.text
                soup = BeautifulSoup(plain_text, "html.parser")
                for link in soup.findAll("a"):
                    try:
                        href = link.get("href")
                        if href.startswith("http"):
                            if href not in links:
                                number += 1
                                links.append(href)
                                print("{}: {}".format(number, href))
                    except:
                        pass

        except Exception as e:
            print(e)

while True:
    spider(10000)

What do I do to make it infinite?

That error looks like it occurs when you find an <a> element that has no href attribute. You should check that the link actually has an href before you try to call startswith on it.
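A minimal sketch of that check, pulled out into a helper so it can be tried on its own. The function name is_absolute_http_link is made up for illustration and is not part of the question's code; the "http" prefix is the one the question already uses.

```python
from typing import Optional


def is_absolute_http_link(href: Optional[str]) -> bool:
    """Return True only for href values that exist and start with "http"."""
    # link.get("href") returns None when the <a> tag has no href attribute,
    # so test for None before calling a string method on it.
    return href is not None and href.startswith("http")
```

Inside the inner loop this would be used as `if is_absolute_http_link(link.get("href")):`, which also makes the bare try/except around that block unnecessary for this particular failure.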

Samir Chahine, your code is failing because the href variable is None in

href = link.get("href")

so add another check there, such as:

if href is not None and href.startswith("http"):

Please adapt this logic to your Python code.

Try debugging with print statements, like:


href = link.get("href")
# Pass href as a separate argument: "href " + href raises TypeError when href is None
print("href:", href)
if href is not None and href.startswith("http"):
    print("Condition passed 1")
    if href not in links:
        print("Condition passed 2")
        number += 1
        links.append(href)
        print("{}: {}".format(number, href))
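Putting the None check together with the crawl loop, the following is only a sketch of one way to keep the crawler going, not the asker's code. It assumes a queue of pages still to visit and a set of URLs already seen, and it catches errors per page, because in the question the try wraps the whole for loop, so one failed request ends that pass over links. The names crawl and fetch_links are hypothetical; fetch_links stands in for the requests.get plus soup.findAll("a") step from the question and should return the raw href values, None included.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[Optional[str]]],
          max_pages: int = 1000) -> List[str]:
    """Breadth-first crawl; returns every URL discovered, in discovery order."""
    seen = {start_url}          # set lookup stays fast as the crawl grows
    queue = deque([start_url])  # pages still waiting to be fetched
    found = [start_url]
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        try:
            hrefs = list(fetch_links(url))
        except Exception as exc:
            # One bad page is reported and skipped; the crawl continues.
            print("skipping {}: {}".format(url, exc))
            continue
        for href in hrefs:
            # href is None for <a> tags without an href attribute.
            if href is not None and href.startswith("http") and href not in seen:
                seen.add(href)
                queue.append(href)
                found.append(href)
    return found
```

The crawl ends when the queue is empty or max_pages pages have been fetched, so raising max_pages lets it run for as long as new links keep turning up.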
