How do I make this web crawler run indefinitely?

Question

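This is the code I am trying to write: a web crawler that iterates over a list of links. The first link is the starting URL, links found on each page are appended to the list, and the for loop keeps walking the list. For some reason, the script keeps stopping once about 150 links have been appended and printed.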

import requests
from bs4 import BeautifulSoup
import urllib.request

links = ['http://example.com']
def spider(max_pages):
    page = 1
    number = 1
    while page <= max_pages:
        try:
            for LINK in links:
                url = LINK
                source_code = requests.get(url)
                plain_text = source_code.text
                soup = BeautifulSoup(plain_text, "html.parser")
                for link in soup.findAll("a"):
                    try:
                        href = link.get("href")
                        if href.startswith("http"):
                            if href not in links:
                                number += 1
                                links.append(href)
                                print("{}: {}".format(number, href))
                    except:
                        pass

        except Exception as e:
            print(e)

while True:
    spider(10000)

What do I need to do to make it run indefinitely?

Answer 1

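The error occurs when you hit an <a> element that has no href attribute. In that case link.get("href") returns None, so check that the link actually has an href before calling startswith on it.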
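The check described above can be sketched as a small helper. This is only an illustration: is_followable is a hypothetical name that does not appear in the question, and the candidates list stands in for the kinds of values link.get("href") can return.

```python
def is_followable(href):
    """Return True only for an href that exists and starts with "http".

    is_followable is a hypothetical helper name, not part of the question's code.
    """
    # Test for None first, so startswith is never called on a missing href.
    return href is not None and href.startswith("http")


# Assumed sample values of the kind link.get("href") can return;
# None stands for an <a> element without an href attribute.
candidates = [None, "http://example.com/a", "/relative/path",
              "https://example.com/b", "#top"]

followable = [h for h in candidates if is_followable(h)]
print(followable)
```

In the question's loop, the same condition would replace the bare href.startswith("http") test.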

Answer 2

Samir Chahine,

Your code fails because the href variable can be None after this line:

href = link.get("href")

So add another check here:

if href is not None and href.startswith("http"):

Please adapt this logic to your Python code.

Try debugging with print statements, like this:

href = link.get("href")
print("href: {}".format(href))  # format() also works when href is None
if href is not None and href.startswith("http"):
    print("Condition passed 1")
    if href not in links:
        print("Condition passed 2")
        number += 1
        links.append(href)
        print("{}: {}".format(number, href))
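Putting the None check into a loop that keeps going could look roughly like the sketch below. It is not taken from either answer. It swaps the question's growing list for a queue plus a set of seen URLs, and it takes the page fetching as a parameter. The names crawl, get_links, fake_site and fake_get_links are all hypothetical. With requests and BeautifulSoup, get_links would presumably return [a.get("href") for a in soup.find_all("a")].

```python
from collections import deque


def crawl(start_url, get_links, max_pages=None):
    """Visit pages breadth-first until the queue is empty.

    get_links is a hypothetical callable: url -> iterable of href values,
    which may include None. max_pages=None means no page limit.
    """
    seen = {start_url}          # every URL ever queued, for fast duplicate checks
    queue = deque([start_url])  # URLs still waiting to be visited
    visited = []
    while queue and (max_pages is None or len(visited) < max_pages):
        url = queue.popleft()
        visited.append(url)
        try:
            hrefs = get_links(url)
        except Exception as exc:
            # One failing page is skipped so that the crawl itself continues.
            print("skipping {}: {}".format(url, exc))
            continue
        for href in hrefs:
            if href is not None and href.startswith("http") and href not in seen:
                seen.add(href)
                queue.append(href)
    return visited


# Assumed stand-in for real pages, so the sketch runs without network access.
fake_site = {
    "http://a.example": ["http://b.example", None, "/about"],
    "http://b.example": ["http://a.example", "http://c.example"],
}


def fake_get_links(url):
    return fake_site[url]  # raises KeyError for pages not in fake_site


order = crawl("http://a.example", fake_get_links)
print(order)
```

The crawl ends only when no unvisited links remain, so on the real web it would keep running for as long as new links turn up. The except branch sits around a single page, so an error skips that page instead of unwinding the whole loop.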


Answer 1
Score 2, accepted, 2015-08-18 12:29:33

Answer 2
Score 0, 2015-08-18 12:41:47
