
How do I make this Web Crawler infinite?

This is the code I'm trying to write: a web crawler that loops through a list of links. The first link is the starting page, the links found on each site are appended to the list, and the for loop keeps going through the list. For some reason the script keeps stopping once around 150 links have been appended and printed.

import requests
from bs4 import BeautifulSoup
import urllib.request

links = ['http://example.com']
def spider(max_pages):
    page = 1
    number = 1
    while page <= max_pages:
        try:
            for LINK in links:
                url = LINK
                source_code = requests.get(url)
                plain_text = source_code.text
                soup = BeautifulSoup(plain_text, "html.parser")
                for link in soup.findAll("a"):
                    try:
                        href = link.get("href")
                        if href.startswith("http"):
                            if href not in links:
                                number += 1
                                links.append(href)
                                print("{}: {}".format(number, href))
                    except:
                        pass

        except Exception as e:
            print(e)

while True:
    spider(10000)

What do I do to make it infinite?

That error looks like it occurs when you find an <a> element that has no href attribute. You should check that the link actually has a href before you try to call startswith on it.
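
As a minimal sketch of that check, the test can be pulled into a small helper. The name is_http_link is hypothetical and not part of the original code; it only assumes that link.get("href") returns None when the attribute is missing, which is how BeautifulSoup's Tag.get behaves.

```python
def is_http_link(href):
    """Return True only for an href string that starts with "http".

    link.get("href") returns None when an <a> tag has no href attribute,
    so the None case is ruled out before str.startswith is called.
    """
    return href is not None and href.startswith("http")


# Values as they might come back from link.get("href").
for candidate in ["http://example.com/a", "/relative/path", None]:
    print(candidate, is_http_link(candidate))
```

Inside the crawler's inner loop this would be used as `if is_http_link(link.get("href")):`, which makes the bare try/except around that block unnecessary for this particular case.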

Samir Chahine,

Your code is failing because the href variable is None after

href = link.get("href")

so add another check there:

if href is not None and href.startswith("http"):

Adapt this check to fit your own code.

You can also try debugging with print statements, like this:



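# Goes inside the inner "for link in ..." loop of the question's code.
href = link.get("href")
# Passing href as a separate argument avoids a TypeError when it is None.
print("href:", href)
if href is not None and href.startswith("http"):
    print("Condition passed 1")
    if href not in links:
        print("Condition passed 2")
        number += 1
        links.append(href)
        print("{}: {}".format(number, href))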
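
Putting the None check together with the loop from the question, one possible shape for a crawler that keeps going as long as new links turn up is sketched below. This is not the asker's code: the names fetch_hrefs and crawl, the max_links parameter, and the 10-second timeout are all assumptions made for illustration. The page fetcher is passed in as an argument so the loop can be tried without network access.

```python
from collections import deque


def fetch_hrefs(url):
    """Hypothetical fetcher: return every href on the page (None if missing)."""
    # Imported here so the crawl loop below can be tried without these packages.
    import requests
    from bs4 import BeautifulSoup

    # The timeout is an assumed value; without one a slow host can stall the crawl.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]


def crawl(start_url, fetch=fetch_hrefs, max_links=None):
    """Visit pages breadth-first and return the discovered links in order."""
    seen = {start_url}          # set membership is faster than "in" on a list
    queue = deque([start_url])  # pages that still have to be fetched
    found = [start_url]
    while queue:
        url = queue.popleft()
        try:
            hrefs = fetch(url)
        except Exception as error:
            # One unreachable page should not end the whole crawl.
            print("skipping {}: {}".format(url, error))
            continue
        for href in hrefs:
            # Skip <a> tags without an href as well as relative links.
            if href is None or not href.startswith("http"):
                continue
            if href in seen:
                continue
            seen.add(href)
            queue.append(href)
            found.append(href)
            print("{}: {}".format(len(found), href))
            if max_links is not None and len(found) >= max_links:
                return found
    return found
```

Calling crawl("http://example.com") with max_links left as None runs until no unvisited links remain, which on the open web is effectively indefinitely. Each page is fetched once, unlike a loop that restarts from the top of the list after an exception.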
