This is the code I'm trying to write: a web crawler that loops through a list of links, where the first link is the seed and the links found on each page are appended to the list, so the for loop keeps working through it. For some reason the script stops after roughly 150 links have been appended and printed.
import requests
from bs4 import BeautifulSoup
import urllib.request

links = ['http://example.com']

def spider(max_pages):
    page = 1
    number = 1
    while page <= max_pages:
        try:
            for LINK in links:
                url = LINK
                source_code = requests.get(url)
                plain_text = source_code.text
                soup = BeautifulSoup(plain_text, "html.parser")
                for link in soup.findAll("a"):
                    try:
                        href = link.get("href")
                        if href.startswith("http"):
                            if href not in links:
                                number += 1
                                links.append(href)
                                print("{}: {}".format(number, href))
                    except:
                        pass
        except Exception as e:
            print(e)

while True:
    spider(10000)
What do I do to make it infinite?
That error looks like it occurs when you find an <a> element that has no href attribute. link.get("href") returns None in that case, so you should check that the link actually has an href before you try to call startswith on it.
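A minimal sketch of that guard, using a small hard-coded HTML snippet (my own example, not from the question) in place of a fetched page:

```python
from bs4 import BeautifulSoup

# One anchor without an href, one with an absolute URL
html = '<a>no href</a><a href="http://example.com/a">ok</a>'
soup = BeautifulSoup(html, "html.parser")

links = []
for link in soup.find_all("a"):
    href = link.get("href")  # None when the <a> has no href attribute
    if href is not None and href.startswith("http"):
        if href not in links:
            links.append(href)

print(links)  # only the anchor that actually has an http href survives
```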
Samir Chahine,
Your code is failing because the href variable is None in

href = link.get("href")

so put another check there:

if href is not None and href.startswith("http://"):

Converting that logic into your Python code, try debugging with print statements like:
href = link.get("href")
print("href", href)  # may print None for anchors without an href
if href is not None and href.startswith("http"):
    print("Condition passed 1")
    if href not in links:
        print("Condition passed 2")
        number += 1
        links.append(href)
        print("{}: {}".format(number, href))
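As an alternative (my suggestion, not part of the answers above), BeautifulSoup can filter out anchors without an href up front by passing href=True to find_all, which makes the None check unnecessary. A small self-contained sketch with hard-coded HTML:

```python
from bs4 import BeautifulSoup

html = ('<a>no href</a>'
        '<a href="/relative">rel</a>'
        '<a href="http://example.com/a">abs</a>')
soup = BeautifulSoup(html, "html.parser")

links = []
# href=True matches only <a> tags that actually carry an href attribute
for link in soup.find_all("a", href=True):
    href = link["href"]  # safe: every match is guaranteed to have it
    if href.startswith("http") and href not in links:
        links.append(href)

print(links)  # relative and missing hrefs are filtered out
```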