I wrote a script that extracts all the links from a given URL; I got the idea from an online video tutorial. It works fine when I try nytimes.com, but with yell.com an error is thrown: "Error: HTTP Error 416: Requested Range Not Satisfiable - http://www.yell.com/". What technique should I use to get around this?

import urllib.parse
import urllib.request
import urllib.error
from bs4 import BeautifulSoup

##url = "http://nytimes.com"
url = "http://www.yell.com/"

urls = [url]
visited = [url]

while len(urls) > 0:

    try:
        htmltext = urllib.request.urlopen(urls[0]).read()

        soup = BeautifulSoup(htmltext, "html.parser")

        urls.pop(0)
        print(len(urls))

        for tag in soup.findAll('a', href=True):
            tag['href'] = urllib.parse.urljoin(url, tag['href'])
            if url in tag['href'] and tag['href'] not in visited:
                urls.append(tag['href'])
                visited.append(tag['href'])

    except urllib.error.HTTPError as e:
        print("Error: " + str(e) + " - " + urls[0])
        urls.pop(0)  # drop the failing URL so the loop does not retry it forever

print(visited)
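A common cause of this 416 is the server rejecting the default Python User-Agent. One thing worth trying before anything heavier is sending a browser-like header via `urllib.request.Request`; the `make_request` helper below is hypothetical, not from the original post:

```python
import urllib.request

def make_request(url):
    # Hypothetical helper: wrap the URL in a Request carrying a
    # browser-like User-Agent, which some sites require before serving HTML.
    return urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
    )

# Usage in the loop above:
# htmltext = urllib.request.urlopen(make_request(urls[0])).read()
```

Whether this is enough depends on how aggressively the site blocks scrapers; if it still fails, the browser-driven approach in the accepted answer is the next step.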

===============>> Answer #1 (0 votes, accepted)

What is happening here is that yell.com is detecting unusual activity. If you try scraping with Selenium, it renders the page visually and loads its JavaScript:

import time

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait for the page to load

# At this point, the Firefox window that opened will show you the blocking message.

# If you manage to get past that block, you can load BeautifulSoup this way:
soup = BeautifulSoup(driver.page_source, "html.parser")
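Once Selenium has given you the page source, the same `urljoin`-based bookkeeping from the question applies; comparing hostnames is also a bit more robust than the substring check `url in tag['href']`. A stdlib-only sketch of that filtering step (the `same_site_links` helper is hypothetical, shown here for illustration):

```python
from urllib.parse import urljoin, urlparse

def same_site_links(base_url, hrefs):
    # Hypothetical helper: resolve each href against base_url and keep
    # only deduplicated links whose host matches the base URL's host.
    base_host = urlparse(base_url).netloc
    out = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_host and absolute not in out:
            out.append(absolute)
    return out

# Usage with the soup from above:
# hrefs = [tag['href'] for tag in soup.findAll('a', href=True)]
# urls.extend(same_site_links(url, hrefs))
```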

