
Access denied while scraping a website with selenium in Python

Hi, I'm trying to extract information from Macy's website, specifically from this category: https://www.macys.com/shop/featured/women-handbags. But when I access a particular item page I get a blank page with the following message:

Access Denied You don't have permission to access "any of the items links listed on the above category link" on this server. Reference #18.14d6f7bd.1526927300.12232a22

I've also tried changing the user agent with Chrome options, but it didn't work.
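For reference, the user-agent attempt looked roughly like this (a sketch assuming Selenium 3-style ChromeOptions and the same chromedriver path; the user-agent string itself is just an example):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Pretend to be a regular desktop Chrome; the exact string is illustrative
options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36')
driver = webdriver.Chrome("/Users/rodrigopeniche/Downloads/chromedriver",
                          chrome_options=options)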

This is my code:

import sys
reload(sys)
sys.setdefaultencoding('utf8')  # Python 2 only: force UTF-8 as the default encoding
from selenium import webdriver 
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

url = 'https://www.macys.com/shop/featured/women-handbags'

def init_selenium():
    global driver
    driver = webdriver.Chrome("/Users/rodrigopeniche/Downloads/chromedriver")
    driver.get(url)

def find_page_items():
    items_elements = driver.find_elements_by_css_selector('li.productThumbnailItem')
    for index, element in enumerate(items_elements):
        # Re-find the elements on every pass: navigating away and back
        # makes the previously found references stale
        items_elements = driver.find_elements_by_css_selector('li.productThumbnailItem')
        item_link = items_elements[index].find_element_by_tag_name('a').get_attribute('href')
        driver.get(item_link)
        driver.back()


init_selenium()
find_page_items()

Any idea what's going on and what I can do to fix it?

It's not a selenium-oriented solution all the way through, but it works. You can give it a try.

from selenium import webdriver
import requests
from bs4 import BeautifulSoup

url = 'https://www.macys.com/shop/featured/women-handbags'

def find_page_items(driver, link):
    driver.get(link)
    # Use selenium only to render the category page and collect the product links
    item_links = [item.find_element_by_tag_name('a').get_attribute('href') for item in driver.find_elements_by_css_selector('li.productThumbnailItem')]
    for newlink in item_links:
        # Fetch each product page with requests; a browser-like User-Agent
        # header is what gets past the Access Denied response
        res = requests.get(newlink, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(res.text, "lxml")
        name = soup.select_one("h1[itemprop='name']").text.strip()
        print(name)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        find_page_items(driver, url)
    finally:
        driver.quit()

Output:

Mercer Medium Bonded-Leather Crossbody
Mercer Large Tote
Nolita Medium Satchel
Voyager Medium Multifunction Top-Zip Tote
Mercer Medium Crossbody
Kelsey Large Crossbody
Medium Mercer Gallery
Mercer Large Center Tote
Signature Raven Large Tote

However, if you stick with selenium, then you need to create a new instance of it every time you browse a new url; a better option may be to clear the cache.
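A minimal sketch of that selenium-only variant, assuming the same Selenium 3 API as above (collect the links first, then open a fresh browser per item so no session state carries over):

from selenium import webdriver

url = 'https://www.macys.com/shop/featured/women-handbags'

driver = webdriver.Chrome()
driver.get(url)
# Grab all product links up front, before leaving the category page
links = [item.find_element_by_tag_name('a').get_attribute('href')
         for item in driver.find_elements_by_css_selector('li.productThumbnailItem')]
driver.quit()

for link in links:
    # Fresh instance per item page: no cookies or cache from previous requests
    driver = webdriver.Chrome()
    driver.get(link)
    # ... scrape the item page here ...
    driver.quit()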
