
Web scraping from finishline.com using BS4 or Selenium

I'm trying to scrape data from https://www.finishline.com using either Selenium or BeautifulSoup 4. So far I have been unsuccessful, so I've turned to Stack Overflow for assistance, hoping that someone knows a way around their scraping protection.

I tried using both BeautifulSoup 4 and Selenium. Below are some simple examples.

General imports used in my main program:

import requests
import csv
import io
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup

BeautifulSoup 4 code:

data2 = requests.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004")
soup2 = BeautifulSoup(data2.text, 'html.parser')

x = soup2.find('h1', attrs={'id': 'title'}).text.strip()
print(x)

Selenium code:

options = Options()
options.headless = True
options.add_argument('log-level=3')
driver = webdriver.Chrome(options=options)
driver.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004") 
x = driver.find_element_by_xpath("//h1[1]")
print(x)
driver.close()

Both snippets attempt to get the product title from the product page.

The BeautifulSoup 4 snippet sometimes just hangs and does nothing, and other times it returns

requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')"))
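A hang like that usually means the request has no timeout; by default, requests will wait indefinitely for the server. Below is a minimal sketch that adds a timeout and automatic retries. This alone won't get past the scraping protection; it just fails fast instead of stalling:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = "https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004"

session = requests.Session()
# Retry up to 3 times on connection failures, backing off between attempts.
session.mount("https://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=1)))

try:
    data2 = session.get(url, timeout=10)  # give up after 10 seconds instead of hanging
    print(data2.status_code)
except requests.exceptions.RequestException as e:
    print("Request failed:", e)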

The Selenium snippet returns

<selenium.webdriver.remote.webelement.WebElement (session="b3707fb7d7b201e2fa30dabbedec32c5", element="0.10646785765405364-1")>

which means it did find the element, but when I try to convert it to text by changing

x = driver.find_element_by_xpath("//h1[1]")

to

x = driver.find_element_by_xpath("//h1[1]").text

it returns Access Denied, which is also what the site itself sometimes returns in the browser. It can be bypassed by clearing cookies.
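If the block can be detected in the response, that cookie-clearing workaround can be scripted as well. Here is a minimal, untested sketch of the idea; the check for "Access Denied" in page_source is an assumption about how the block shows up:

from selenium import webdriver

url = "https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004"

driver = webdriver.Chrome()
driver.get(url)

# Assumption: the block appears as "Access Denied" text in the served page.
# If so, clear cookies and reload once before reading the title.
if "Access Denied" in driver.page_source:
    driver.delete_all_cookies()
    driver.get(url)

print(driver.find_element_by_xpath("//h1[1]").text)
driver.quit()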

Does anyone know of a way to scrape data from this website? Thanks in advance.

Try it like this; for me it works and returns MEN'S NIKE AIR MAX 95 SE CASUAL SHOES:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004")
x = driver.find_element_by_xpath('//*[@id="title"]')
print(x.text)

The request was rejected by the server because of the user agent, so I added a User-Agent header to the request:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
data2 = requests.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004",headers=headers)
soup2 = BeautifulSoup(data2.text, 'html.parser')

x = soup2.find('h1', attrs={'id': 'title'}).text.strip()
print(x)

Output:

Men's Nike Air Max 95 SE Casual Shoes
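If more than one product page is needed, the same header can be attached to a requests.Session so it is sent on every request and the underlying connection is reused. A small sketch along those lines (the URL list here just repeats the one from the question):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# The session sends this header on every request it makes.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
})

urls = [
    "https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004",
]

for url in urls:
    soup = BeautifulSoup(session.get(url, timeout=10).text, 'html.parser')
    title = soup.find('h1', attrs={'id': 'title'})
    print(title.text.strip() if title else "title not found")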
