
Web scraping from finishline.com using BS4 or Selenium

I'm trying to scrape data from https://www.finishline.com using either Selenium or Beautiful Soup 4. So far I have been unsuccessful, so I'm turning to Stack Overflow for help, hoping that someone knows a way around the site's scraping protection.

I tried using Beautifulsoup 4 and Selenium. Below are some simple examples.

General imports used in my main program:

import requests
import csv
import io
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup

Beautifulsoup 4 code:

data2 = requests.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004")
soup2 = BeautifulSoup(data2.text, 'html.parser')

x = soup2.find('h1', attrs={'id': 'title'}).text.strip()
print(x)

Selenium code:

options = Options()
options.headless = True
options.add_argument('log-level=3')
driver = webdriver.Chrome(options=options)
driver.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004") 
x = driver.find_element_by_xpath("//h1[1]")
print(x)
driver.close()

Both of those snippets are attempts at getting the product title from the product page.

The Beautifulsoup 4 snippet sometimes just gets stuck and does nothing, and other times it returns

requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')"))

The Selenium snippet returns

<selenium.webdriver.remote.webelement.WebElement (session="b3707fb7d7b201e2fa30dabbedec32c5", element="0.10646785765405364-1")>

which means it did find the element, but when I try to convert it to text by changing

x = driver.find_element_by_xpath("//h1[1]")

to

x = driver.find_element_by_xpath("//h1[1]").text

it returns Access Denied, which is also what the site itself sometimes shows in a regular browser. In the browser, this can be bypassed by clearing cookies.

Does anyone know of a way to scrape data from this website? Thanks in advance.

Try it like this. It works for me and returns MEN'S NIKE AIR MAX 95 SE CASUAL SHOES

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a regular (non-headless) Chrome session
driver = webdriver.Chrome()
driver.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004")
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium 4 releases removed
x = driver.find_element(By.XPATH, '//*[@id="title"]')
print(x.text)
driver.quit()

The server rejects the request because of its User-Agent, so I added a browser User-Agent header to the request.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
data2 = requests.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004", headers=headers)
soup2 = BeautifulSoup(data2.text, 'html.parser')

x = soup2.find('h1', attrs={'id': 'title'}).text.strip()
print(x)

Output:

Men's Nike Air Max 95 SE Casual Shoes
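
Two failure modes from the question are worth guarding against here. Without a timeout, requests.get can wait indefinitely, and if the site serves a block page instead of the product page, soup2.find(...) returns None and the .text access raises AttributeError. Below is a minimal sketch that handles both. The helper names fetch_html and extract_title and the 10-second timeout are illustrative assumptions, not part of the original answer.

```python
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent taken from the answer above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    )
}


def fetch_html(url, timeout=10):
    """Download a page, failing after `timeout` seconds instead of hanging."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise on 403 and other HTTP error statuses
    return response.text


def extract_title(html):
    """Return the text of <h1 id="title">, or None if the element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", attrs={"id": "title"})
    return heading.get_text(strip=True) if heading is not None else None
```

To use it, call extract_title(fetch_html(url)) inside a try/except for requests.exceptions.RequestException, which covers connection errors, timeouts and HTTP error statuses. A None result suggests the response was not the product page.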


 