Web scraping finishline.com with BS4 or Selenium

Question

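I am trying to scrape data from https://www.finishline.com using either Selenium or Beautiful Soup 4. So far I have had no success, so I am turning to Stack Overflow for help, hoping someone knows how to get around their scraping protection.

I tried both Beautiful Soup 4 and Selenium. Some simple examples follow.

General imports used in my main program: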

import requests
import csv
import io
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup

Beautiful Soup 4 code:

data2 = requests.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004")
soup2 = BeautifulSoup(data2.text, 'html.parser')

x = soup2.find('h1', attrs={'id': 'title'}).text.strip()
print(x)

Selenium code:

options = Options()
options.headless = True
options.add_argument('log-level=3')
driver = webdriver.Chrome(options=options)
driver.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004") 
x = driver.find_element_by_xpath("//h1[1]")
print(x)
driver.close()

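Both snippets attempt to get the product title from the product page.

The Beautiful Soup 4 snippet sometimes just hangs and does nothing, and other times it returns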

requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')"))

The Selenium snippet returns

<selenium.webdriver.remote.webelement.WebElement (session="b3707fb7d7b201e2fa30dabbedec32c5", element="0.10646785765405364-1")>

which means it does find the element, but when I try to convert it to text by changing

x = driver.find_element_by_xpath("//h1[1]")

to

x = driver.find_element_by_xpath("//h1[1]").text

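it returns Access Denied, which is also what the website sometimes returns in a browser. In the browser, it can be bypassed by clearing cookies.

Does anyone know a way to scrape data from this website? Thanks in advance.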

Answer 1

Try this. For me, it returns MEN'S NIKE AIR MAX 95 SE CASUAL SHOES:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start Chrome with default settings (a visible window, no headless option).
driver = webdriver.Chrome()
driver.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004")
# find_element_by_xpath was removed in Selenium 4; use find_element with a By locator.
x = driver.find_element(By.XPATH, '//*[@id="title"]')
print(x.text)
driver.quit()

Answer 2

The server rejected the request because of the user agent, so I added a User-Agent header to the request.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
data2 = requests.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004", headers=headers)
soup2 = BeautifulSoup(data2.text, 'html.parser')

x = soup2.find('h1', attrs={'id': 'title'}).text.strip()
print(x)

Output:

Men's Nike Air Max 95 SE Casual Shoes
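
The question also reports that the request sometimes hangs or ends in a ConnectionError, and `soup.find(...)` returns `None` when the page served is not the product page (for example, an Access Denied page), which makes `.text` raise an AttributeError. A hedged sketch that adds a timeout and guards both cases is shown below. The function names `extract_title` and `fetch_title` and the 10-second default are assumptions chosen for illustration; the User-Agent string is the one from this answer.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    )
}


def extract_title(html):
    """Return the stripped text of <h1 id="title">, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", attrs={"id": "title"})
    return heading.get_text(strip=True) if heading else None


def fetch_title(url, timeout=10):
    """Download the page with a browser-like User-Agent and extract the title.

    Returns None on timeouts, connection errors, or HTTP error statuses.
    """
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return extract_title(response.text)
```

Calling `fetch_title(url)` with the product URL above would then yield either the title string or `None`. Keeping the parsing in `extract_title` separate from the download makes it possible to check the selector against saved HTML without hitting the site.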

2 answers

Answer 1
1 2019-04-12 13:08:56

Answer 2
1 Accepted 2019-04-12 13:13:26
