Can't find proper class for webscraping on Etsy
I'm trying to scrape product information from Etsy, following a relatively simple tutorial.
Here is my current code:
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
uclient = ureq(url)
page_html = uclient.read()
#html parsing
page_soup = soup(page_html, 'lxml')
print(page_soup.p)
#grabs each product
listings = page_soup.findAll("li", {"class":"wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder"})
len(listings)
The last step keeps outputting 0 for this specific class, so I don't know what I'm doing wrong. Based on inspecting the page, this is the appropriate class name and CSS class type (Etsy inspect screenshot here).
Any help is greatly appreciated. Thanks (-:
This may be a quirk of bs4 (or maybe I don't fully understand it...); try this:
listings = page_soup.find_all("li", class_="wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder")
I can get 65 items on the page in a simpler way:
soup.find("div", {"class": "tab-reorder-container"}).find_all("li", {"class":"tab-reorder"})
First I use find() to get the region that contains all the items, and then I use find_all() to find only the li elements inside that region.
import requests
from bs4 import BeautifulSoup as BS
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')
print(soup.p)
#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))
for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))
But the problem is that this page uses JavaScript to add items to the page. It finds 65 items, but most of them are empty, because BS can't run the JavaScript that adds all the values to the HTML.
It may need Selenium to control a real web browser, which can run JavaScript. Or it may need checking whether the data is somewhere in the JavaScript on the page, or whether the JavaScript reads the data from another URL; then you could use that URL with requests.
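For the "data is already in the page" case, a minimal sketch of pulling structured data out of an embedded JSON-LD script tag, the way many shop pages expose listings without needing JavaScript to run. The HTML snippet below is a made-up stand-in for a page downloaded with requests; the real tag name and JSON shape would have to be confirmed in the actual page source:

```python
import json
from bs4 import BeautifulSoup

# Stand-in for r.text from a requests.get() call; the JSON-LD content
# here is hypothetical, not Etsy's real payload.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"name": "Lavender Soap", "offers": {"price": "7.50"}},
  {"name": "Green Tea Soap", "offers": {"price": "6.00"}}
]}
</script>
</head><body></body></html>
"""

page = BeautifulSoup(html, 'html.parser')

# JSON-LD is plain text inside the <script> tag, so no JavaScript engine
# is needed - just parse the string with the json module.
tag = page.find('script', type='application/ld+json')
data = json.loads(tag.string)

for item in data['itemListElement']:
    print(item['name'], item['offers']['price'])
```

If the data isn't embedded in the page at all, the browser's network tab will usually show the XHR URL that the JavaScript fetches it from instead.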
EDIT:
Here is a version which uses Selenium to load the page in Chrome/Firefox, close the popup window, scroll to the end of the page, and get the elements both with BeautifulSoup and without BeautifulSoup.
from bs4 import BeautifulSoup as BS
import selenium.webdriver
from selenium.webdriver.common.by import By
import time

#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get(url)
time.sleep(3)

# close the GDPR consent popup
driver.find_element(By.XPATH, '//button[@data-gdpr-single-choice-accept]').click()

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(1.5)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print('--- version 1 - BeautifulSoup ---')

html = driver.page_source
soup = BS(html, 'lxml')

print(soup.p)

#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})

print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

print('--- version 2 - Selenium ---')

#grabs each product
listings = driver.find_elements(By.CSS_SELECTOR, 'div.tab-reorder-container li.tab-reorder')

print(len(listings))

for item in listings:
    # find_elements returns an empty list (instead of raising) when <h3> is missing
    h3 = item.find_elements(By.CSS_SELECTOR, 'h3')
    if h3:
        print(h3[0].text.strip())