
Can't find proper class for webscraping on Etsy

I'm trying to scrape product information from Etsy, and I'm following a relatively simple tutorial.

Here is my current code:

# imports used below (urlopen aliased as ureq, BeautifulSoup as soup)
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}

#opening up connection, grabbing url 
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
uclient = ureq(url)
page_html = uclient.read()

#html parsing
page_soup = soup(page_html, 'lxml')
print(page_soup.p)

#grabs each product 
listings = page_soup.findAll("li", {"class":"wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder"})
len(listings)

The last step keeps outputting 0 for this particular class, so I don't know what I'm doing wrong. According to the inspected page source, this is the proper class name and CSS class type to use. Etsy inspect code here

Any help is greatly appreciated, thanks (-:

A quirk of bs4 (or maybe I don't fully understand it...), try this:

listings = page_soup.find_all("li", class_="wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder")
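
As far as I understand, bs4 treats a class string that contains spaces as an exact match against the whole class attribute, so keying on a single class with a CSS selector is usually less brittle. A minimal sketch, reusing page_soup from the question:

# match any <li> whose class list contains tab-reorder, ignoring the other classes
listings = page_soup.select("li.tab-reorder")
print(len(listings))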

I can get 65 items on the page using a simpler approach:

soup.find("div", {"class": "tab-reorder-container"}).find_all("li", {"class":"tab-reorder"})

First I use find() to get the region that contains all the items, and then I use find_all() to find only the li elements in that region.

import requests
from bs4 import BeautifulSoup as BS

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}

#opening up connection, grabbing url 
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')
print(soup.p)

#grabs each product 
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

But the problem is that this page uses JavaScript to add the items to the page. It finds 65 items, but most of them are empty, because BS can't run the JavaScript that adds all the values to the HTML.

It may need Selenium to control a real web browser, which can run JavaScript. Or it may be worth checking whether the data is somewhere in the JavaScript on the page, or whether the JavaScript reads the data from another URL - and then you could use that URL with requests.
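
A minimal sketch of the second idea, assuming the listing data is embedded in a <script type="application/ld+json"> block (hypothetical; whether Etsy actually exposes it there has to be verified in the page source):

import json
import requests
from bs4 import BeautifulSoup as BS

url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')

# look for structured data embedded in <script type="application/ld+json"> tags
for script in soup.find_all('script', {'type': 'application/ld+json'}):
    try:
        data = json.loads(script.string)
    except (TypeError, ValueError):
        continue
    # preview whatever the page embeds - the structure has to be inspected manually
    print(json.dumps(data, indent=2)[:500])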


EDIT:

A version which uses Selenium to load the page in Chrome/Firefox, close the popup window, scroll to the end of the page, and get the elements both with BeautifulSoup and without BeautifulSoup.

from bs4 import BeautifulSoup as BS
import selenium.webdriver
import time

#opening up connection, grabbing url 
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get(url)

time.sleep(3)
driver.find_element_by_xpath('//button[@data-gdpr-single-choice-accept]').click()

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(1.5)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print('--- version 1 - BeautifulSoup ---')

html = driver.page_source

soup = BS(html, 'lxml')
print(soup.p)

#grabs each product 
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

print('--- version 2 - Selenium ---')

#grabs each product 
listings = driver.find_elements_by_css_selector('div.tab-reorder-container li.tab-reorder')
print(len(listings))

for item in listings:
    item = item.find_element_by_css_selector('h3')
    if item:
        print(item.text.strip())
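
A side note: the find_element_by_* / find_elements_by_* helpers used above were deprecated in Selenium 4 and removed in later 4.x releases; on a current Selenium install the same lookups would use By locators, roughly:

from selenium.webdriver.common.by import By

# same lookups with the Selenium 4 locator API
driver.find_element(By.XPATH, '//button[@data-gdpr-single-choice-accept]').click()
listings = driver.find_elements(By.CSS_SELECTOR, 'div.tab-reorder-container li.tab-reorder')
for item in listings:
    print(item.find_element(By.CSS_SELECTOR, 'h3').text.strip())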
