Can't find proper class for webscraping on Etsy

I'm trying to scrape product information from Etsy, and am following a relatively simple tutorial to do so.

This is my current code:

from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}

#opening up connection, grabbing url 
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
uclient = ureq(url)
page_html = uclient.read()

#html parsing
page_soup = soup(page_html, 'lxml')
print(page_soup.p)

#grabs each product 
listings = page_soup.findAll("li", {"class":"wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder"})
len(listings)

The last step repeatedly outputs 0, specifically for this class, so I'm not sure what I'm doing wrong. Based on the inspect code, this is the appropriate class name & CSS class type to be using. Etsy Inspect Code here

Would really appreciate any help. Thanks (-:

An idiosyncrasy of bs4 (or maybe I don't fully understand it...), try this instead:

listings = page_soup.find_all("li", class_="wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder")
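
As a side note, when class_ is given a string with spaces, BeautifulSoup tries to match the complete class attribute exactly, so any change in the order or set of Etsy's utility classes breaks the match; selecting on one stable class tends to be more robust. A minimal sketch using a CSS selector, assuming the tab-reorder class from the markup above is present on each listing:

listings = page_soup.select('li.tab-reorder')
len(listings)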

I can get 65 items on the page using something simpler:

soup.find("div", {"class": "tab-reorder-container"}).find_all("li", {"class":"tab-reorder"})

First I use find() to get the region with all the items, and later I use find_all() to find only the li elements in this region.

import requests
from bs4 import BeautifulSoup as BS

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}

#opening up connection, grabbing url 
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')
print(soup.p)

#grabs each product 
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))

for item in listings:
    item = item.find('h3')  # product title sits in a <h3> tag; find() returns None if it is missing
    if item:
        print(item.get_text(strip=True))

But the problem is that this page uses JavaScript to add items to the page, so it finds 65 items but most of them are empty, because BS can't run JavaScript to add all the values to the HTML.

It may need Selenium to control a real web browser which can run JavaScript. Or it may be worth checking if the data is somewhere in the JavaScript on the page, or if the JavaScript reads the data from some other URL - and then you can use that URL with requests.
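
For the second approach, here is a minimal sketch of what checking for embedded data could look like - the <script type="application/ld+json"> tag is only an assumption about where such data might live, not something confirmed for this page:

import json
import requests
from bs4 import BeautifulSoup as BS

url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')

# hypothetical: many pages embed structured data as JSON-LD in <script> tags
for script in soup.find_all('script', {'type': 'application/ld+json'}):
    if script.string:
        data = json.loads(script.string)
        print(data)

If the data turns out to be there, requests alone is enough and Selenium can be skipped.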


EDIT:

Version which uses Selenium to load the page in Chrome/Firefox, close the popup window, scroll to the end of the page, and get the elements with BeautifulSoup and without BeautifulSoup:

from bs4 import BeautifulSoup as BS
import selenium.webdriver
from selenium.webdriver.common.by import By
import time

#opening up connection, grabbing url 
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get(url)

time.sleep(3)
driver.find_element(By.XPATH, '//button[@data-gdpr-single-choice-accept]').click()

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(1.5)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print('--- version 1 - BeautifulSoup ---')

html = driver.page_source

soup = BS(html, 'lxml')
print(soup.p)

#grabs each product 
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

print('--- version 2 - Selenium ---')

#grabs each product 
listings = driver.find_elements(By.CSS_SELECTOR, 'div.tab-reorder-container li.tab-reorder')
print(len(listings))

for item in listings:
    # find_element() raises NoSuchElementException when <h3> is missing,
    # so use find_elements() and check the possibly-empty list instead
    found = item.find_elements(By.CSS_SELECTOR, 'h3')
    if found:
        print(found[0].text.strip())
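
As a usage note, the fixed time.sleep(3) before clicking can be replaced with an explicit wait that polls until the button is actually clickable; a minimal sketch reusing the XPath from above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the GDPR button instead of sleeping a fixed time
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//button[@data-gdpr-single-choice-accept]'))
).click()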
