
Can't find proper class for webscraping on Etsy

I'm trying to scrape product information from Etsy, and am following a relatively simple tutorial to do so.

This is my current code:

from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}

#opening up connection, grabbing url 
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
uclient = ureq(url)
page_html = uclient.read()

#html parsing
page_soup = soup(page_html, 'lxml')
print(page_soup.p)

#grabs each product 
listings = page_soup.findAll("li", {"class":"wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder"})
len(listings)

The last step repeatedly outputs 0 for this class, so I'm not sure what I'm doing wrong. Based on the browser's inspect view, this is the correct class name and CSS class type to use.

Would really appreciate any help: Thanks (-:

This seems to be an idiosyncrasy of bs4 (or maybe I don't fully understand it...); try this instead:

listings = page_soup.find_all("li", class_="wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder")
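Note that matching a long multi-class string is fragile either way: BeautifulSoup only matches it when the `class` attribute is exactly that string, in that order. A more robust approach (a sketch that isn't part of the original answer, using made-up markup) is a CSS selector via `select()`, which matches any element carrying all the listed classes regardless of order or extra classes:

```python
from bs4 import BeautifulSoup

# Minimal HTML standing in for Etsy's listing grid (hypothetical markup)
html = """
<ul>
  <li class="wt-list-unstyled wt-grid__item-xs-6 tab-reorder">Soap A</li>
  <li class="tab-reorder wt-list-unstyled extra-class">Soap B</li>
  <li class="unrelated">Not a listing</li>
</ul>
"""
page_soup = BeautifulSoup(html, "html.parser")

# Exact-string matching only hits elements whose class attribute is
# literally this string, in this order:
exact = page_soup.find_all("li", class_="wt-list-unstyled wt-grid__item-xs-6 tab-reorder")
print(len(exact))  # 1 -- only the first <li> matches the full string

# A CSS selector matches any element that has all the listed classes:
listings = page_soup.select("li.wt-list-unstyled.tab-reorder")
print(len(listings))  # 2
```

Picking one or two stable-looking classes for the selector also makes the scraper less likely to break when Etsy reorders or adds utility classes.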

I can get 65 items, the same as on the page, using a simpler selector:

soup.find("div", {"class": "tab-reorder-container"}).find_all("li", {"class":"tab-reorder"})

First I use find() to get the region that contains all the items, and then I use find_all() to find only the li elements inside this region.

import requests
from bs4 import BeautifulSoup as BS

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}

#opening up connection, grabbing url 
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')
print(soup.p)

#grabs each product 
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

But the problem is that this page uses JavaScript to add items: it finds 65 li elements, but most of them are empty, because BeautifulSoup can't run the JavaScript that fills in their values.

You may need to use Selenium to control a real web browser, which can run JavaScript. Or you can check whether the data is already embedded somewhere in the JavaScript on the page, or whether the JavaScript reads the data from another URL - then you can use that URL directly with requests.
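For the second option, pages often ship their data as JSON inside a script tag, which BeautifulSoup can read without running any JavaScript. A minimal sketch of that pattern (the tag id and JSON shape here are hypothetical, not Etsy's actual markup):

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page that embeds its listing data as JSON in a <script> tag
html = """
<html><body>
<script type="application/json" id="listing-data">
{"listings": [{"title": "Green Soap"}, {"title": "Lavender Soap"}]}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The script tag is plain text to BeautifulSoup, so grab it and parse the JSON
script = soup.find("script", {"type": "application/json", "id": "listing-data"})
data = json.loads(script.string)

titles = [item["title"] for item in data["listings"]]
print(titles)  # ['Green Soap', 'Lavender Soap']
```

To find out whether the real page does something like this, look at the page source for large JSON blobs, or watch the browser's network tab for XHR requests returning JSON.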


EDIT:

A version which uses Selenium to load the page in Chrome/Firefox, close the popup window, scroll to the end of the page, and then get the elements both with and without BeautifulSoup:

from bs4 import BeautifulSoup as BS
import selenium.webdriver
import time

#opening up connection, grabbing url 
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get(url)

time.sleep(3)
driver.find_element_by_xpath('//button[@data-gdpr-single-choice-accept]').click()

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(1.5)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print('--- version 1 - BeautifulSoup ---')

html = driver.page_source

soup = BS(html, 'lxml')
print(soup.p)

#grabs each product 
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

print('--- version 2 - Selenium ---')

#grabs each product 
listings = driver.find_elements_by_css_selector('div.tab-reorder-container li.tab-reorder')
print(len(listings))

for item in listings:
    item = item.find_element_by_css_selector('h3')
    if item:
        print(item.text.strip())
