I'm trying to scrape product information from Etsy, and am following a relatively simple tutorial to do so.
This is my current code:
from urllib.request import Request, urlopen as ureq
from bs4 import BeautifulSoup as soup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
uclient = ureq(Request(url, headers=headers))
page_html = uclient.read()
uclient.close()
#html parsing
page_soup = soup(page_html, 'lxml')
print(page_soup.p)
#grabs each product
listings = page_soup.findAll("li", {"class":"wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder"})
len(listings)
The last step always outputs 0 for this particular class, and I'm not sure what I'm doing wrong. Based on the browser's inspect view, this should be the correct class name and CSS class to use (screenshot: Etsy Inspect Code here).
Would really appreciate any help. Thanks (-:
This may be an idiosyncrasy of bs4 (or maybe I don't fully understand it), but try this instead:
listings = page_soup.find_all("li", class_="wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder")
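The behavior behind this is worth spelling out: BeautifulSoup treats `class` as a multi-valued attribute, so searching for a single class name matches any tag that carries it, while a string containing spaces is compared against the exact attribute value, character for character. A small self-contained demo (the class names mirror the Etsy markup, the content is invented):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="wt-list-unstyled tab-reorder">first</li>
  <li class="tab-reorder wt-list-unstyled">second</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag whose class list contains it,
# regardless of the other classes or their order:
print(len(soup.find_all("li", class_="tab-reorder")))  # 2

# A multi-class string is compared against the exact attribute value,
# so order and spacing must match the HTML exactly:
print(len(soup.find_all("li", class_="wt-list-unstyled tab-reorder")))  # 1
```

This is why copying the full class string out of the inspector is fragile: one reordered or added utility class in the live HTML and the match silently returns nothing.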
I can get the 65 items shown on the page using something simpler:
soup.find("div", {"class": "tab-reorder-container"}).find_all("li", {"class": "tab-reorder"})
First I use find() to get the region that contains all the items, and then I use find_all() to find only the li elements inside that region.
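The two-step narrowing can be shown on a tiny stand-in page (the container class names mirror the Etsy markup, the listing contents are invented for illustration):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the search page.
html = """
<div class="tab-reorder-container">
  <li class="tab-reorder"><h3>Lavender soap</h3></li>
  <li class="tab-reorder"><h3>Charcoal soap</h3></li>
</div>
<li class="tab-reorder">a stray item outside the container</li>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: find() narrows the search to the container holding the results.
container = soup.find("div", {"class": "tab-reorder-container"})
# Step 2: find_all() then runs only inside that region.
listings = container.find_all("li", {"class": "tab-reorder"})
print(len(listings))  # 2 -- the stray <li> outside the container is ignored
```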
import requests
from bs4 import BeautifulSoup as BS
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')
print(soup.p)
#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))
for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))
But the problem is that this page uses JavaScript to add items to the page: it finds 65 items, but most of them are empty, because BS can't run the JavaScript that fills in the values in the HTML.
You may need Selenium to control a real web browser, which can run JavaScript. Or check whether the data is already somewhere in the JavaScript on the page, or whether the JavaScript reads the data from some other URL - in that case you can fetch that URL directly with requests.
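For the "data already on the page" case, one common place to look is a `<script type="application/ld+json">` block with structured data. Whether Etsy exposes the listings this way has to be verified in the browser's developer tools; the fragment below is invented purely to illustrate the extraction step:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page fragment: many sites embed structured data as JSON-LD.
# The structure here is an assumption for illustration only.
html = """
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"name": "Green tea soap"}, {"name": "Aloe soap"}]}
</script>
"""
soup = BeautifulSoup(html, "html.parser")

# Grab the JSON-LD block and parse it with the standard json module.
tag = soup.find("script", type="application/ld+json")
data = json.loads(tag.string)
for item in data["itemListElement"]:
    print(item["name"])
```

If the data turns out to live behind a separate URL instead (visible in the Network tab), the same `json` parsing applies to the `requests` response body.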
EDIT:
A version which uses Selenium to load the page in Chrome/Firefox, close the popup window, scroll to the end of the page, and then get the elements both with BeautifulSoup and without BeautifulSoup:
from bs4 import BeautifulSoup as BS
import selenium.webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get(url)
time.sleep(3)

# close the GDPR popup
driver.find_element(By.XPATH, '//button[@data-gdpr-single-choice-accept]').click()

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(1.5)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print('--- version 1 - BeautifulSoup ---')

html = driver.page_source
soup = BS(html, 'lxml')
print(soup.p)

#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class": "tab-reorder"})
print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

print('--- version 2 - Selenium ---')

#grabs each product
listings = driver.find_elements(By.CSS_SELECTOR, 'div.tab-reorder-container li.tab-reorder')
print(len(listings))

for item in listings:
    try:
        print(item.find_element(By.CSS_SELECTOR, 'h3').text.strip())
    except NoSuchElementException:
        pass  # find_element raises instead of returning None
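The scroll loop is generic enough to factor into a helper. A minimal sketch, with a fake driver object (invented here) standing in for a real browser so the loop can be exercised without launching Selenium:

```python
import time

def scroll_to_bottom(driver, pause=1.5):
    """Scroll until document.body.scrollHeight stops growing.

    `driver` only needs an `execute_script` method, so the same loop works
    with any Selenium WebDriver, or with a test stand-in as below.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

# Fake driver whose page "grows" twice before settling, so the loop
# terminates after two rounds of loading:
class FakeDriver:
    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.calls = 0  # counts height queries
    def execute_script(self, script):
        if script.startswith("return"):
            h = self.heights[min(self.calls, len(self.heights) - 1)]
            self.calls += 1
            return h
        return None  # scroll commands return nothing

fake = FakeDriver()
scroll_to_bottom(fake, pause=0)
print(fake.calls)  # 4: one initial query plus one per loop iteration
```

Separating the loop this way also makes it easy to swap `time.sleep` for an explicit wait later without touching the calling code.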