Can't find proper class for web scraping on Etsy
I'm trying to scrape product information from Etsy, and am following a relatively simple tutorial to do so.
This is my current code:
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
uclient = ureq(url)
page_html = uclient.read()
#html parsing
page_soup = soup(page_html, 'lxml')
print(page_soup.p)
#grabs each product
listings = page_soup.findAll("li", {"class":"wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder"})
len(listings)
The last step repeatedly outputs 0, specifically for this class, so I'm not sure what I'm doing wrong. Based on the inspect code, this is the appropriate class name and CSS class type to be using.
Etsy Inspect Code here
Would really appreciate any help. Thanks (-:
This is an idiosyncrasy of bs4 (or maybe I don't fully understand it...); try this instead:
listings = page_soup.find_all("li", class_="wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder")
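As a side note, here is a minimal sketch of why the full class string can fail (the HTML below is a made-up sample, not Etsy's real markup): bs4 treats `class` as a multi-valued attribute, so a single class name matches regardless of the other classes present, but a multi-class string is compared against the whole attribute value and is order-sensitive; a CSS selector avoids the problem.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; note the class order differs from the query below
html = '<ul><li class="tab-reorder wt-list-unstyled">item</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# A single class name matches no matter what other classes are present
one = soup.find_all("li", class_="tab-reorder")

# A multi-class string is matched against the whole attribute value,
# so the order and spacing must equal the HTML exactly
full = soup.find_all("li", class_="wt-list-unstyled tab-reorder")

# A CSS selector matches the classes in any order
css = soup.select("li.wt-list-unstyled.tab-reorder")

print(len(one), len(full), len(css))  # 1 0 1
```

So if Etsy renders the classes in a different order (or with different spacing) than what you copied from the inspector, the full-string match silently returns nothing.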
I can get 65 items, like on the page, using the simpler
soup.find("div", {"class": "tab-reorder-container"}).find_all("li", {"class":"tab-reorder"})
First I use find() to get the region with all the items, and later I use find_all() to find only the li elements in this region.
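The scoping idea can be seen with a tiny self-contained snippet (made-up markup, not Etsy's):

```python
from bs4 import BeautifulSoup

# Made-up sample: one matching <li> inside the container, one outside it
html = '''
<div class="tab-reorder-container">
  <li class="tab-reorder">inside</li>
</div>
<li class="tab-reorder">outside</li>
'''
soup = BeautifulSoup(html, "html.parser")

# find() narrows the search to the container first...
region = soup.find("div", {"class": "tab-reorder-container"})

# ...and find_all() then only sees <li> elements inside that region
items = region.find_all("li", {"class": "tab-reorder"})
print([li.get_text() for li in items])  # ['inside']
```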
import requests
from bs4 import BeautifulSoup as BS
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')
print(soup.p)
#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))
for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))
But the problem is that this page uses JavaScript to add the items, so it finds 65 items but most of them are empty, because BS can't run JavaScript to add all the values to the HTML.
You may need to use Selenium to control a real web browser, which can run JavaScript. Or you may need to check whether the data is somewhere in the JavaScript on the page, or whether the JavaScript reads the data from another URL - and then you can use that URL with requests.
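If the data does turn out to be embedded in the page's JavaScript, it often sits inside a `<script>` tag as JSON that you can pull out without running a browser at all. A minimal sketch of the idea, using made-up sample HTML and made-up keys (the real structure on any given page will differ):

```python
import json
from bs4 import BeautifulSoup

# Made-up sample: a page embedding its data as JSON for JavaScript to use
html = '''
<script type="application/ld+json">
{"itemListElement": [{"name": "Green Soap"}, {"name": "Bath Bomb"}]}
</script>
'''
soup = BeautifulSoup(html, "html.parser")

# Grab the embedded JSON text and parse it with the json module
tag = soup.find("script", type="application/ld+json")
data = json.loads(tag.string)

names = [item["name"] for item in data["itemListElement"]]
print(names)  # ['Green Soap', 'Bath Bomb']
```

You can find candidate `<script>` tags (or the URL an XHR request fetches) in the browser's DevTools while the page loads.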
EDIT:
Here is a version which uses Selenium to load the page in Chrome/Firefox, close the popup window, scroll to the end of the page, and get the elements both with BeautifulSoup and without BeautifulSoup.
from bs4 import BeautifulSoup as BS
import selenium.webdriver
from selenium.webdriver.common.by import By
import time

#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get(url)
time.sleep(3)

# close the GDPR popup window
driver.find_element(By.XPATH, '//button[@data-gdpr-single-choice-accept]').click()

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(1.5)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print('--- version 1 - BeautifulSoup ---')

html = driver.page_source
soup = BS(html, 'lxml')

print(soup.p)

#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

print('--- version 2 - Selenium ---')

#grabs each product
listings = driver.find_elements(By.CSS_SELECTOR, 'div.tab-reorder-container li.tab-reorder')
print(len(listings))

for item in listings:
    # find_element() raises NoSuchElementException when nothing matches,
    # so use find_elements() and test for an empty list instead
    headings = item.find_elements(By.CSS_SELECTOR, 'h3')
    if headings:
        print(headings[0].text.strip())