Can't find proper class for web scraping on Etsy
I'm trying to scrape product information from Etsy, following a relatively simple tutorial.
Here is my current code:
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
uclient = ureq(url)
page_html = uclient.read()
uclient.close()
#html parsing
page_soup = soup(page_html, 'lxml')
print(page_soup.p)
#grabs each product
listings = page_soup.findAll("li", {"class":"wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder"})
len(listings)
The last step keeps printing 0 for this specific class, so I don't know what I'm doing wrong. According to the browser's inspector, this is the proper class name and type of CSS class to use.
Any help is much appreciated, thank you (-:
A quirk of bs4 (or maybe I don't fully understand it...) - try this:
listings = page_soup.find_all("li", class_="wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder")
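The quirk, as I understand it: when `class_` (or a `{"class": ...}` filter) is given a string containing spaces, BeautifulSoup matches it against the complete `class` attribute exactly as served, so a different class order or an extra class in the real HTML yields zero results, while a CSS selector matches each class independently. A minimal sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# toy document: the <li> has the same classes, but in a different order
html = '<ul><li class="tab-reorder wt-list-unstyled">Soap</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# exact-string match against the full class attribute -> misses it
print(len(soup.find_all("li", class_="wt-list-unstyled tab-reorder")))  # 0

# CSS selector matches each class independently of order -> finds it
print(len(soup.select("li.wt-list-unstyled.tab-reorder")))  # 1
```

This is why a long copied-from-inspector class string is fragile: the server may emit the classes in a different order or add one more, and the exact-string match silently fails.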
I can get 65 items on the page using a simpler approach:
soup.find("div", {"class": "tab-reorder-container"}).find_all("li", {"class":"tab-reorder"})
First I use find() to get the region that contains all the items, and then I use find_all() to find only the li elements inside that region.
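The find()-then-find_all() narrowing can be seen on a toy document (the HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

# toy page: items live inside one container; a stray <li> sits outside it
html = """
<div class="tab-reorder-container">
  <li class="tab-reorder"><h3>Lavender Soap</h3></li>
  <li class="tab-reorder"><h3>Mint Soap</h3></li>
</div>
<li class="tab-reorder"><h3>Not a listing</h3></li>
"""
soup = BeautifulSoup(html, "html.parser")

# step 1: find() the container; step 2: find_all() only inside it
container = soup.find("div", {"class": "tab-reorder-container"})
items = container.find_all("li", {"class": "tab-reorder"})
print(len(items))  # 2 - the stray <li> outside the container is excluded
```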
import requests
from bs4 import BeautifulSoup as BS
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')
print(soup.p)
#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))
for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))
But the problem is that this page uses JavaScript to add the items to the page, so it finds 65 items but most of them are empty, because BS can't run the JavaScript that adds all the values to the HTML.
It may need Selenium to control a real web browser, which can run JavaScript. Or it may need checking whether the data sits somewhere in the JavaScript on the page, or whether the JavaScript reads the data from another URL - in which case you can fetch that URL directly with requests.
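Before reaching for Selenium, it is worth grepping the page source for embedded JSON - many shop pages ship listing data in a `<script type="application/ld+json">` block that bs4 can read without running any JavaScript. Whether Etsy does this for search pages is an assumption you would have to verify in the actual page source; the HTML below is a stand-in:

```python
import json
from bs4 import BeautifulSoup

# stand-in for r.text from requests; real pages sometimes embed listing
# data as JSON-LD in a <script> tag that bs4 can read without JavaScript
html = """
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"name": "Green Tea Soap"}, {"name": "Charcoal Soap"}]}
</script>
"""
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", type="application/ld+json")
data = json.loads(tag.string)
for item in data["itemListElement"]:
    print(item["name"])
```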
EDIT:
A version which uses Selenium to load the page in Chrome/Firefox, close the popup window, scroll to the end of the page, and get the elements both with BeautifulSoup and without BeautifulSoup:
from bs4 import BeautifulSoup as BS
import selenium.webdriver
from selenium.webdriver.common.by import By
import time

#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get(url)
time.sleep(3)

# close the GDPR popup window
driver.find_element(By.XPATH, '//button[@data-gdpr-single-choice-accept]').click()

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(1.5)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print('--- version 1 - BeautifulSoup ---')

html = driver.page_source
soup = BS(html, 'lxml')
print(soup.p)

#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class": "tab-reorder"})
print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

print('--- version 2 - Selenium ---')

#grabs each product
listings = driver.find_elements(By.CSS_SELECTOR, 'div.tab-reorder-container li.tab-reorder')
print(len(listings))

for item in listings:
    # find_element() raises if <h3> is missing, so use find_elements()
    titles = item.find_elements(By.CSS_SELECTOR, 'h3')
    if titles:
        print(titles[0].text.strip())