Can't find proper class for webscraping on Etsy
I'm trying to scrape product information from Etsy, following a relatively simple tutorial.
Here is my current code:
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
uclient = ureq(url)
page_html = uclient.read()
#html parsing
page_soup = soup(page_html, 'lxml')
print(page_soup.p)
#grabs each product
listings = page_soup.findAll("li", {"class":"wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder"})
len(listings)
The last step keeps outputting 0 for this specific class, so I don't know what I'm doing wrong. Based on inspecting the page, this is the appropriate class name and CSS class type (Etsy inspect screenshot here).
Any help is greatly appreciated. Thanks (-:
This may be a quirk of bs4 (or maybe I don't fully understand it...); try this:
listings = page_soup.find_all("li", class_="wt-list-unstyled wt-grid__item-xs-6 wt-grid__item-md-4 wt-grid__item-lg-3 wt-order-xs-0 wt-order-sm-0 wt-order-md-0 wt-order-lg-0 wt-order-xl-0 wt-order-tv-0 grid__item-xl-fifth tab-reorder")
I can get 65 items on the page in a simpler way:
soup.find("div", {"class": "tab-reorder-container"}).find_all("li", {"class":"tab-reorder"})
First I use find() to get the region that contains all the items, and then I use find_all() to find only the li elements inside that region.
import requests
from bs4 import BeautifulSoup as BS
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"
r = requests.get(url, headers=headers)
soup = BS(r.text, 'lxml')
print(soup.p)
#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})
print(len(listings))
for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))
But the problem is that this page uses JavaScript to add items to the page. It finds 65 items, but most of them are empty, because BS can't run the JavaScript that adds all the values to the HTML.
It may need Selenium to control a real web browser, which can run JavaScript. Or it may need checking whether the data is somewhere in the JavaScript on the page, or whether the JavaScript reads the data from another URL; then you could use that URL with requests.
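For the "data is already in the page" case, a minimal sketch of pulling structured data out of an embedded JSON-LD script tag, the way many shop pages expose listings without needing JavaScript to run. The HTML snippet below is a made-up stand-in for a page downloaded with requests; the real tag name and JSON shape would have to be confirmed in the actual page source:

```python
import json
from bs4 import BeautifulSoup

# Stand-in for r.text from a requests.get() call; the JSON-LD content
# here is hypothetical, not Etsy's real payload.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"name": "Lavender Soap", "offers": {"price": "7.50"}},
  {"name": "Green Tea Soap", "offers": {"price": "6.00"}}
]}
</script>
</head><body></body></html>
"""

page = BeautifulSoup(html, 'html.parser')

# JSON-LD is plain text inside the <script> tag, so no JavaScript engine
# is needed - just parse the string with the json module.
tag = page.find('script', type='application/ld+json')
data = json.loads(tag.string)

for item in data['itemListElement']:
    print(item['name'], item['offers']['price'])
```

If the data isn't embedded in the page at all, the browser's network tab will usually show the XHR URL that the JavaScript fetches it from instead.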
EDIT:
Here is a version which uses Selenium to load the page in Chrome/Firefox, close the popup window, scroll to the end of the page, and get the elements both with BeautifulSoup and without BeautifulSoup.
from bs4 import BeautifulSoup as BS
import selenium.webdriver
from selenium.webdriver.common.by import By
import time

#opening up connection, grabbing url
url = "https://www.etsy.com/sg-en/search/bath-and-beauty/soaps?q=green+beauty&explicit=1&ref=pagination&page=1"

driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()
driver.get(url)
time.sleep(3)

# close the GDPR consent popup
driver.find_element(By.XPATH, '//button[@data-gdpr-single-choice-accept]').click()

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(1.5)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

print('--- version 1 - BeautifulSoup ---')

html = driver.page_source
soup = BS(html, 'lxml')

print(soup.p)

#grabs each product
listings = soup.find('div', {'class': 'tab-reorder-container'}).find_all("li", {"class":"tab-reorder"})

print(len(listings))

for item in listings:
    item = item.find('h3')
    if item:
        print(item.get_text(strip=True))

print('--- version 2 - Selenium ---')

#grabs each product
listings = driver.find_elements(By.CSS_SELECTOR, 'div.tab-reorder-container li.tab-reorder')

print(len(listings))

for item in listings:
    # find_elements returns an empty list (instead of raising) when <h3> is missing
    h3 = item.find_elements(By.CSS_SELECTOR, 'h3')
    if h3:
        print(h3[0].text.strip())