
Selenium Python to Scrape a Website For Product and Color

This is the source of what I want to scrape.

View source: https://www.supremenewyork.com/shop/all/jackets

</div></div></li><li><div class="inner-article"><a style="height:150px;" href="/shop/jackets/g84fwstrv/tlxs5mzgi"><img width="150" height="150" src="//assets.supremenewyork.com/189108/vi/2-yV7cMNF3Q.jpg" alt="2 yv7cmnf3q" /></a><div class="product-name"><a class="name-link" href="/shop/jackets/g84fwstrv/tlxs5mzgi">Supreme®/Barbour® Lightweight<br> Waxed Cotton Field Jacket</a></div><div class="product-style"><a class="name-link" href="/shop/jackets/g84fwstrv/tlxs5mzgi">Orange</a>

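For example, I want my scraper to find any product based on keywords, here Supreme/Barbour Lightweight Waxed Cotton Field Jacket, and then the colorway, Orange. Note that this is only an example and the product URLs are dynamic, so I need to be able to get the text I want every time, not just an XPath that clicks that exact link.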

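I suggest using BeautifulSoup. Here is a good tutorial that explains how I am using select and CSS selectors.

CSS Selectors

These selectors are how the CSS language lets developers specify which HTML tags to style. Here are some examples:

html body: finds all body tags inside an html tag.

p.outer-text: finds all p tags with a class of outer-text.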
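To make the two selectors above concrete, here is a minimal sketch that runs them with BeautifulSoup's select. The HTML string is made up purely for illustration and is not taken from the target site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only to illustrate the two selectors.
html = """
<html>
  <body>
    <p class="outer-text">First paragraph</p>
    <p>Plain paragraph</p>
    <div><p class="outer-text">Nested paragraph</p></div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# "html body": body tags that are descendants of an html tag.
bodies = soup.select("html body")

# "p.outer-text": p tags whose class list contains "outer-text",
# at any depth in the document.
outer = [p.get_text() for p in soup.select("p.outer-text")]

print(len(bodies))
print(outer)
```

select always returns a list, so an unmatched selector gives an empty list rather than raising an error.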

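With CSS selectors in mind, inspect the web page for useful information such as tags (a, p, img) and identifiers (id, class). To find the links specifically, look for the div tags with the class inner-article, find the a tag inside each one, and extract its href.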


from bs4 import BeautifulSoup
import requests

# Download the category page and fail early on an HTTP error
page = requests.get("https://www.supremenewyork.com/shop/all/jackets")
page.raise_for_status()

soup = BeautifulSoup(page.content, 'html.parser')

# The hrefs on the page are relative, so prepend the site root
base_url = "https://www.supremenewyork.com"

# One div of each kind per product tile, in page order
names = soup.select('div.product-name')
styles = soup.select('div.product-style')
links = [base_url + x.find('a')["href"] for x in soup.select('div.inner-article')]

# Pair the three lists up by position
for name, style, link in zip(names, styles, links):
    print(f"Name: {name.text},  Style: {style.text}, Link: {link}")

Output:

Name: Supreme®/Barbour® Lightweight Waxed Cotton Field Jacket,  Style: Leopard, Link: https://www.supremenewyork.com/shop/jackets/g84fwstrv/a9och5sqd
Name: Supreme®/Barbour® Lightweight Waxed Cotton Field Jacket,  Style: Orange, Link: https://www.supremenewyork.com/shop/jackets/g84fwstrv/tlxs5mzgi
Name: Supreme®/Barbour® Lightweight Waxed Cotton Field Jacket,  Style: Black, Link: https://www.supremenewyork.com/shop/jackets/g84fwstrv/uw3m41dl6
Name: Military Trench Coat,  Style: Olive Paisley, Link: https://www.supremenewyork.com/shop/jackets/warmwnguk/vt4hfl7nb
Name: Military Trench Coat,  Style: Peach Paisley, Link: https://www.supremenewyork.com/shop/jackets/warmwnguk/l42els7zp
Name: Military Trench Coat,  Style: Black, Link: https://www.supremenewyork.com/shop/jackets/warmwnguk/agyucqie3
Name: Raglan Court Jacket,  Style: Black, Link: https://www.supremenewyork.com/shop/jackets/df2mva4b6/z5rpqg4is
Name: Raglan Court Jacket,  Style: Flags, Link: https://www.supremenewyork.com/shop/jackets/df2mva4b6/iise068yb
Name: Raglan Court Jacket,  Style: Pale Yellow, Link: https://www.supremenewyork.com/shop/jackets/df2mva4b6/rfkb2ci4n
Name: Raglan Court Jacket,  Style: Olive, Link: https://www.supremenewyork.com/shop/jackets/df2mva4b6/ovblpjzm6
Name: Twill Varsity Jacket,  Style: Light Blue, Link: https://www.supremenewyork.com/shop/jackets/g0qtwiyl1/xbxlunom8
Name: Twill Varsity Jacket,  Style: Black, Link: https://www.supremenewyork.com/shop/jackets/g0qtwiyl1/f1w9ue5vl
Name: Big Letter Track Jacket,  Style: Black, Link: https://www.supremenewyork.com/shop/jackets/olcwsx6yg/dcpah7svl
Name: Big Letter Track Jacket,  Style: Dark Orange, Link: https://www.supremenewyork.com/shop/jackets/olcwsx6yg/p5eiyuxlj

If you want to look for a specific name and style and get its link, take user input and add a stopping condition to the for loop.
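One way the stopping condition could look is sketched below. The helper find_product and its parameters are hypothetical names introduced here, and the sample rows are copied from the output above instead of being fetched live. The sketch assumes the name is matched as a case-insensitive substring while the style must match exactly, so that "Orange" does not also pick up "Dark Orange".

```python
def find_product(items, name_keyword, style_keyword):
    """Return the first (name, style, link) tuple that matches, or None.

    Hypothetical helper: the name is matched as a case-insensitive
    substring, the style as a case-insensitive exact match.
    """
    name_keyword = name_keyword.casefold()
    style_keyword = style_keyword.strip().casefold()
    for name, style, link in items:
        name_ok = name_keyword in name.casefold()
        style_ok = style.strip().casefold() == style_keyword
        if name_ok and style_ok:
            # Stopping condition: return as soon as a match is found.
            return name, style, link
    return None


# Sample rows copied from the output above; with live data these would be
# built from zip(names, styles, links) using name.text and style.text.
items = [
    ("Supreme®/Barbour® Lightweight Waxed Cotton Field Jacket", "Leopard",
     "https://www.supremenewyork.com/shop/jackets/g84fwstrv/a9och5sqd"),
    ("Supreme®/Barbour® Lightweight Waxed Cotton Field Jacket", "Orange",
     "https://www.supremenewyork.com/shop/jackets/g84fwstrv/tlxs5mzgi"),
    ("Big Letter Track Jacket", "Dark Orange",
     "https://www.supremenewyork.com/shop/jackets/olcwsx6yg/p5eiyuxlj"),
]

match = find_product(items, "waxed cotton", "orange")
print(match[2] if match else "No match")
```

In an interactive script the two keywords could come from input() instead of being hard-coded, and the returned link could then be opened with requests or passed to a Selenium driver.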

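(Also, I'm not 100% sure why someone downvoted this question, but I would recommend providing a clear description, a goal, and the code you have tried. In the future, ask what kinds of tools to use for the project rather than asking for a full solution on SO.)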

