How to scrape information inside an unordered list with Selenium + Python
I am working on a web scraping project where I am trying to scrape information from the Amazon website. On the product page there is an unordered list with information like this:
Item Weight: 17.2 pounds
Shipping Weight: 17.4 pounds (View shipping rates and policies)
ASIN: B00HC767P6
UPC: 766789717088 885720483186 052000201628
Item model number: mark-1hooi-toop842
Customer Reviews: 4.8 out of 5 stars1,352 customer ratings
Amazon Best Sellers Rank: #514 in Grocery & Gourmet Food (See Top 100 in Grocery & Gourmet Food)
#12 in Sports Drinks
The list itself does not have a class. The problem is that I do not want all the information from the list, only the ASIN code. The li tags do not have any specific class or ID either. Here is the link to the product details page.
Before Selenium, I was working with BeautifulSoup, and this is how I tackled the issue:
asin = str(soup.find('bdi', {'dir': 'ltr'}).find_parent('li'))[38:].split('<')[0]
I am now switching to Selenium. How do I scrape this information?
You can use a CSS selector to get the relevant li item, as follows:
$(".content > ul > li:nth-child(2)").textContent >>> "Shipping Weight: 33 pounds (View shipping rates and policies)"
$(".content > ul > li:nth-child(3)").textContent >>> "ASIN: B07QKN2ZT9"
The equivalent Python Selenium code:
driver.find_element_by_css_selector(".content > ul > li:nth-child(3)").text.split(": ")[1] >>> 'B07QKN2ZT9'
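The split(": ")[1] step above raises an IndexError when a row has no ": " in it (for example the "#12 in Sports Drinks" line). Here is a minimal sketch of a more defensive variant that turns "Label: value" lines into a dict. parse_details is a hypothetical helper name, and the sample lines are copied from the list in the question.

```python
def parse_details(lines, sep=": "):
    """Map each 'Label: value' line to {label: value}; lines without `sep` are skipped."""
    details = {}
    for line in lines:
        # partition splits on the first separator only, so values may contain ': '
        label, found, value = line.partition(sep)
        if found:
            details[label.strip()] = value.strip()
    return details


sample = [
    "Item Weight: 17.2 pounds",
    "Shipping Weight: 17.4 pounds (View shipping rates and policies)",
    "ASIN: B00HC767P6",
    "#12 in Sports Drinks",  # no separator, so this row is ignored
]
details = parse_details(sample)
print(details["ASIN"])  # B00HC767P6
```

With Selenium, the lines would presumably come from the .text of each li element; looking the value up by label then no longer depends on the row's position.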
If the ASIN is not always at the same index, you can find the bdi element whose text is ASIN, go to its ancestor::li, then get its text and extract the relevant part, like the following:
driver.find_element_by_xpath("//bdi[text()='ASIN']/ancestor::li").text.split(": ")[1] >>> 'B07QKN2ZT9'
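In recent Selenium 4 releases the find_element_by_* helpers are no longer available, and the lookup is written as find_element(By.XPATH, ...). Below is a sketch of the same XPath lookup in that style. find_asin is a hypothetical helper name, and the try/except around the import is only there so the function can be exercised without Selenium installed.

```python
try:
    from selenium.webdriver.common.by import By
    XPATH = By.XPATH
except ImportError:
    # Fallback for environments without Selenium; By.XPATH is the string "xpath".
    XPATH = "xpath"


def find_asin(driver):
    """Return the ASIN from the product-details list, or None if the row has no ': '."""
    # Same XPath as above: the bdi labelled 'ASIN', then its enclosing li.
    # find_element raises NoSuchElementException when nothing matches.
    row = driver.find_element(XPATH, "//bdi[text()='ASIN']/ancestor::li")
    _, found, value = row.text.partition(": ")
    return value.strip() if found else None
```

It would be called as find_asin(driver) once the product page has loaded in an existing WebDriver instance.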
//<element type>[<attribute type> = <attribute value>]/<descendant>
//bdi[text() = 'ASIN'] >>> bdi element with text 'ASIN'
//bdi[@dir = 'ltr'] >>> bdi element with dir attribute equals to 'ltr'
/ancestor::<ancestor element type>
//bdi[text()='ASIN']/ancestor::li >>> li
//bdi[text()='ASIN']/ancestor::ul >>> ul
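To try the "find the bdi, then climb to its li" idea without a browser, here is a small offline sketch using the standard library's xml.etree.ElementTree. The markup is an assumption made for illustration (the real page may differ), and ElementTree only supports a subset of XPath: it has no ancestor:: axis or text() test, so the sketch uses [.='ASIN'] and '..' instead.

```python
import xml.etree.ElementTree as ET

# Assumed markup, for illustration only.
html = """
<div class="content">
  <ul>
    <li><b>Shipping Weight: </b>17.4 pounds</li>
    <li><b><bdi dir="ltr">ASIN</bdi>: </b>B00HC767P6</li>
  </ul>
</div>
"""
root = ET.fromstring(html)

# Attribute predicate: bdi element whose dir attribute equals 'ltr'.
print(root.find(".//bdi[@dir='ltr']").text)

# Text predicate, then two parent steps (bdi -> b -> li) in place of ancestor::li.
li = root.find(".//bdi[.='ASIN']/../..")
row_text = "".join(li.itertext()).strip()
asin = row_text.partition(": ")[2]
print(li.tag, asin)
```

In Selenium itself the full XPath syntax is available, so the ancestor::li form shown above can be used directly.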
You can check this as a reference.