How to scrape information inside an unordered list with Selenium + Python

I am working on a web scraping project in which I try to scrape information from the Amazon website. On the product page there is an unordered list with information such as:

Item Weight: 17.2 pounds
Shipping Weight: 17.4 pounds (View shipping rates and policies)
ASIN: B00HC767P6
UPC: 766789717088 885720483186 052000201628
Item model number: mark-1hooi-toop842
Customer Reviews: 4.8 out of 5 stars1,352 customer ratings
Amazon Best Sellers Rank: #514 in Grocery & Gourmet Food (See Top 100 in Grocery & Gourmet Food)
#12 in Sports Drinks

The list itself does not have a class. The problem is that I do not want all the information from the list, only the ASIN code. The li tags do not have any specific class or ID either. Here is the link to the product details page.

Before Selenium, I was working with BeautifulSoup, and this is how I tackled the issue:

asin = str(soup.find('bdi', {'dir': 'ltr'}).find_parent('li'))[38:].split('<')[0]

I am now switching to Selenium. How do I scrape this information?

You can use a CSS selector to get the relevant li item, as follows:

Finding the child element by index with a CSS selector

$(".content > ul > li:nth-child(2)").textContent >>> "Shipping Weight: 33 pounds (View shipping rates and policies)"
$(".content > ul > li:nth-child(3)").textContent >>> "ASIN: B07QKN2ZT9"

Related Python Selenium code:

driver.find_element_by_css_selector(".content > ul > li:nth-child(3)").text.split(": ")[1] >>> 'B07QKN2ZT9'
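Newer Selenium 4 releases dropped the find_element_by_* helpers, so the call above may raise AttributeError there. A minimal sketch of the same lookup with the By locator style follows; the function names value_after_label and asin_by_index are made up for illustration, and the selector is assumed to match the page exactly as in the answer above.

```python
def value_after_label(li_text: str) -> str:
    """Return the part of a 'Label: value' string after the first colon."""
    _, _, value = li_text.partition(":")
    return value.strip()


def asin_by_index(driver, index: int = 3) -> str:
    """Read the nth <li> of the details list and return its value.

    `driver` is an already-started Selenium WebDriver on the product page.
    The default index 3 is taken from the example above and may differ
    from product to product.
    """
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium.webdriver.common.by import By

    selector = f".content > ul > li:nth-child({index})"
    li = driver.find_element(By.CSS_SELECTOR, selector)
    return value_after_label(li.text)
```

Splitting on the first colon with partition avoids an IndexError when a line has no ": " separator; in that case the helper returns an empty string.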

Finding the ancestor element by XPath

If the ASIN is not always at the same index, you can find the bdi element whose text is ASIN, find its ancestor::li, then get its text and extract the relevant part, like the following:

driver.find_element_by_xpath("//bdi[text()='ASIN']/ancestor::li").text.split(": ")[1] >>> 'B07QKN2ZT9'
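The same idea can be sketched for Selenium 4 with an explicit wait, in case the list is rendered after the initial page load. The helper names li_xpath_for_label and detail_value, the 10-second timeout, and the use of normalize-space() are assumptions for illustration, not part of the original answer.

```python
def li_xpath_for_label(label: str) -> str:
    """Build an XPath for the nearest <li> ancestor of a <bdi> whose text is `label`.

    Assumes `label` contains no single quote.
    """
    # normalize-space() tolerates stray whitespace around the label text;
    # [1] on the ancestor axis selects the closest <li>, not the outermost one.
    return f"//bdi[normalize-space()='{label}']/ancestor::li[1]"


def detail_value(driver, label: str = "ASIN", timeout: float = 10.0) -> str:
    """Wait for the labelled <li> to appear and return the text after the colon."""
    # Imported here so the XPath builder above works without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, li_xpath_for_label(label))
    li = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(locator)
    )
    _, _, value = li.text.partition(":")
    return value.strip()
```

With a driver already on the product page, detail_value(driver) would look up the ASIN line, and detail_value(driver, "UPC") would do the same for another label, provided the page marks its labels up with bdi elements as described above.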

Generating an XPath

//<element type>[<attribute type> = <attribute value>]/<descendant>
//bdi[text() = 'ASIN'] >>> bdi element with text 'ASIN'
//bdi[@dir = 'ltr'] >>> bdi element with dir attribute equals to 'ltr'

Accessing an ancestor of an element

/ancestor::<ancestor element type>
//bdi[text()='ASIN']/ancestor::li >>> li
//bdi[text()='ASIN']/ancestor::ul >>> ul

You can check this as a reference.
