
How to scrape information inside an unordered list with Selenium + Python

I am working on a web scraping project where I try to scrape information from the Amazon website. The product page contains an unordered list with information like this:

Item Weight: 17.2 pounds
Shipping Weight: 17.4 pounds (View shipping rates and policies)
ASIN: B00HC767P6
UPC: 766789717088 885720483186 052000201628
Item model number: mark-1hooi-toop842
Customer Reviews: 4.8 out of 5 stars1,352 customer ratings
Amazon Best Sellers Rank: #514 in Grocery & Gourmet Food (See Top 100 in Grocery & Gourmet Food)
#12 in Sports Drinks

The list itself does not have a class. I do not want all of the information in the list, only the ASIN code. The li tags do not have any specific class or ID either. Here is the link to the product details page

Before Selenium, I was working with BeautifulSoup, and this is how I tackled the issue:

asin = str(soup.find('bdi', {'dir': 'ltr'}).find_parent('li'))[38:].split('<')[0]

I am now switching to Selenium. How do I scrape this information?

You can use a CSS selector to get the relevant li item, as follows:

Finding the child element by index with a CSS selector

$(".content > ul > li:nth-child(2)").textContent >>> "Shipping Weight: 33 pounds (View shipping rates and policies)"
$(".content > ul > li:nth-child(3)").textContent >>> "ASIN: B07QKN2ZT9"

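The equivalent Python Selenium code (the find_element_by_* helpers were removed in Selenium 4.3, so By locators are used here):

from selenium.webdriver.common.by import By
driver.find_element(By.CSS_SELECTOR, ".content > ul > li:nth-child(3)").text.split(": ")[1]  # 'B07QKN2ZT9'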
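If the position of an item varies from product to product, one option is to read every li once and look the value up by its label instead of by index. The following is only a sketch under assumptions: `parse_detail_line` and `scrape_details` are hypothetical helper names, the `.content > ul > li` selector is taken from the answer above, and `driver` is assumed to be an already created Selenium 4 WebDriver with the product page loaded.

```python
def parse_detail_line(text):
    """Split 'Label: value' into (label, value); value is '' when there is no colon."""
    label, _, value = text.partition(":")
    return label.strip(), value.strip()


def scrape_details(driver, selector=".content > ul > li"):
    """Return a {label: value} dict for every <li> matched by the CSS selector."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR constant.
    items = driver.find_elements("css selector", selector)
    details = {}
    for item in items:
        label, value = parse_detail_line(item.text)
        if label:  # skip empty list items
            details[label] = value
    return details
```

With that in place, `scrape_details(driver).get("ASIN")` would return the ASIN, or None when the list has no such entry.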

Finding the ancestor element by XPath

If the ASIN is not always at the same index, you can find the bdi element whose text is ASIN, go up to its ancestor li, read that element's text, and extract the relevant part, like this:

from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, "//bdi[text()='ASIN']/ancestor::li").text.split(": ")[1]  # 'B07QKN2ZT9'
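The same lookup can be wrapped in a small function so that a missing ": " separator does not raise an IndexError. This is a rough sketch, not a definitive implementation: `detail_value` is a hypothetical name, `driver` is assumed to be a Selenium 4 WebDriver, and it assumes the label you pass sits inside a bdi element the way ASIN does (the page may not do that for every row).

```python
def detail_value(driver, label):
    """Return the text after 'label:' in the <li> that contains <bdi>label</bdi>."""
    # Assumption: label contains no single quote, because it is pasted into the XPath.
    xpath = f"//bdi[text()='{label}']/ancestor::li"
    # "xpath" is the string value of selenium's By.XPATH constant.
    li_text = driver.find_element("xpath", xpath).text
    _, sep, value = li_text.partition(":")
    # Return None instead of failing when the text has no colon.
    return value.strip() if sep else None
```

Calling `detail_value(driver, "ASIN")` on the page from the answer should then give the same string as the one-liner above.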

Building the XPath

//<element type>[<attribute type> = <attribute value>]/<descendant>
//bdi[text() = 'ASIN'] >>> bdi element with text 'ASIN'
//bdi[@dir = 'ltr'] >>> bdi element with dir attribute equals to 'ltr'

Accessing an ancestor of an element

/ancestor::<ancestor element type>
//bdi[text()='ASIN']/ancestor::li >>> li
//bdi[text()='ASIN']/ancestor::ul >>> ul
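The two patterns above can be combined in a tiny string builder. This is purely illustrative: `build_xpath` is a hypothetical helper, not part of Selenium, and it assumes the values contain no single quotes.

```python
def build_xpath(tag, attr=None, value=None, text=None, ancestor=None):
    """Assemble //tag[@attr='value'][text()='text']/ancestor::ancestor from the given parts."""
    parts = [f"//{tag}"]
    if attr is not None:
        parts.append(f"[@{attr}='{value}']")  # attribute predicate
    if text is not None:
        parts.append(f"[text()='{text}']")  # text predicate
    if ancestor is not None:
        parts.append(f"/ancestor::{ancestor}")  # step up to an enclosing element
    return "".join(parts)


print(build_xpath("bdi", text="ASIN", ancestor="li"))
print(build_xpath("bdi", attr="dir", value="ltr"))
```

The first call prints the expression used in the answer, `//bdi[text()='ASIN']/ancestor::li`, which can be passed straight to the driver's XPath lookup.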


You can check this as a reference
