
C# HtmlAgilityPack - Scraping

I want to use HtmlAgilityPack to scrape content from GSMArena.com. Specifically, I want to scrape the technical specifications of cell phones.

Desired Outcome:

From http://www.gsmarena.com/nokia_lumia_520-5322.php I want to scrape the weight, dimensions, and so on.

Issue: The node path differs between nearly all models.

My Question:

How would I scrape by searching? For example, if I wanted to scrape the product weight, is there a way to tell HtmlAgilityPack to search for an <a> tag, then go to the TD that follows it, and scrape the inner text of that TD?

XPath is your friend. Any XPath 1.0 tutorial will cover what you need here.

For that document:

   // doc is an HtmlDocument that has already been loaded with the phone's page
   string weight = doc.DocumentNode.SelectSingleNode(@"//td[a[contains(text(),'Weight')]]/following-sibling::td").InnerText;

This gets you the weight.

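Explanation for XPath: "//" searches the whole document, "td[a[contains(text(),'Weight')]]" matches any "td" element that has a child "a" element whose text contains "Weight", and "/following-sibling::td" then selects the "td" elements that follow it in the same row. SelectSingleNode returns the first of those, or null if nothing matches.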
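
The same XPath can be wrapped in a small helper so that any label (Weight, Dimensions, ...) can be looked up, with a null check for models that lack the row. This is only a sketch: SpecScraper and GetSpec are hypothetical names, and the inline HTML is a simplified stand-in for the spec table, so the markup on the live page may differ. For a live page, the document could be loaded with new HtmlWeb().Load(url) instead of LoadHtml.

```csharp
using System;
using HtmlAgilityPack;

public static class SpecScraper
{
    // Returns the text of the cell that follows the label cell,
    // or null when the page has no row with that label.
    public static string GetSpec(HtmlDocument doc, string label)
    {
        // The label is inserted into the XPath as-is, so it must not contain a single quote.
        string xpath = $"//td[a[contains(text(),'{label}')]]/following-sibling::td";
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        return cell == null ? null : HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static void Main()
    {
        // Assumed, simplified markup: a label cell holding a link, followed by a value cell.
        const string html = @"
<table>
  <tr><td class=""ttl""><a href=""#"">Dimensions</a></td><td class=""nfo"">119.9 x 64 x 9.9 mm</td></tr>
  <tr><td class=""ttl""><a href=""#"">Weight</a></td><td class=""nfo"">124 g</td></tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        Console.WriteLine(GetSpec(doc, "Weight"));                     // 124 g
        Console.WriteLine(GetSpec(doc, "Dimensions"));                 // 119.9 x 64 x 9.9 mm
        Console.WriteLine(GetSpec(doc, "Battery") ?? "(not found)");   // (not found)
    }
}
```

Because the lookup is by label text rather than by position, it does not depend on where the row sits in the table, which is what varies from model to model.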


 