
Scraping With HtmlAgilityPack

I have a huge HTML page that I want to scrape values from.

I tried using Firebug to get the XPath of the element I want, but the XPath is not static: it changes from time to time. How can I reliably get the values I want?

In the following snippet I want to get the production of Lumber per hour, which is the value 20.

    <div class="boxes-contents cf"><table id="production" cellpadding="1" cellspacing="1">
    <thead>
        <tr>
            <th colspan="4">
                Production per hour:            </th>
        </tr>
    </thead>
    <tbody>
                <tr>
            <td class="ico">
                <img class="r1" src="img/x.gif" alt="Lumber" title="Lumber" />
            </td>
            <td class="res">
                Lumber:
            </td>
            <td class="num">
                20          </td>
        </tr>
                <tr>
            <td class="ico">
                <img class="r2" src="img/x.gif" alt="Clay" title="Clay" />
            </td>
            <td class="res">
                Clay:
            </td>
            <td class="num">
                20          </td>
        </tr>
                <tr>
            <td class="ico">
                <img class="r3" src="img/x.gif" alt="Iron" title="Iron" />
            </td>
            <td class="res">
                Iron:
            </td>
            <td class="num">
                20          </td>
        </tr>
                <tr>
            <td class="ico">
                <img class="r4" src="img/x.gif" alt="Crop" title="Crop" />
            </td>
            <td class="res">
                Crop:
            </td>
            <td class="num">
                59          </td>
        </tr>
            </tbody>
</table>
    </div>

With HtmlAgilityPack you can do something like the following.

    // Requires: using System.IO; using System.Linq; using System.Net;
    // url is the address of the page to scrape.
    var webclient = new WebClient();
    byte[] htmlBytes = webclient.DownloadData(url);

    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    using (var htmlMemStream = new MemoryStream(htmlBytes))
    using (var htmlStreamReader = new StreamReader(htmlMemStream))
    {
        htmlDoc.LoadHtml(htmlStreamReader.ReadToEnd());
    }

    // Pick the production table by its id rather than taking the first table on the page.
    var table = htmlDoc.DocumentNode.Descendants("table")
        .FirstOrDefault(node => node.GetAttributeValue("id", "") == "production");

    // The first td with class "num" in that table belongs to the Lumber row.
    var lumberTd = table.Descendants("td")
        .FirstOrDefault(node => node.GetAttributeValue("class", "") == "num");

    string lumberValue = lumberTd.InnerText.Trim();

Be warned that 'FirstOrDefault()' can return null, so you should check each result before using it.
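
The null checks mentioned above could look roughly like the sketch below. The class `LumberReader` and the method `TryGetLumber` are hypothetical names made up for illustration, and the code assumes the table keeps its `id="production"` attribute and that the first `num` cell is the Lumber row, as in the snippet from the question.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class LumberReader
{
    // Hypothetical helper, not part of HtmlAgilityPack.
    // Returns the first "num" cell of the production table as an int,
    // or null when the table, the cell, or a numeric value is missing.
    public static int? TryGetLumber(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: the table is identified by id="production".
        var table = doc.DocumentNode.Descendants("table")
            .FirstOrDefault(t => t.GetAttributeValue("id", "") == "production");
        if (table == null) return null;

        var numCell = table.Descendants("td")
            .FirstOrDefault(td => td.GetAttributeValue("class", "") == "num");
        if (numCell == null) return null;

        int value;
        return int.TryParse(numCell.InnerText.Trim(), out value) ? value : (int?)null;
    }
}
```

Returning a nullable int lets the caller tell "page layout changed" apart from a real value, instead of hitting a NullReferenceException somewhere in the middle of the chain.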

Hope that helps.

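    // Requires: using System.Linq;
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(fileName);

    // Find the row whose icon is titled "Lumber", then read its "num" cell.
    // SelectNodes returns null and First throws if nothing matches.
    var result = doc.DocumentNode.SelectNodes("//div[@class='boxes-contents cf']//tbody/tr")
                    .First(tr => tr.Element("td").Element("img").Attributes["title"].Value == "Lumber")
                    .Elements("td")
                    .First(td => td.Attributes["class"].Value == "num")
                    .InnerText
                    .Trim();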
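
Since the question is about an XPath that keeps changing, one option is to anchor the expression on attributes instead of on position. The sketch below is only an illustration under assumptions: `ProductionXPath` and `TryGetProduction` are hypothetical names, and it assumes the table `id` and the image `title` attributes stay stable even when the surrounding markup moves.

```csharp
using HtmlAgilityPack;

public static class ProductionXPath
{
    // Hypothetical helper, not part of HtmlAgilityPack.
    // Returns the trimmed text of the "num" cell in the row whose icon
    // has the given title, or null when no such row exists.
    public static string TryGetProduction(HtmlDocument doc, string resourceName)
    {
        // Assumption: resourceName contains no single quote, because it is
        // embedded directly into the XPath string.
        string xpath = "//table[@id='production']//tr[td/img[@title='"
                       + resourceName + "']]/td[@class='num']";

        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        return cell == null ? null : cell.InnerText.Trim();
    }
}
```

With the snippet from the question loaded into `doc`, calling `ProductionXPath.TryGetProduction(doc, "Lumber")` should give "20" and `"Crop"` should give "59", regardless of where the table sits on the page.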


 