Can I use HtmlAgilityPack to split an HTML document on a certain tag?

For example, I have a bunch of <tr> tags I'd like to collect. I need to split each of these tags into individual elements, for easier parsing on my part.

Is this possible?

An example of the markup:

<tr class="first-in-year">
  <td class="year">2011</td>

  <td class="img"><a href="/battlefield-3/61-27006/"><img src=
  "http://media.giantbomb.com/uploads/6/63038/1700748-bf3_thumb.jpg" alt=""></a></td>

  <td class="title">
    <a href="/battlefield-3/61-27006/">Battlefield 3</a>

    <p class="deck">Battlefield 3 is DICE's next installment in the franchise and
    will be on PC, PS3 and Xbox 360. The game will feature jets, prone, a
    single-player and co-op campaign, and 64-player multiplayer (on PC). It's due out
    in Fall of 2011.</p>
  </td>

  <td class="date">Expected: Q4 2011</td>

  <td><a href="/pc/60-94/" class="PC">PC</a>, <a href="/xbox-360/60-20/" class=
  "X360">X360</a>, <a href="/playstation-3/60-35/" class="PS3">PS3</a></td>
</tr>

<tr>
  <td class="year"></td>

  <td class="img"><a href="/forza-motorsport-4/61-33400/"><img src=
  "http://media.giantbomb.com/uploads/0/1992/1654849-forza4_thumb.jpg" alt=
  ""></a></td>

  <td class="title">
    <a href="/forza-motorsport-4/61-33400/">Forza Motorsport 4</a>

    <p class="deck">The next installment of Turn 10's racing franchise slated for
    release in Fall 2011. It is set to feature 16 player online races, dynamic race
    conditions, cars from over 80 manufacturers, and compatibility with Kinect, both
    on and off the racetrack.</p>
  </td>

  <td class="date">Expected: Oct 2011</td>

  <td><a href="/xbox-360/60-20/" class="X360">X360</a></td>
</tr>

<tr>
  <td class="year"></td>

  <td class="img"><a href="/max-payne-3/61-23398/"><img src=
  "http://media.giantbomb.com/uploads/0/1400/938434-custom_1237811317319_mp3_poster_thumb.jpg"
  alt=""></a></td>

  <td class="title">
    <a href="/max-payne-3/61-23398/">Max Payne 3</a>

    <p class="deck">The long awaited third instalment in Remedy's beloved series, in
    which an aging Max Payne faces one final chance to redeem himself.</p>
  </td>

  <td class="date">Expected: 2011</td>

  <td><a href="/pc/60-94/" class="PC">PC</a>, <a href="/playstation-3/60-35/" class=
  "PS3">PS3</a>, <a href="/xbox-360/60-20/" class="X360">X360</a></td>
</tr>

So I would have three elements here for this example. :)

You can't split it into multiple HTML documents on the tag, if that's what you mean. You can, however, select the individual TD elements and parse those separately.

The XPath selector //td selects every <td> element in the document; you can then pass each one to a parsing method.

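// LoadHtmlHowever() is a placeholder for however you load the document.
HtmlAgilityPack.HtmlDocument doc = LoadHtmlHowever();
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlAgilityPack.HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");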
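
If the goal is one element per <tr> rather than per <td>, the same approach should work with //tr. Below is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. RowSplitter, SplitRows and TitleOf are hypothetical names made up for illustration, not part of the library, and the td[@class='title'] path is taken from the markup in the question.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class, for illustration only.
public static class RowSplitter
{
    // Returns the outer HTML of every <tr> in the input as a separate string.
    public static List<string> SplitRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr");
        if (nodes == null)
        {
            return rows;
        }

        foreach (HtmlNode tr in nodes)
        {
            rows.Add(tr.OuterHtml);
        }
        return rows;
    }

    // Reads the link text from the "title" cell of a single row.
    // The leading "." keeps the XPath relative to this row; without it,
    // "//td" would search the whole document again.
    public static string TitleOf(HtmlNode tr)
    {
        HtmlNode link = tr.SelectSingleNode(".//td[@class='title']/a");
        return link == null ? null : link.InnerText.Trim();
    }
}
```

Each string returned by SplitRows can be loaded into its own HtmlDocument if you really want separate documents, but it is usually simpler to keep working with the HtmlNode for each row, as TitleOf does. Note that //tr also matches rows of nested tables, so a narrower path may be needed on a full page.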
