Regex \n doesn't work

I'm trying to parse text out of two lines of HTML.

Dim PattStats As New Regex("class=""head"">(.+?)</td>"+ 
                           "\n<td>(.+?)</td>")
Dim makor As MatchCollection = PattStats.Matches(page)

For Each MatchMak As Match In makor
    ListView3.Items.Add(MatchMak.Groups(1).Value)
Next

I added the \n to match the next line, but for some reason it doesn't work. Here's the source I'm running the regex against.

<table class="table table-striped table-bordered table-condensed">
  <tbody>
    <tr>
      <td class="head">Health Points:</td>
      <td>445 (+85 / per level)</td>
      <td class="head">Health Regen:</td>
      <td>7.25</td>
    </tr>
    <tr>
      <td class="head">Energy:</td>
      <td>200</td>
      <td class="head">Energy Regen:</td>
      <td>50</td>
    </tr>
    <tr>
      <td class="head">Damage:</td>
      <td>53 (+3.2 / per level)</td>
      <td class="head">Attack Speed:</td>
      <td>0.694 (+3.1 / per level)</td>
    </tr>           
    <tr>
      <td class="head">Attack Range:</td>
      <td>125</td>
      <td class="head">Movement Speed:</td>
      <td>325</td>
    </tr>
    <tr>
      <td class="head">Armor:</td>
      <td>16.5 (+3.5 / per level)</td>
      <td class="head">Magic Resistance:</td>
      <td>30 (+1.25 / per level)</td>
    </tr>       
    <tr>
      <td class="head">Influence Points (IP):</td>
      <td>3150</td>
      <td class="head">Riot Points (RP):</td>
      <td>975</td>
    </tr>
  </tbody>
</table>

I'd like to match the first <td class...> and the following line in one regex :/

Description

This regex will find td tags and return them in groups of two.

<td\b[^>]*>([^<]*)<\/td>[^<]*<td\b[^>]*>([^<]*)<\/td>


Summary

  • <td\b[^>]*> matches the first opening td tag, including any attributes
  • ([^<]*) captures the first cell's inner text; it is greedy but stops at the next <, so it assumes the cell contains no nested tags
  • <\/td> matches the closing tag
  • [^<]* skips everything up to the next <, including the line break and indentation; this assumes there are no other tags between the first and second td tags
  • <td\b[^>]*> matches the second opening td tag, including any attributes
  • ([^<]*) captures the second cell's inner text, under the same no-nested-tags assumption
  • <\/td> matches the closing tag
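
To see why the question's pattern fails where this one matches, here is a minimal sketch. It assumes the two cells are separated by a line feed followed by indentation, as the pasted source suggests; the module name and the sample string are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WhitespaceDemo
    Sub Main()
        ' Two cells on separate lines; the second line is indented, as in the question's HTML.
        Dim html As String = "<td class=""head"">Energy:</td>" & vbLf & "      <td>200</td>"

        ' The question's pattern: "\n" must be followed immediately by "<td>",
        ' so the indentation before the second cell prevents a match.
        Dim strict As New Regex("class=""head"">(.+?)</td>\n<td>(.+?)</td>")

        ' The pattern from this answer: [^<]* absorbs the line break and the indentation.
        Dim tolerant As New Regex("<td\b[^>]*>([^<]*)<\/td>[^<]*<td\b[^>]*>([^<]*)<\/td>")

        Console.WriteLine(strict.IsMatch(html))   ' False
        Console.WriteLine(tolerant.IsMatch(html)) ' True
    End Sub
End Module
```

If the page uses Windows line endings, the \r in front of the \n would also stop the original pattern from matching, while [^<]* skips over it as well.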

Groups

Group 0 contains the entire match (both td tags and the whitespace between them)

  1. contains the inner text of the first td cell
  2. contains the inner text of the second td cell
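
The loop in the question only reads Groups(1), so the value cell never reaches the list. A minimal sketch of reading both groups together might look like the following; the module name and the sample row are illustrative, and Console.WriteLine stands in for whatever the real code does with each pair (for example, adding it to the ListView).

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupsDemo
    Sub Main()
        ' One table row from the question, rebuilt with line feeds and indentation.
        Dim html As String = "<tr>" & vbLf &
            "  <td class=""head"">Attack Range:</td>" & vbLf &
            "  <td>125</td>" & vbLf &
            "  <td class=""head"">Movement Speed:</td>" & vbLf &
            "  <td>325</td>" & vbLf &
            "</tr>"

        Dim re As New Regex("<td\b[^>]*>([^<]*)<\/td>[^<]*<td\b[^>]*>([^<]*)<\/td>", RegexOptions.IgnoreCase)

        For Each m As Match In re.Matches(html)
            ' Groups(1) is the label cell, Groups(2) is the value cell.
            Console.WriteLine("{0} {1}", m.Groups(1).Value, m.Groups(2).Value)
        Next
        ' Prints:
        ' Attack Range: 125
        ' Movement Speed: 325
    End Sub
End Module
```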

VB.NET Code Example:

Imports System
Imports System.Text.RegularExpressions
Module Module1
  Sub Main()
    Dim sourcestring As String = "replace with your source string"
    ' [^<]* between the two cells absorbs the line break and indentation
    Dim re As Regex = New Regex("<td\b[^>]*>([^<]*)<\/td>[^<]*<td\b[^>]*>([^<]*)<\/td>", RegexOptions.IgnoreCase Or RegexOptions.Singleline)
    Dim mc As MatchCollection = re.Matches(sourcestring)
    Dim mIdx As Integer = 0
    For Each m As Match In mc
      ' Group 0 is the whole match; groups 1 and 2 are the two cell texts
      For groupIdx As Integer = 0 To m.Groups.Count - 1
        Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()(groupIdx), m.Groups(groupIdx).Value)
      Next
      mIdx = mIdx + 1
    Next
  End Sub
End Module

$matches Array:
(
    [0] => Array
        (
            [0] => <td class="head">Health Points:</td>
          <td>445 (+85 / per level)</td>
            [1] => <td class="head">Health Regen:</td>
          <td>7.25</td>
            [2] => <td class="head">Energy:</td>
          <td>200</td>
            [3] => <td class="head">Energy Regen:</td>
          <td>50</td>
            [4] => <td class="head">Damage:</td>
          <td>53 (+3.2 / per level)</td>
            [5] => <td class="head">Attack Speed:</td>
          <td>0.694 (+3.1 / per level)</td>
            [6] => <td class="head">Attack Range:</td>
          <td>125</td>
            [7] => <td class="head">Movement Speed:</td>
          <td>325</td>
            [8] => <td class="head">Armor:</td>
          <td>16.5 (+3.5 / per level)</td>
            [9] => <td class="head">Magic Resistance:</td>
          <td>30 (+1.25 / per level)</td>
            [10] => <td class="head">Influence Points (IP):</td>
          <td>3150</td>
            [11] => <td class="head">Riot Points (RP):</td>
          <td>975</td>
        )

    [1] => Array
        (
            [0] => Health Points:
            [1] => Health Regen:
            [2] => Energy:
            [3] => Energy Regen:
            [4] => Damage:
            [5] => Attack Speed:
            [6] => Attack Range:
            [7] => Movement Speed:
            [8] => Armor:
            [9] => Magic Resistance:
            [10] => Influence Points (IP):
            [11] => Riot Points (RP):
        )

    [2] => Array
        (
            [0] => 445 (+85 / per level)
            [1] => 7.25
            [2] => 200
            [3] => 50
            [4] => 53 (+3.2 / per level)
            [5] => 0.694 (+3.1 / per level)
            [6] => 125
            [7] => 325
            [8] => 16.5 (+3.5 / per level)
            [9] => 30 (+1.25 / per level)
            [10] => 3150
            [11] => 975
        )

)

Disclaimer

Parsing HTML with a regex is really not the best solution, as there are a ton of edge cases that we can't predict. However, if the input string is always this basic and you're willing to accept the risk of the regex not working 100% of the time, then this solution would probably work for you.
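
For comparison, here is a rough sketch of the same extraction without a regex, using XDocument from System.Xml.Linq. This is a swapped-in technique, not part of the answer above, and it rests on a strong assumption: the fragment must be well-formed XML, which the table in the question happens to be but real-world HTML often is not. The module name and the shortened sample string are illustrative.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module ParserDemo
    Sub Main()
        ' Shortened, well-formed version of the table from the question (assumption: valid XML).
        Dim html As String = "<table><tbody><tr><td class=""head"">Armor:</td><td>16.5 (+3.5 / per level)</td></tr></tbody></table>"

        Dim doc As XDocument = XDocument.Parse(html)

        For Each cell As XElement In doc.Descendants("td")
            Dim cls As XAttribute = cell.Attribute("class")
            If cls IsNot Nothing AndAlso cls.Value = "head" Then
                ' The value is the next td sibling after the label cell.
                Dim valueCell As XElement = cell.ElementsAfterSelf("td").FirstOrDefault()
                If valueCell IsNot Nothing Then
                    Console.WriteLine("{0} {1}", cell.Value, valueCell.Value)
                End If
            End If
        Next
        ' Prints: Armor: 16.5 (+3.5 / per level)
    End Sub
End Module
```

XDocument.Parse throws an XmlException on markup that is not well-formed, so for arbitrary pages a dedicated HTML parser would be the safer choice.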
