Is there a way to regex multiline HTML blocks?

This is part of my HTML page. I want to find all names between two tags: <a href ... </a></td>. The block is multiline, and the 'new' parameter has a different number each time.

        <tr class="hl">
        <td class="vil fc">
            <a href="mypage.php?new=4645">
                name                </a>
        </td>

The Regex class, by default, does search an entire multi-line string, and it will find matches that span multiple lines. However, whether the matches can span multiple lines depends on your pattern. If the pattern you give it says that the matches must all be on a single line, then it won't return any multi-line matches, obviously. So, for instance:

Dim input As String = "Canine
Dog
K9
D
o
g
Puppy"
Dim count As Integer = Regex.Matches(input, "Dog").Count 
Dim countMulti As Integer = Regex.Matches(input, "D\s*o\s*g").Count 
Console.WriteLine(count)      ' Outputs "1"
Console.WriteLine(countMulti) ' Outputs "2"

Since \s* matches any amount of whitespace (including new-lines), the second pattern also matches the second occurrence, where each letter is on its own line.

So, if it works by default, and you're asking about it, I assume the real problem is that you aren't allowing for new-lines in your pattern. So, for instance, this will work:

Dim input As String = "<tr class=""hl"">
<td class=""vil fc"">
<a href=""mypage.php?New=4645"">
        name                </a>
</td>"
Dim m As Match = Regex.Match(input, "<a[^>]*>((?:.|\s)*?)</a>")
If m.Success Then
    Dim g As String = m.Groups(1).Value
    Console.WriteLine(g)  ' Outputs a line break followed by "        name                "
End If

A common assumption is that . matches anything, including new-line characters, but that is usually not the case. By default, . matches any character except a new-line. If you want . to match new-lines as well, specify the (perhaps confusingly named) RegexOptions.Singleline option. So, for instance, this works too:

Dim input As String = "<tr class=""hl"">
<td class=""vil fc"">
<a href=""mypage.php?New=4645"">
        name                </a>
</td>"
Dim m As Match = Regex.Match(input, "<a[^>]*>(.*?)</a>", RegexOptions.Singleline)
If m.Success Then
    Dim g As String = m.Groups(1).Value
    Console.WriteLine(g)  ' Outputs a line break followed by "        name                "
End If
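
To see in isolation what the option changes, here is a minimal sketch. The module name and the two-line sample string are made up for illustration; only Regex.IsMatch and RegexOptions.Singleline come from the framework:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module DotDemo
    Sub Main()
        ' Hypothetical sample: "a" and "b" separated by a single line feed
        Dim sample As String = "a" & vbLf & "b"

        ' Without Singleline, "." does not match the line feed
        Console.WriteLine(Regex.IsMatch(sample, "a.b"))                           ' False

        ' With Singleline, "." matches the line feed as well
        Console.WriteLine(Regex.IsMatch(sample, "a.b", RegexOptions.Singleline))  ' True
    End Sub
End Module
```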

Alternatively, you can specify the single-line option right in the regex pattern itself by putting (?s) at the beginning:

Dim m As Match = Regex.Match(input, "(?s)<a[^>]*>(.*?)</a>")
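
Since the question asks for all names rather than just the first, a rough sketch along the same lines could use Regex.Matches and trim the captured text. The two-row sample HTML and the module name here are invented for illustration, so adjust them to your real page:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module AllNames
    Sub Main()
        ' Hypothetical sample with two rows; the real page will differ
        Dim html As String =
            "<td class=""vil fc"">" & vbLf &
            "    <a href=""mypage.php?new=4645"">" & vbLf &
            "        first name        </a>" & vbLf &
            "</td>" & vbLf &
            "<td class=""vil fc"">" & vbLf &
            "    <a href=""mypage.php?new=17"">" & vbLf &
            "        second name       </a>" & vbLf &
            "</td>"

        ' (?s) lets "." cross line breaks; the lazy .*? stops at the nearest </a>
        For Each m As Match In Regex.Matches(html, "(?s)<a[^>]*>(.*?)</a>")
            ' Trim removes the surrounding line breaks and padding
            Console.WriteLine(m.Groups(1).Value.Trim())
        Next
        ' Prints "first name" and then "second name"
    End Sub
End Module
```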

To address the additional concern mentioned in the comments: if you want to match only links that contain a newdid parameter, you could use a pattern like this (together with RegexOptions.Singleline or a leading (?s), since the link text spans several lines):

<a\s+[^>]*href\s*=[^>]*newdid\s*=[^>]*>(.*?)</a>
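
As a sketch of how that pattern might be used, assuming a made-up sample with one link that has a newdid parameter and one that does not (the file names, values, and module name are hypothetical):

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module NewdidLinks
    Sub Main()
        ' Hypothetical sample: only the second link carries a newdid parameter
        Dim html As String =
            "<a href=""other.php?id=1"">skipped</a>" & vbLf &
            "<a href=""mypage.php?newdid=4645"">" & vbLf &
            "    wanted    </a>"

        Dim pattern As String = "<a\s+[^>]*href\s*=[^>]*newdid\s*=[^>]*>(.*?)</a>"

        ' Singleline is needed because the link text spans several lines
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.Singleline)
            Console.WriteLine(m.Groups(1).Value.Trim())
        Next
        ' Prints "wanted"
    End Sub
End Module
```

Because [^>]* cannot cross a ">", the newdid check stays inside the opening tag, so the first link is not picked up by accident.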
