
Is there a way to regex multiline HTML blocks?

This is part of my HTML page. I want to find all names between two tags: <a href ... and </a></td>. The block spans multiple lines, and the 'new' keyword has a different number each time.

        <tr class="hl">
        <td class="vil fc">
            <a href="mypage.php?new=4645">
                name                </a>
        </td>

By default, the Regex class searches the entire input string, and it will find matches that span multiple lines. Whether a match can span multiple lines, however, depends on your pattern. If the pattern requires the whole match to be on a single line, it won't return any multi-line matches. For instance:

Dim input As String = "Canine
Dog
K9
D
o
g
Puppy"
Dim count As Integer = Regex.Matches(input, "Dog").Count 
Dim countMulti As Integer = Regex.Matches(input, "D\s*o\s*g").Count 
Console.WriteLine(count)      ' Outputs "1"
Console.WriteLine(countMulti) ' Outputs "2"

Since \s* means any amount of whitespace (including none, and including new-lines), the second pattern matches both the plain "Dog" and the second occurrence, where each letter is on its own line.

So, if it works by default and you're still asking about it, I assume the real problem is that your pattern doesn't allow for new-lines. For instance, this will work:

Dim input As String = "<tr class=""hl"">
<td class=""vil fc"">
<a href=""mypage.php?new=4645"">
        name                </a>
</td>"
Dim m As Match = Regex.Match(input, "<a[^>]*>((?:.|\s)*?)</a>")
If m.Success Then
    Dim g As String = m.Groups(1).Value
    Console.WriteLine(g)  ' Outputs a line break, then "        name                "
End If

A common assumption is that . will match anything, including new-line characters, but that is not usually the case. By default, . matches any character except a new-line. If you want . to match new-lines as well, you can specify the perhaps confusingly named RegexOptions.Singleline option. For instance, this works too:

Dim input As String = "<tr class=""hl"">
<td class=""vil fc"">
<a href=""mypage.php?new=4645"">
        name                </a>
</td>"
Dim m As Match = Regex.Match(input, "<a[^>]*>(.*?)</a>", RegexOptions.Singleline)
If m.Success Then
    Dim g As String = m.Groups(1).Value
    Console.WriteLine(g)  ' Outputs a line break, then "        name                "
End If

Alternatively, you can specify the single-line option right in the regex pattern itself by putting (?s) at the beginning:

Dim m As Match = Regex.Match(input, "(?s)<a[^>]*>(.*?)</a>")

To address the additional concern you mentioned in the comments: if you want to match only links that contain a newdid parameter, you could use a pattern like this (again with RegexOptions.Singleline, so that .*? can span lines):

<a\s+[^>]*href\s*=[^>]*newdid\s*=[^>]*>(.*?)</a>
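Since the question asks for all names rather than just the first one, here is a minimal sketch that runs the pattern above through Regex.Matches and trims the whitespace around each captured name. The sample HTML, the module name FindNames, and the variable names are made up for illustration; only the pattern comes from the answer.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module FindNames
    Sub Main()
        ' Hypothetical sample input: two links carry a newdid parameter, one does not.
        Dim html As String =
            "<tr class=""hl"">" & vbLf &
            "<td class=""vil fc"">" & vbLf &
            "<a href=""mypage.php?newdid=4645"">" & vbLf &
            "    first name    </a>" & vbLf &
            "</td>" & vbLf &
            "<td><a href=""other.php?id=7"">skipped</a></td>" & vbLf &
            "<td><a href=""mypage.php?newdid=12"">" & vbLf &
            "    second name</a></td>"

        ' Pattern from the answer: only <a> tags whose href contains newdid=.
        Dim pattern As String = "<a\s+[^>]*href\s*=[^>]*newdid\s*=[^>]*>(.*?)</a>"

        ' Singleline lets . match new-lines, so (.*?) can span several lines.
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.Singleline)
            ' Group 1 is the link text; Trim removes the surrounding whitespace.
            Console.WriteLine(m.Groups(1).Value.Trim())
        Next
        ' Prints "first name" and then "second name"; the link without newdid is skipped.
    End Sub
End Module
```

Calling Trim() on the captured group is a design choice for this sketch: the pattern captures everything between the tags, including the line break and indentation, so the name has to be cleaned up afterwards.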


 