
Java regex: why are these two regular expressions different?

I have a Java string representing a div element:

String source = "<div class = \"ads\">\n" +
                "\t<dl style = \"font-size:14px; color:blue;\">\n" +
                "\t\t<li>\n" +
                "\t\t\t<a href = \"http://ggicci.blog.163.com\" target = \"_blank\">Ggicci's Blog</a>\n" +
                "\t\t</li>\n" +
                "\t</dl>\n" +
                "</div>\n";

which in HTML form is:

<div class = "ads">
    <dl style = "font-size:14px; color:blue;">
        <li>
            <a href = "http://ggicci.blog.163.com" target = "_blank">Ggicci's Blog</a>
        </li>
    </dl>
</div>

I wrote this regex to extract the dl element:

<dl[.\\s]*?>[.\\s]*?</div>

But it finds nothing, so I changed it to:

<dl(.|\\s)*?>(.|\\s)*?</div>

Then it works. So I tested like this:

System.out.println(Pattern.matches("[.\\s]", "a"));    // false
System.out.println(Pattern.matches("[abc\\s]", "a"));  // true

So why can't the '.' match 'a'?

Inside square brackets, characters are treated literally. [.\\s] means "match a dot, a backslash, or an s".

(.|\\s) is equivalent to .


I think you really want the following regex:

<dl[^>]*>.*?</div>

+1 for the above.

I would do:

<dl[^>]*>(.*?)</dl>

to match the content of the dl element.

The syntax [.\\s] makes no sense because, as Daniel said, the . just means "a dot" in this context.

Why can't you replace your [.\\s] with a much simpler . ?

When you include regexes in a post, it's a good idea to post them as you're actually using them: in this case, as Java string literals.

"[.\\s]" is a Java string literal representing the regex [.\s]; it matches a literal dot or a whitespace character. Your regex is not trying to match a backslash or an 's', as others have said, but the crucial factor is that . loses its special meaning inside a character class.
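To check the character-class behaviour for yourself, here is a minimal sketch using only java.util.regex.Pattern. The class name CharClassDemo and the helper matches are hypothetical names chosen for illustration; they are not from the question.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: true if the ENTIRE input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The Java literal "[.\\s]" is the regex [.\s]:
        // a literal dot or one whitespace character.
        System.out.println(matches("[.\\s]", "."));   // true: literal dot
        System.out.println(matches("[.\\s]", " "));   // true: whitespace via \s
        System.out.println(matches("[.\\s]", "a"));   // false: neither dot nor whitespace
        System.out.println(matches("[.\\s]", "\\"));  // false: a backslash is not in the class
        System.out.println(matches("[.\\s]", "s"));   // false: the letter s is not in the class

        // Outside a character class, '.' is the wildcard again,
        // but by default it does not match a line break.
        System.out.println(matches(".", "a"));        // true
        System.out.println(matches(".", "\n"));       // false
    }
}
```

The last two lines also hint at why the (.|\\s) version worked on the multi-line input: the \s alternative picks up the line breaks that a plain . skips.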

"(.|\\s)" is a Java string literal representing the regex (.|\s); it matches (anything but a line separator character OR any whitespace character). It works as you intended, but don't use it! It leaves you extremely vulnerable to catastrophic backtracking, as explained in this answer.

But no worries, all you really need to do is use DOTALL mode (also known as single-line mode), which enables . to match anything, including line separator characters.

(?s)<dl\b[^>]*>.*?</dl>
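As a rough sketch of how that regex could be applied to the string from the question, the following compares the pattern with and without the (?s) flag. The class name DlExtractDemo and the helper firstMatch are hypothetical names introduced here for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DlExtractDemo {
    // The source string from the question.
    static final String SOURCE = "<div class = \"ads\">\n" +
            "\t<dl style = \"font-size:14px; color:blue;\">\n" +
            "\t\t<li>\n" +
            "\t\t\t<a href = \"http://ggicci.blog.163.com\" target = \"_blank\">Ggicci's Blog</a>\n" +
            "\t\t</li>\n" +
            "\t</dl>\n" +
            "</div>\n";

    // Hypothetical helper: first substring matching the regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Without (?s), '.' cannot cross the line breaks inside the dl element,
        // so no match is found and this prints null.
        System.out.println(firstMatch("<dl\\b[^>]*>.*?</dl>", SOURCE));

        // With (?s), '.' also matches line separators, so the whole
        // dl element (opening tag through </dl>) is returned.
        System.out.println(firstMatch("(?s)<dl\\b[^>]*>.*?</dl>", SOURCE));
    }
}
```

Passing Pattern.DOTALL as the second argument to Pattern.compile should have the same effect as the inline (?s) flag, if you prefer to keep the flag out of the regex itself.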
