
Java Regular Expressions - Matching the First Occurrence of a Pattern

I'm matching URLs against a regular expression, testing if they reflect a "shutdown" command.

Here's a URL that performs a shutdown:

/exec?debug=true&command=shutdown&f=0

Here's another, legitimate but confusing URL that performs a shutdown:

/exec?commando=yes&zcommand=34&command=shutdown&p

Now, I must ensure there's only one command=... parameter and that it is command=shutdown. Alternatively, I can live with ensuring that the first command=... parameter is command=shutdown.

Here's my test for the requested regular expression:

/exec?version=0.4&command=shutdown&out=JSON&zcommand=1

Should match

/exec?version=0.4&command=startup&out=JSON&zcommand=1&commando=shutdown

Should fail to match

/exec?command=shutdown&out=JSON

Should match

/exec?version=0.4&command=admin&out=JSON&zcommand=1&command=shutdown

Should fail to match

Here's my baseline, a regular expression that passes all of the above tests except the last one:

^/exec?(.*\&)*command=shutdown(\&.*)*$

The problem is with the occurrence of more than one command=..., where the first one is not shutdown.

I tried using lookbehind:

^/exec?(.*\&)*(?<!(\&|\?)command=.*)command=shutdown(\&.*)*$

But I'm getting:

Look-behind group does not have an obvious maximum length near index 31

I even tried atomic grouping. To no avail. I can't make the following URL NOT match:

/exec?version=0.4&command=admin&out=JSON&zcommand=1&command=shutdown

Can anyone help with a regular expression that passes all the tests?

Clarifications

I see I owe you some context.

My task is to configure a Filter that guards the entrance of all our system's servlets and verifies there's an open HTTP session (in other words: that a successful login has occurred). The filter also allows configuring which URLs do not require login.

Some exceptions are easy: /login does not need login. Calls to localhost do not need login.

But sometimes it gets complicated. Take the shutdown command, which cannot require login while other commands can and should (the strange reason for that is out of the scope of my question).

Since it's a security matter, I can't allow users to merely append &command=shutdown to a URL and bypass the filter.

So I really need a regular expression, or otherwise I'll need to redefine the configuration specs.

You would need to do it in multiple steps:

(1) Find a match of ^(?=\\/exec\\?).*?(?<=[?&])command=([^&]+)

(2) Check if the captured value is shutdown
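The two steps above could be sketched in Java roughly as follows. This is only an illustration under some assumptions: the class and method names (`FirstCommandCheck`, `isShutdown`) are invented for the example, and the pattern is the one from step (1) with the unneeded escaping of `/` dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstCommandCheck {
    // Step (1): capture the value of the first command=... parameter of an /exec URL.
    // The lookbehind (?<=[?&]) keeps names such as "zcommand" from being mistaken for "command".
    private static final Pattern FIRST_COMMAND =
        Pattern.compile("^(?=/exec\\?).*?(?<=[?&])command=([^&]+)");

    // Step (2): compare the captured value with "shutdown" in plain Java.
    static boolean isShutdown(String url) {
        Matcher m = FIRST_COMMAND.matcher(url);
        return m.find() && m.group(1).equals("shutdown");
    }

    public static void main(String[] args) {
        String[] urls = {
            "/exec?version=0.4&command=shutdown&out=JSON&zcommand=1",
            "/exec?version=0.4&command=startup&out=JSON&zcommand=1&commando=shutdown",
            "/exec?command=shutdown&out=JSON",
            "/exec?version=0.4&command=admin&out=JSON&zcommand=1&command=shutdown"
        };
        for (String url : urls) {
            System.out.println(isShutdown(url) + "  " + url);
        }
    }
}
```

Note that this only enforces the weaker requirement from the question (the first command=... parameter is shutdown); it does not reject URLs that carry further command=... parameters.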

This tested (and fully commented) regex solution meets all your requirements:

import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        Pattern re = Pattern.compile(
            "  # Match URI having command=shutdown query variable value. \n" +
            "  ^                          # Anchor to start of string.   \n" +
            "  (?:[^:/?\\#\\s]+:)?        # URI scheme (Optional).       \n" +
            "  (?://[^/?\\#\\s]*)?        # URI authority (Optional).    \n" +
            "  [^?\\#\\s]*                # URI path.                    \n" +
            "  \\?                        # Literal start of URI query.  \n" +
            "    # Match var=value pairs preceding 'command=xxx'.        \n" +
            "  (?:                        # Zero or more 'var=values'    \n" +
            "    (?!command=)             # only if not-'command=xxx'.   \n" +
            "    [^&\\#\\s]*              # Next var=value.              \n" +
            "    &                        # var=value separator.         \n" +
            "  )*                         # Zero or more 'var=values'    \n" +
            "  command=shutdown           # variable and value to match. \n" +
            "    # Match var=value pairs following 'command=shutdown'.   \n" +
            "  (?:                        # Zero or more 'var=values'    \n" +
            "    &                        # var=value separator.         \n" +
            "    (?!command=)             # only if not-'command=xxx'.   \n" +
            "    [^&\\#\\s]*              # Next var=value.              \n" +
            "  )*                         # Zero or more 'var=values'    \n" +
            "  (?:\\#\\S*)?               # URI fragment (Optional).     \n" +
            "  $                          # Anchor to end of string.", 
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
        String s = "/exec?version=0.4&command=shutdown&out=JSON&zcommand=1";
            // Should match
//      String s = "/exec?version=0.4&command=startup&out=JSON&zcommand=1&commando=shutdown";
            // Should fail to match 
//      String s = "/exec?command=shutdown&out=JSON";
            // Should match
//      String s = "/exec?version=0.4&command=admin&out=JSON&zcommand=1&command=shutdown";
            // Should fail to match
        Matcher m = re.matcher(s);
        if (m.find()) {
            // Successful match
            System.out.print("Match found.\n");
        } else {
            // Match attempt failed
            System.out.print("No match found.\n");
        } 
    }
}

The above regex matches any RFC3986 valid URI having any scheme, authority, path, query or fragment components, but it must have one (and only one) query "command" variable whose value must be exactly, but case insensitively: "shutdown".

A carefully crafted complex regex is perfectly fine (and maintainable) to use when written with proper indentation and commented steps (as shown above). (For more information on using regex to validate a URI, see my article: Regular Expression URI Validation)

Ok. I thank you all for your great answers! I tried some of the suggestions, struggled with others, and all in all I have to agree that even if the right regex exists, it looks terrible, is unmaintainable, and can serve well as a nasty university exercise, but not in a real system configuration.

I also realize that since a Filter is involved here, and the Filter already parses its own URI, it is absolutely ridiculous to glue all the URI parts back into a string and match it against a regular expression. What was I thinking??

I'll therefore redesign the Filter and its configuration.

Thanks a lot, people! I appreciate the help :)

Noam Rotem.

PS - why was I getting a userXXXX nick? Very strange...

If you can live with just accepting the first match, you could just use \Wcommand=([^&]+) (written "\\Wcommand=([^&]+)" as a Java string literal) and fetch the first group.

Otherwise, you could call Matcher.find twice to test for subsequent matches and use only the first one. Why do you want to do this with a single complex regex?
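As a rough sketch of the "call Matcher.find twice" idea, something along these lines might work. The class and method names (`FindTwice`, `isOnlyShutdown`) are hypothetical, and the pattern is the simple one suggested above, not a full URL validator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindTwice {
    // \W requires a non-word character (such as '?' or '&') right before "command=",
    // so "zcommand=" is skipped, and "commando=" fails on the literal '='.
    private static final Pattern COMMAND = Pattern.compile("\\Wcommand=([^&]+)");

    // True only if the first command is "shutdown" and no second command parameter follows.
    static boolean isOnlyShutdown(String url) {
        Matcher m = COMMAND.matcher(url);
        if (!m.find()) {
            return false;                    // no command parameter at all
        }
        String first = m.group(1);           // value of the first command parameter
        boolean hasSecond = m.find();        // resumes searching after the first match
        return first.equals("shutdown") && !hasSecond;
    }

    public static void main(String[] args) {
        System.out.println(isOnlyShutdown("/exec?command=shutdown&out=JSON"));
        System.out.println(isOnlyShutdown("/exec?command=shutdown&command=admin"));
    }
}
```

Dropping the `hasSecond` check gives the weaker "first command wins" behaviour the question also allows.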

If this can be done with a single regular expression, and it may well be possible, it will be so complex as to be unreadable, and thus unmaintainable, as the intent of the logic will be lost. Even if it is "documented" it will still be much less obvious to someone who just knows Java.

Solving problems like this is an abuse of a regular expression, as much as driving screws with a hammer is abusing both the hammer and the screw.

A much better approach would be to use the URI object to parse the entire thing, domain and all, pull off the query parameters, and then write a simple loop that walks through them and decides, based on your business logic, what is a shutdown and what isn't. Then it will be simple, self-documenting and probably more efficient (not that that should be a concern).
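A minimal sketch of that approach, assuming `java.net.URI` is used for the parsing and that "exactly one command parameter, and it is shutdown" is the business rule. The class and method names (`QueryCommandCheck`, `isShutdownOnly`) are made up for illustration.

```java
import java.net.URI;

public class QueryCommandCheck {
    // True if the query has exactly one "command" parameter and its value is "shutdown".
    static boolean isShutdownOnly(String url) {
        // URI.create throws IllegalArgumentException for a malformed URI.
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return false;                        // no query string at all
        }
        int commands = 0;
        boolean shutdown = false;
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            if (name.equals("command")) {        // exact name: "zcommand" and "commando" are ignored
                commands++;
                shutdown = value.equals("shutdown");
            }
        }
        return commands == 1 && shutdown;
    }

    public static void main(String[] args) {
        System.out.println(isShutdownOnly("/exec?version=0.4&command=shutdown&out=JSON&zcommand=1"));
        System.out.println(isShutdownOnly("/exec?version=0.4&command=admin&out=JSON&zcommand=1&command=shutdown"));
    }
}
```

The sketch compares the raw (still percent-encoded) query text, so an encoded parameter name would not be recognized as `command`; inside a servlet Filter it would probably be safer to rely on the parameters the container has already parsed.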

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski

Downvote all you want, but the best solution for this specific example is not a regular expression; given the "clarification", even more so.

This is especially true in a business environment, where you have to share code not only with the people working with you now but also with an unknown talent pool in the future. The "accepted" answer should never pass a corporate code review. Zawinski's quote applies to this situation exactly!

I'm not a Java coder, but try this (works in Perl):

^(?=\/exec\?)(?:[^&]+(?<![?&]command)=[^&]+&)*(?<=[?&])command=shutdown(?:&|$)

To match the first occurrence of command=shutdown use this:

Pattern.compile("^((?!command=).)+command=shutdown.*$");

The results will look like this:

"/exec?version=0.4&command=shutdown&out=JSON&zcommand=1" => true
"/exec?command=shutdown&out=JSON" => true
"/exec?version=0.4&command=startup&out=JSON&zcommand=1&commando=shutdown" => false
"/exec?commando=yes&zcommand=34&command=shutdown&p" => false

If you want to match strings that ONLY contain one 'command=' use this:

Pattern.compile("^((?!command=).)+command=shutdown((?!command=).)+$");

Please note that negation of this kind (a negative lookahead checked at every character) is not what regular expressions are designed for, and performance might not be the best.
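To see how the two lookahead patterns above behave on the URLs from the question, a small harness like the following can be used. The class and method names (`LookaheadPatterns`, `firstIsShutdown`, `onlyOneCommand`) are invented for this sketch; the patterns are copied unchanged.

```java
import java.util.regex.Pattern;

public class LookaheadPatterns {
    // Every character before the match must not start the text "command=".
    private static final Pattern FIRST =
        Pattern.compile("^((?!command=).)+command=shutdown.*$");
    // Same, and additionally no "command=" may appear after the match.
    private static final Pattern ONLY =
        Pattern.compile("^((?!command=).)+command=shutdown((?!command=).)+$");

    static boolean firstIsShutdown(String url) {
        return FIRST.matcher(url).matches();
    }

    static boolean onlyOneCommand(String url) {
        return ONLY.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "/exec?version=0.4&command=shutdown&out=JSON&zcommand=1",
            "/exec?command=shutdown&out=JSON",
            "/exec?version=0.4&command=startup&out=JSON&zcommand=1&commando=shutdown",
            "/exec?commando=yes&zcommand=34&command=shutdown&p"
        };
        for (String url : urls) {
            System.out.println(firstIsShutdown(url) + " / " + onlyOneCommand(url) + "  " + url);
        }
    }
}
```

One thing the harness makes visible: neither pattern checks for a `?` or `&` boundary, so the text `command=` inside `zcommand=` counts as a command parameter. That is why the last URL is rejected by both patterns, and why the second pattern also rejects the first URL.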

Try this:

Pattern p = Pattern.compile(
    "^/exec\\?(?:(?:(?!\\1)command=shutdown()|(?!command=)\\w+(?:=[^&]+)?)(?:&|$))+$\\1");

Or a little more readably:

^/exec\?
(?:
  (?:
    (?!\1)command=shutdown()
    |
    (?!command=)\w+(?:=[^&]+)?
  )
  (?:&|$)
)+$
\1

The main body of the regex is an alternation that matches either a shutdown command or a parameter whose name is not command. If it does match a shutdown command, the empty group in that branch "captures" an empty string. It doesn't need to consume anything, because we're only using it as a checkbox, confirming en passant that one of the parameters was a shutdown command.

The negative lookahead, (?!\1), prevents it from matching two or more shutdown commands. I don't know if that's really necessary, but it's a good opportunity to demonstrate (1) how to negate a "back-assertion", and (2) that a backreference can appear before the group it refers to in certain circumstances (what's known as a forward reference).

When the whole URL has been consumed, the backreference (\1) acts like a zero-width assertion. If one of the parameters was command=shutdown, the backreference will succeed. Otherwise it will fail even though it's only trying to match an empty string, because the group it refers to didn't participate in the match.
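The "checkbox" behaviour described above can be tried out with a few lines of Java. This is just a sketch wrapped around the pattern from this answer; the class and method names (`CheckboxGroupDemo`, `accepts`) are hypothetical.

```java
import java.util.regex.Pattern;

public class CheckboxGroupDemo {
    // Group 1 is the empty "checkbox" group; the trailing \1 only succeeds
    // if that group took part in the match, i.e. a shutdown command was seen.
    private static final Pattern P = Pattern.compile(
        "^/exec\\?(?:(?:(?!\\1)command=shutdown()|(?!command=)\\w+(?:=[^&]+)?)(?:&|$))+$\\1");

    static boolean accepts(String url) {
        return P.matcher(url).matches();
    }

    public static void main(String[] args) {
        // One shutdown command among other parameters.
        System.out.println(accepts("/exec?version=0.4&command=shutdown&out=JSON&zcommand=1"));
        // A different command comes first, so neither branch of the alternation can match it.
        System.out.println(accepts("/exec?version=0.4&command=admin&out=JSON&zcommand=1&command=shutdown"));
        // No command at all: the checkbox group never participates, so \1 fails.
        System.out.println(accepts("/exec?out=JSON"));
    }
}
```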

But I have to concur with the other responders: when your regexes get this complicated, you should be thinking seriously about switching to a different approach.

EDIT: It works for me. Here's the demo.
