
Unclosed character class near index nnn

I'm borrowing a rather complex regex from some PHP Textile implementations (open source, properly attributed) for a simple, not quite feature-complete Java implementation, textile4j, that I'm porting to GitHub and syncing to Maven Central. (The original code was written to provide a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom dependencies available in Maven Central.)

Unfortunately, the Textile regular expressions, which work in the context of preg_replace_callback in PHP, fail in Java with the following exception:

java.util.regex.PatternSyntaxException: Unclosed character class near index 217

The statement is obvious, the solution is elusive.

Here's the raw, multiline regex from the PHP implementation:

return preg_replace_callback('/
    (^|(?<=[\s>.\(])|[{[]) # $pre
    "                      # start
    (' . $this->c . ')     # $atts
    ([^"]+?)               # $text
    (?:\(([^)]+?)\)(?="))? # $title
    ":
    ('.$this->urlch.'+?)   # $url
    (\/)?                  # $slash
    ([^\w\/;]*?)           # $post
    ([\]}]|(?=\s|$|\)))
    /x',callback,input);

Cleverly, I got the Textile class to "show me the code" being used in this regex with a simple echo, which produced the following rather long regular expression:

(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))

I've uncovered a couple of areas that could be causing parsing errors, using online tools such as RegExr by gskinner and RegexPlanet. However, none of those particulars fix the error.

I suspect that there is a range issue hidden in one of the character classes, or a Unicode ordering issue hiding somewhere, but I can't find it.

Any ideas?

I'm also curious why PHP doesn't throw a similar error. For example, using RegExr I found one "passive subexpression" that was poorly handled, but changing it didn't fix the Java exception and didn't alter behavior in PHP, as shown below.

In $title, switch the escaped paren:

        (?:\(([^)]+?)\)(?="))? # $title
        ...^
        (?:(\([^)]+?)\)(?="))? # $title
        ....^

Thanks, Tim

Edit: adding a Java String interpretation (with escapes) of the Textile regex, as determined by RegexPlanet:

"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"

@CodeJockey is correct: there's a square bracket in one of your character classes that needs to be escaped. []] or [^]] are okay because the ] is the first character other than the negating ^, but in Java an unescaped [ anywhere in a character class is read as the start of a nested class, which here ends up as a syntax error.
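The rule can be checked in isolation with the small `$pre` class from the question. The following is only a minimal sketch; the class name `BracketEscapeDemo` and the helper `compileError` are made up for illustration and are not part of textile4j:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class BracketEscapeDemo {
    /** Hypothetical helper: returns null if the regex compiles, else the error description. */
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Unescaped '[' inside a class: Java opens a nested class that is never closed.
        System.out.println("[{[]   -> " + compileError("[{[]"));
        // Escaped '[': compiles and matches either '{' or '['.
        System.out.println("[{\\[]  -> " + compileError("[{\\[]"));
        // A ']' that comes first in the class is taken literally, so this compiles too.
        System.out.println("[]}]   -> " + compileError("[]}]"));
        System.out.println("[^]]   -> " + compileError("[^]]"));
    }
}
```

A null result means the pattern compiled; anything else is the description carried by the PatternSyntaxException.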

Ironically, the original regex contains many backslashes that aren't needed even in PHP. It also escapes / because that's what it uses as the regex delimiter. After weeding all those out I came up with this Java regex:

"(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))"

Whether it's the best regex I have no idea, not knowing how it's being used.
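For a quick sanity check, the cleaned-up regex can be compiled and run against a sample Textile link. This is only a sketch: the class name `TextileLinkDemo`, the helper methods, and the sample input are assumptions for illustration, and the group numbers simply follow the `$pre` … `$post` comments from the PHP source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextileLinkDemo {
    // Same expression as above, split across lines only for readability.
    static final Pattern LINK = Pattern.compile(
        "(^|(?<=[\\s>.(])|[{\\[])"                              // group 1: $pre
        + "\""                                                  // opening quote
        + "((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])"
        + "|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)"              // group 2: $atts
        + "([^\"]+?)"                                           // group 3: $text
        + "(?:\\(([^)]+?)\\)(?=\"))?"                           // group 4: $title
        + "\":"
        + "([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)"       // group 5: $url
        + "(/)?"                                                // group 6: $slash
        + "([^\\w/;]*?)"                                        // group 7: $post
        + "([]}]|(?=\\s|$|\\)))");                              // group 8: closing bracket or lookahead

    /** Hypothetical helper: returns the requested group of the first link found, or null. */
    static String groupOf(String input, int group) {
        Matcher m = LINK.matcher(input);
        return m.find() ? m.group(group) : null;
    }

    static String textOf(String input) { return groupOf(input, 3); }

    static String urlOf(String input) { return groupOf(input, 5); }

    public static void main(String[] args) {
        String sample = "\"Example\":http://example.com"; // made-up input
        System.out.println("text = " + textOf(sample));
        System.out.println("url  = " + urlOf(sample));
    }
}
```

In real code the PHP callback would presumably become a loop over `Matcher.find()` with `appendReplacement`, but that depends on how textile4j is structured.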

I'm not sure exactly where your problem lies, but this might help:

In Java (and I believe this is unique to Java), the [ symbol (not just the ] symbol) is reserved inside character classes and needs to be escaped.

The revised expression should probably be similar to the following, in order to be Java-compatible:

(^|(?<=[\s>.\(])|[{\[]) # $pre
"                       # start
(' . $this->c . ')      # $atts
([^"]+?)                # $text
(?:\(([^)]+?)\)(?="))?  # $title
":
('.$this->urlch.'+?)    # $url
(\/)?                   # $slash
([^\w\/;]*?)            # $post
([\]}]|(?=\s|$|\)))
/x

Basically, any place where most regex flavors will allow a character class like [a-z,;[\]+-], which would match "either a letter a-z or a comma, semicolon, open or close square bracket, plus or minus sign", needs to actually be [a-z,;\[\]+-] (escape the [ with a \ character).

This escaping requirement is due to the Java union, intersection and subtraction character-class constructs.
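To make those constructs concrete, here is a small sketch of the three forms as described in the java.util.regex.Pattern documentation. The class name `CharClassOperators` and the helper `matches` are invented for this example:

```java
import java.util.regex.Pattern;

public class CharClassOperators {
    // Union: a nested class adds its members, so this is a-d or m-p.
    static final Pattern UNION = Pattern.compile("[a-d[m-p]]");
    // Intersection: only characters that are in both a-z and [def].
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[def]]");
    // Subtraction: a-z except b and c, written as an intersection with a negated class.
    static final Pattern SUBTRACTION = Pattern.compile("[a-z&&[^bc]]");

    /** Hypothetical helper: true if the whole string matches the pattern. */
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("union 'n':        " + matches(UNION, "n"));
        System.out.println("union 'e':        " + matches(UNION, "e"));
        System.out.println("intersection 'e': " + matches(INTERSECTION, "e"));
        System.out.println("intersection 'a': " + matches(INTERSECTION, "a"));
        System.out.println("subtraction 'a':  " + matches(SUBTRACTION, "a"));
        System.out.println("subtraction 'b':  " + matches(SUBTRACTION, "b"));
    }
}
```

Because an inner [ has this meaning, a literal open bracket inside a class has to be written as \[ (that is, "\\[" in a Java string literal).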
