
Unicode regex pattern not working

I am trying to match a sequence of Unicode characters:

Pattern pattern = Pattern.compile("\\u05[dDeE][0-9a-fA-F]{2,}");
String text = "\\n     \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n    <\\/span>\\n<br style=\\";
Matcher match = pattern.matcher(text);

But doing so gives this exception:

Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
  \u05[dDeE][0-9a-fA-F]+
      ^

How can I still use a regex with some regex metacharacters (like "[") to match Unicode?

EDIT: I'm trying to parse some text. Somewhere in the text there is a sequence of Unicode characters whose code range I know.

Edit2: I am now using ranges instead: [\\u05d0-\\u05ea]{2,} but I still can't match the text above.

Edit3: OK, now it's working. The problem was that I used two backslashes instead of one, both in the regex and in the text. The solution, assuming I know there will be two or more characters, is:

[\u05d0-\u05ea]{2,}

Here is what is causing the exception:

\\u05[dDeE][0-9a-fA-F]{2,}
  ^^^^

The Java regular expression parser thinks you are trying to match a Unicode code point using the escape sequence \uNNNN, so it throws an exception: \u requires four hexadecimal digits after it, and here there are only two, namely 05. If what you actually want is to match a literal backslash, you need to escape the backslash itself (written \\\\ in a Java string literal).
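To make the failure concrete, here is a minimal sketch that checks whether the regex engine accepts a pattern. The class name UnicodeEscapeCheck and the helper compiles are made up for illustration; only Pattern.compile and PatternSyntaxException come from the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class UnicodeEscapeCheck {
    // Hypothetical helper: true when the regex engine accepts the pattern text.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The engine sees backslash-u followed by "05[d", which is not four hex digits.
        System.out.println(compiles("\\u05[dDeE][0-9a-fA-F]{2,}")); // prints false
        // A complete four-digit escape is accepted.
        System.out.println(compiles("\\u05d0{2,}")); // prints true
    }
}
```

The point is that a character class cannot stand in for part of an escape: all four hex digits have to be written out.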

On the other hand, if you want to match \\u in the target string, then you need to quad-escape each backslash \ like this \\\\, so to match \\u you need \\\\\\\\u.

\\\\\\\\u05[dDeE][0-9a-fA-F]{2,}

Finally, if you want to match those Unicode code points literally in your target string, then you need to modify our last expression a bit, like this:

(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}

Edit: Since there is only one backslash in your target string, your regular expression should be:

(?:\\\\u05[dDeE][0-9a-fA-F]){2,}

This will match \u05db\u05d3\u05d5\u05e8\u05d2\u05dc in your string:

<\/span><\/span><span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n <\/span>\n<br style=\"clear : both; font-size : 1px;\">\n<\/div>"}, 200, null, null);
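As a rough sketch of how that expression could be used, the class below finds the first run of escaped sequences and then decodes it. The names EscapedHebrewFinder, firstRun, and decode are hypothetical, and the decoding step is an addition that assumes every escape is exactly six characters long (backslash, u, four hex digits), which the pattern guarantees for its own matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedHebrewFinder {
    // Four backslashes in Java source become two in the regex,
    // and those two match ONE literal backslash in the input.
    static final Pattern ESCAPED_RUN =
            Pattern.compile("(?:\\\\u05[dDeE][0-9a-fA-F]){2,}");

    // Returns the first run of escaped sequences in the text, or null if there is none.
    static String firstRun(String text) {
        Matcher m = ESCAPED_RUN.matcher(text);
        return m.find() ? m.group() : null;
    }

    // Hypothetical helper: turns each six-character escape into the character it names.
    static String decode(String run) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 6 <= run.length(); i += 6) {
            // Skip the backslash and the 'u', then parse the four hex digits.
            sb.append((char) Integer.parseInt(run.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Every escape below is a literal backslash followed by 'u', as in the raw response text.
        String text = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n <\\/span>";
        String run = firstRun(text);
        System.out.println(run);         // the six escapes, still as text
        System.out.println(decode(run)); // the six Hebrew letters they name
    }
}
```

If the input has already been unescaped into real characters, this pattern will not match, and a character range is the right tool instead.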

Edit 2: If you want to match the literal text \u05db\u05d3\u05d5\u05e8\u05d2\u05dc then you can't use a range.

On the other hand, if you want to match Unicode code points between 05d0 and 05df, then you can use:

(?:[\\u05d0-\\u05df]){2,}

It's not clear what you're trying to do. If your goal is to simplify matching a range of Unicode characters, then you need to realize that the hex digits are completely case-insensitive, so your a-fA-F is redundant, even if you could split character literals. Try this to match all Unicode characters in the range:

[\\u05d0-\\u05ef]
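A range like this can be written with one backslash or two in Java source, and a short sketch may help show that both forms behave the same. It uses the range 05d0 to 05ea from the question's final edit; the class name HebrewRangeDemo and the helper mask are invented for the example.

```java
import java.util.regex.Pattern;

class HebrewRangeDemo {
    // One backslash: the Java compiler resolves the escape,
    // so the regex engine receives the actual characters.
    static final Pattern COMPILER_LEVEL = Pattern.compile("[\u05d0-\u05ea]{2,}");

    // Two backslashes: the pattern text still contains backslash-u,
    // and the regex engine resolves the escape itself.
    static final Pattern REGEX_LEVEL = Pattern.compile("[\\u05d0-\\u05ea]{2,}");

    // Hypothetical helper: replaces every run of two or more matching characters.
    static String mask(Pattern p, String text) {
        return p.matcher(text).replaceAll("@@@");
    }

    public static void main(String[] args) {
        String text = "\n  \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n";
        // Both patterns produce the same result on real Hebrew characters.
        System.out.println(mask(COMPILER_LEVEL, text).equals(mask(REGEX_LEVEL, text))); // prints true
    }
}
```

Either form works only when the input holds real characters rather than backslash-u text.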

It looks like you have an unnecessary \\ in your input string. The following works, replacing whatever your specified Unicode character range matches:

String text = "\n  \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n    </span>\n<br style=\\";
System.out.println(text.replaceAll("[\u05d0-\u05ea]{2,}", "@@@"));

OUTPUT:

  @@@
    </span>
<br style=\

Note that your input text had \\n and \\u05db etc., which I have fixed.
