Unicode Regex in Scala REPL

I want to detect words made up of Unicode letters (\p{L}).

Scala's REPL returns false for the following statement, while Java returns true (which is the correct behaviour):

java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()

Both Java and Scala are running on JRE 1.7:

System.getProperty("java.version") returns "1.7.0_60-ea"

What could be the reason for that?

Probably the interpreter is using an incompatible character encoding. For example, here's my output:

scala> System.getProperty("file.encoding")
res0: String = UTF-8

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
res1: Boolean = true

So the solution is to run scala with -Dfile.encoding=UTF-8. Note, however, this blog post (which is a bit old):

The only reliable way we've found for setting the default character encoding for Scala is to set $JAVA_OPTS before running your application:

$ JAVA_OPTS="-Dfile.encoding=utf8" scala [...]

Just trying to set scala -Dfile.encoding=utf8 doesn't seem to do it. [...]
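One way to check whether the encoding is really the culprit is to take the typed character out of the picture. The sketch below (the object and value names are made up for illustration) writes ä as a Unicode escape, so the string no longer depends on how the REPL or terminal decodes input. If this matches while the literal "ä" does not, the input encoding is the likely cause.

```scala
import java.nio.charset.Charset
import java.util.regex.Pattern

// Hypothetical helper object, not part of the original answer.
object EncodingCheck {
  // Precomposed letter a-with-diaeresis (U+00E4), written as an escape so the
  // value is independent of the terminal or source file encoding.
  val escaped: String = "\u00e4"

  // Matches exactly one Unicode letter.
  val letter: Pattern = Pattern.compile("\\p{L}")

  def main(args: Array[String]): Unit = {
    val defaultCharset = Charset.defaultCharset()
    val fileEncoding = System.getProperty("file.encoding")
    println("default charset: " + defaultCharset)
    println("file.encoding:   " + fileEncoding)
    // A correctly decoded literal has length 1, just like the escaped form.
    println("escaped length:  " + escaped.length)
    println("matches \\p{L}:   " + letter.matcher(escaped).matches())
  }
}
```

If a literal "ä" typed into the REPL reports a length other than 1, it was probably decoded with the wrong charset before the regex ever saw it.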


That wasn't the case here, but it can also happen: your "ä" could be an "a" followed by a combining diaeresis (umlaut) character, e.g.:

scala> println("a\u0308")
ä

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("a\u0308").matches()
res1: Boolean = false

This is sometimes a problem on systems that create diacritics using Unicode combining characters (I think OS X is one, at least in some versions). For more info, see Paul's question.
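If combining characters do turn out to be the issue, two approaches might help; neither comes from the original answer, so treat this as a sketch. One is to normalize the input to NFC with java.text.Normalizer, so that "a" plus U+0308 becomes the single letter U+00E4. The other is to let the pattern accept trailing combining marks with \p{M}*. The object and value names below are illustrative only.

```scala
import java.text.Normalizer
import java.util.regex.Pattern

// Hypothetical helper object, not part of the original answer.
object CombiningMarks {
  // "a" followed by COMBINING DIAERESIS (U+0308): two chars, one visible glyph.
  val decomposed: String = "a\u0308"

  // NFC composes the pair into the single precomposed letter U+00E4.
  val composed: String = Normalizer.normalize(decomposed, Normalizer.Form.NFC)

  // Exactly one letter.
  val singleLetter: Pattern = Pattern.compile("\\p{L}")

  // One letter followed by any number of combining marks.
  val letterWithMarks: Pattern = Pattern.compile("\\p{L}\\p{M}*")

  def main(args: Array[String]): Unit = {
    println(singleLetter.matcher(decomposed).matches())    // false
    println(singleLetter.matcher(composed).matches())      // true
    println(letterWithMarks.matcher(decomposed).matches()) // true
  }
}
```

Normalizing is usually the simpler choice when the input can be preprocessed; the \p{M}* pattern avoids changing the input at all.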

You can also "Enable the Unicode version of Predefined character classes and POSIX character classes", as described in java.util.regex.Pattern and UNICODE_CHARACTER_CLASS.

This means you can use character classes such as \w to match Unicode characters, like this:

"(?U)\\w+".r.findFirstIn("pässi")

In the regexp above, the (?U) part is an embedded flag expression that turns on the UNICODE_CHARACTER_CLASS flag for the regexp.

This flag is supported starting from Java 7.
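To make the effect of the flag visible, here is a small comparison sketch (the object and value names are illustrative). It runs \w+ against the same word with and without (?U), and also passes Pattern.UNICODE_CHARACTER_CLASS as a compile flag, which does the same thing as the embedded form. The word is written with a Unicode escape so the result does not depend on the REPL's encoding.

```scala
import java.util.regex.Pattern

// Hypothetical helper object, not part of the original answer.
object UnicodeWordClass {
  // "pässi" with the a-with-diaeresis written as an escape (U+00E4).
  val word: String = "p\u00e4ssi"

  // Without the flag, \w covers only [a-zA-Z_0-9], so the match stops at "p".
  val asciiOnly: Option[String] = "\\w+".r.findFirstIn(word)

  // With (?U), \w follows the Unicode definition and the whole word matches.
  val unicodeAware: Option[String] = "(?U)\\w+".r.findFirstIn(word)

  // The same flag passed to Pattern.compile instead of embedded in the regex.
  val viaCompileFlag: Boolean =
    Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(word).matches()

  def main(args: Array[String]): Unit = {
    println(asciiOnly)      // Some(p)
    println(unicodeAware)   // the full word wrapped in Some
    println(viaCompileFlag) // true
  }
}
```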
