Unicode regular expressions in the Scala REPL

Question

I want to detect words made of Unicode letters ( \\p{L} ).

Scala's REPL returns false for the following statement, while Java returns true (which is the correct behaviour):

java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()

Both Java and Scala are running on JRE 1.7:

System.getProperty("java.version") returns "1.7.0_60-ea".

What could be the reason for that?

Answer 1

The interpreter is probably using an incompatible character encoding. For example, here's my output:

scala> System.getProperty("file.encoding")
res0: String = UTF-8

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
res1: Boolean = true

So the solution is to run scala with -Dfile.encoding=UTF-8. Note, however, this blog post (which is a bit old):

The only reliable way we've found for setting the default character encoding for Scala is to set $JAVA_OPTS before running your application:

$ JAVA_OPTS="-Dfile.encoding=utf8" scala [...]

Just trying to set scala -Dfile.encoding=utf8 doesn't seem to do it. [...]
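To see why a wrong encoding makes the match fail, here is a minimal sketch that simulates the mismatch by decoding the UTF-8 bytes of "ä" as ISO-8859-1. The choice of ISO-8859-1 and the helper name codeUnits are my own assumptions for illustration; the encoding your REPL actually falls back to may differ.

```scala
import java.nio.charset.StandardCharsets
import java.util.regex.Pattern

// Written as an escape so the snippet does not depend on the terminal encoding.
val letter = "\u00e4" // ä as one precomposed character

// Simulated mismatch: the UTF-8 bytes of ä read back as ISO-8859-1 (assumed encoding).
val misdecoded =
  new String(letter.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1)

// Hypothetical helper: show the UTF-16 code units the JVM actually received.
def codeUnits(s: String): List[String] =
  s.toList.map(c => f"U+${c.toInt}%04X")

val letterPattern = Pattern.compile("\\p{L}")
val okMatch  = letterPattern.matcher(letter).matches()     // one letter
val badMatch = letterPattern.matcher(misdecoded).matches() // two characters, one is not a letter

println(codeUnits(letter))
println(codeUnits(misdecoded))
println(s"okMatch=$okMatch badMatch=$badMatch")
```

Running codeUnits on the string you typed into the REPL is a quick way to check whether you got a single U+00E4 or something else.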

That wasn't the case here, but it can also happen: your "ä" could be the letter "a" followed by a combining diaeresis (umlaut) sign, e.g.:

scala> println("a\u0308")
ä

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("a\u0308").matches()
res1: Boolean = false

This is sometimes a problem on systems that create diacritics through Unicode combining characters (I think OS X is one, at least in some versions). For more info, see Paul's question.
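If you do run into the combining-character form, two possible workarounds are sketched below: normalizing the input to NFC with java.text.Normalizer before matching, or letting the pattern accept trailing combining marks via \p{M}. Neither comes from the original thread; treat this as one way to handle it, not the only one.

```scala
import java.text.Normalizer
import java.util.regex.Pattern

val decomposed = "a\u0308" // "a" followed by COMBINING DIAERESIS

// Option 1: compose to a single code point (U+00E4) before matching.
val composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC)

val singleLetter = Pattern.compile("\\p{L}")
val beforeNfc = singleLetter.matcher(decomposed).matches() // false, as shown above
val afterNfc  = singleLetter.matcher(composed).matches()

// Option 2: allow a letter to be followed by any number of combining marks.
val letterWithMarks = Pattern.compile("\\p{L}\\p{M}*")
val withMarks = letterWithMarks.matcher(decomposed).matches()

println(s"beforeNfc=$beforeNfc afterNfc=$afterNfc withMarks=$withMarks")
```

Normalizing is usually the simpler choice when you control the input; the \p{M}* variant helps when you have to match text as it arrives.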

Answer 2

You can also "Enable the Unicode version of Predefined character classes and POSIX character classes", as described in java.util.regex.Pattern under UNICODE_CHARACTER_CLASS.

This means you can use character classes such as '\\w' to match Unicode characters like this:

"(?U)\\w+".r.findFirstIn("pässi")

In the regexp above, the '(?U)' bit is an embedded flag expression that turns on the UNICODE_CHARACTER_CLASS flag for the regexp.

This flag is supported starting from Java 7.
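For comparison, here is a small sketch of the same search with and without the flag, plus the equivalent form that passes Pattern.UNICODE_CHARACTER_CLASS to Pattern.compile. The word is written with an escape so the result does not depend on the REPL's encoding; the value names are mine.

```scala
import java.util.regex.Pattern

val word = "p\u00e4ssi" // "pässi"

// Without the flag, \w only covers [a-zA-Z_0-9], so the match stops before ä.
val asciiOnly = "\\w+".r.findFirstIn(word)

// With the embedded flag, \w follows the Unicode definition and covers the whole word.
val unicodeAware = "(?U)\\w+".r.findFirstIn(word)

// Same effect using the flag constant instead of the embedded expression.
val viaConstant =
  Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(word).matches()

println(s"asciiOnly=$asciiOnly unicodeAware=$unicodeAware viaConstant=$viaConstant")
```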


Answer 1
6, accepted, 2014-02-17 20:07:19

Answer 2
2, 2015-05-20 06:51:10
