Scala - unescape Unicode String without Apache

I have a String "b\\u00f4lovar" and I was wondering whether it is possible to unescape it without using Commons Lang. The following works, but I'm facing a problem in some environments and I would like to minimize it (i.e. it works on my machine but not in production).

StringEscapeUtils.unescapeJava(variables.getOrElse("name", ""))

How can I unescape it without the Apache library?

Thanks in advance.

Only Unicode escapes

If you want to unescape only sequences in the format \uXXXX, then it is simple to do with a single regex replace:

def unescapeUnicode(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      case '\\' => """\\"""
      case '$' => """\$"""
      case c => c.toString
    })

And the result is:

scala> unescapeUnicode("b\\u00f4lovar \\u30B7")
res1: String = bôlovar シ

We have to process the characters $ and \ separately, because they are treated as special by the java.util.regex.Matcher.appendReplacement method. Without that handling, the replacement fails:

def wrongUnescape(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)

scala> wrongUnescape("\\u00" + Integer.toString('$', 16))
java.lang.IllegalArgumentException: Illegal group reference: group index is missing
  at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
  ... 46 elided

scala> wrongUnescape("\\u00" + Integer.toString('\\', 16))
java.lang.IllegalArgumentException: character to be escaped is missing
   at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
   ... 46 elided
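
The manual match on $ and \ can also be delegated to Regex.quoteReplacement, which escapes both characters for use in a replacement string. A minimal sketch of that alternative; the name unescapeUnicodeQuoted is hypothetical and not part of the answer above:

```scala
import scala.util.matching.Regex

// Hypothetical variant of unescapeUnicode: quoteReplacement escapes '$' and '\'
// so that appendReplacement inserts the decoded character literally.
def unescapeUnicodeQuoted(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))
```

This should behave like unescapeUnicode for ordinary input, while the two problematic inputs shown above ("\\u0024" and "\\u005c") decode to $ and \ instead of throwing.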

All escape characters

Unicode character escapes are a bit special: they are not a part of string literals, but a part of the program code. There is a separate phase that replaces Unicode escapes with characters:

scala> Integer.toString('a', 16)
res2: String = 61

scala> val \u0061 = "foo"
a: String = foo

scala> // first \u005c is replaced with a backslash, and then \t is replaced with a tab.
scala> "\u005ct"
res3: String = "    " 

There is a function StringContext.treatEscapes in the Scala library that supports all the normal escapes from the language specification.

So if you want to support Unicode escapes and all the normal Scala escapes, you can unescape both sequentially:

def unescape(str: String): String =
  StringContext.treatEscapes(unescapeUnicode(str))

scala> unescape("\\u0061\\n\\u0062")
res4: String =
a
b

scala> unescape("\\u005ct")
res5: String = "    "
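
To tie this back to the question, here is a rough sketch that applies the helper to a map lookup and falls back to the raw value when the input contains an escape that treatEscapes rejects. The contents of variables and the name unescapeOrRaw are hypothetical. As far as I know, treatEscapes is deprecated since Scala 2.13 in favour of StringContext.processEscapes, so expect a deprecation warning there.

```scala
import scala.util.Try

// Same helpers as above, repeated so the snippet is self-contained.
def unescapeUnicode(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      case '\\' => """\\"""
      case '$' => """\$"""
      case c => c.toString
    })

def unescape(str: String): String =
  StringContext.treatEscapes(unescapeUnicode(str))

// Hypothetical wrapper: treatEscapes throws on an invalid escape such as "\q",
// so fall back to the unmodified input in that case.
def unescapeOrRaw(str: String): String =
  Try(unescape(str)).getOrElse(str)

// Hypothetical stand-in for the `variables` map from the question.
val variables = Map("name" -> "b\\u00f4lovar")
val name = unescapeOrRaw(variables.getOrElse("name", ""))
```

Whether silently returning the raw string is acceptable depends on the caller; logging or rethrowing may be more appropriate in production code.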
