Unescape unicode in character string

There is a long-standing bug in RJSONIO when parsing json strings that contain unicode escape sequences. It seems the bug needs to be fixed in libjson, which might not happen any time soon, so I am looking into creating a workaround in R that unescapes \\uxxxx sequences before feeding them to the json parser.

Some context: json data is always unicode, using utf-8 by default, so there is generally no need for escaping. But for historical reasons, json does support escaped unicode. Hence the json data

{"x" : "Zürich"}

and

{"x" : "Z\u00FCrich"}

are equivalent and should result in exactly the same output when parsed. But for whatever reason, the latter doesn't work in RJSONIO. Additional confusion is caused by the fact that R itself supports escaped unicode as well. So when we type "Z\u00FCrich" in an R console, it is automatically converted to "Zürich". To get the actual json string at hand, we need to escape the backslash itself, which is the first character of the unicode escape sequence in json:

test <- '{"x" : "Z\\u00FCrich"}'
cat(test)
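
As a quick check of the distinction above, this small sketch compares the string R builds from its own \u escape with the literal six-character sequence that the json text carries. The variable names are made up for illustration.

```r
# R's parser resolves \u00FC itself, so this holds the single character ü
parsed_by_r <- "Z\u00FCrich"

# Doubling the backslash keeps the six characters \u00FC as literal text
literal <- "Z\\u00FCrich"

nchar(parsed_by_r)               # 6
nchar(literal)                   # 11
identical(parsed_by_r, literal)  # FALSE
```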

So my question is: given a large json string in R, how can I unescape all escaped unicode sequences? I.e. how do I replace all occurrences of \\uxxxx with the corresponding unicode character? Again, the \\uxxxx here represents an actual string of 6 characters, starting with a backslash. So an unescape function should satisfy:

#Escaped string
escaped <- "Z\\u00FCrich"

#Unescape unicode
unescape(escaped) == "Zürich"

#This is the same thing
unescape(escaped) == "Z\u00FCrich"

One thing that might complicate matters is that if the backslash itself is escaped in json with another backslash, it is not part of the unicode escape sequence. E.g. unescape should also satisfy:

#Watch out for escaped backslashes
unescape("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape("Z\\\\\\u00FCrich") == "Z\\\\ürich"

After playing with this some more, I think the best I can do is to search for \\uxxxx patterns using a regular expression and then parse those using the R parser:

unescape_unicode <- function(x){
  #single string only
  stopifnot(is.character(x) && length(x) == 1)

  #find matches: one or more backslashes, then u and four hex digits
  m <- gregexpr("(\\\\)+u[0-9a-f]{4}", x, ignore.case = TRUE)

  if(m[[1]][1] > -1){
    #parse matches
    p <- vapply(regmatches(x, m)[[1]], function(txt){
      gsub("\\", "\\\\", parse(text=paste0('"', txt, '"'))[[1]], fixed = TRUE, useBytes = TRUE)
    }, character(1), USE.NAMES = FALSE)

    #substitute parsed into original
    regmatches(x, m) <- list(p)
  }

  x
}

This seems to work for all the cases above, and I haven't found any odd side effects yet.
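
For comparison, here is a minimal sketch of the same idea that avoids the R parser and instead counts the leading backslashes directly: an odd count means the last backslash starts a real escape, an even count means the backslashes only escape each other. This is an alternative illustration, not the code from the question; the name unescape_unicode_sketch is hypothetical, and surrogate halves (\uD800 to \uDFFF) are simply left untouched.

```r
unescape_unicode_sketch <- function(x) {
  # single string only
  stopifnot(is.character(x), length(x) == 1)

  # runs of backslashes followed by u and four hex digits
  m <- gregexpr("\\\\+u[0-9a-fA-F]{4}", x)
  hits <- regmatches(x, m)[[1]]
  if (length(hits) == 0) return(x)

  repl <- vapply(hits, function(txt) {
    n_slash <- nchar(txt) - 5  # number of leading backslashes

    # even count: the backslashes escape each other, so leave the text alone
    if (n_slash %% 2 == 0) return(txt)

    code <- strtoi(substr(txt, nchar(txt) - 3, nchar(txt)), base = 16L)

    # assumption of this sketch: surrogate halves are not combined, just kept
    if (code >= 0xD800 && code <= 0xDFFF) return(txt)

    # keep the escaped backslashes, replace the final \uxxxx by its character
    paste0(strrep("\\", n_slash - 1), intToUtf8(code))
  }, character(1), USE.NAMES = FALSE)

  regmatches(x, m) <- list(repl)
  x
}

unescape_unicode_sketch("Z\\u00FCrich") == "Z\u00FCrich"
unescape_unicode_sketch("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape_unicode_sketch("Z\\\\\\u00FCrich") == "Z\\\\\u00FCrich"
```

The three comparisons at the end mirror the requirements stated in the question, including the two escaped-backslash cases.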

There is a function for this in the stringi package :)

library(stringi)
escaped <- "Z\\u00FCrich"
escaped
## [1] "Z\\u00FCrich"
stri_unescape_unicode(escaped)
## [1] "Zürich"

Maybe like this?

\"x\"\s:\s\"([^"]*?)\"

This does not look at the letters themselves. It just waits for the closing quote.
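
To see what this pattern does in R, here is a small sketch that applies it to the test string from the question. The variable names are illustrative, and perl = TRUE is used so that the lazy quantifier and \s are handled by PCRE. Note that the pattern only extracts the value between the quotes; the escape sequence inside it is returned unchanged.

```r
test <- '{"x" : "Z\\u00FCrich"}'

# the same pattern written as an R string (backslashes doubled for \s)
pattern <- '"x"\\s:\\s"([^"]*?)"'

hit <- regmatches(test, regexec(pattern, test, perl = TRUE))[[1]]
hit[2]  # the captured value, still containing the literal \u00FC
```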
