
stri_unescape_unicode() fails on some characters

I have a problem converting Unicode characters in R. I am following this approach, but stri_unescape_unicode from the stringi package fails to return the correct value in some cases. Here is an example where the correct value should be the word Tomáš:

library(stringi)
test <- "Tom<U+00E1><U+009A>"
test <- gsub("<U\\+(....)>", "\\\\u\\1", test)
stri_unescape_unicode(test)
[1] "Tomá\u009a"

However, if š is represented by U+0161 rather than U+009A, everything works as expected:

test2 <- "Tom<U+00E1><U+0161>"
test2 <- gsub("<U\\+(....)>", "\\\\u\\1", test2)
stri_unescape_unicode(test2)
[1] "Tomáš"

Now, my problem is that I have a large character vector with numerous elements like test, and stri_unescape_unicode fails on some characters, such as <U+009A> here. My question is:

  • Is there a way to convert <U+009A> with stri_unescape_unicode or any other method?
  • Alternatively, is there a way to automatically replace the code points on which stri_unescape_unicode fails? That is, in my example, "Tom<U+00E1><U+009A>" should become "Tom<U+00E1><U+0161>".

It appears that stri_unescape_unicode() has not failed. The character has been converted, but it is a control character (U+009A, "single character introducer") and is therefore printed using its code. Garbage in, garbage out.
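
To see this without relying on how the console prints the string, one can look at the code points directly. The sketch below uses only base R and writes the unescaped string with \u escapes so that it does not depend on stringi. The last step rests on an assumption that is not stated in the question: that the data originally came from a Windows-1252 (CP1252) source, where the single byte 0x9A encodes š. It also assumes the platform's iconv recognises the name "CP1252".

```r
# The unescaped string from the question, written with \u escapes.
x <- "Tom\u00e1\u009a"

# Code points of each character, independent of console and locale.
cp <- utf8ToInt(x)
code_labels <- sprintf("U+%04X", cp)
code_labels
#> [1] "U+0054" "U+006F" "U+006D" "U+00E1" "U+009A"

# U+0080..U+009F is the C1 control block, so the last character is a control code.
is_c1 <- cp >= 0x80 & cp <= 0x9F
is_c1
#> [1] FALSE FALSE FALSE FALSE  TRUE

# Assumption: the source data was CP1252. Decoding the single byte 0x9A with
# that code page should yield U+0161 (š) where iconv supports "CP1252".
decoded <- iconv(list(as.raw(0x9A)), from = "CP1252", to = "UTF-8")
sprintf("U+%04X", utf8ToInt(decoded))
```

If that assumption holds, the <U+009A> markers are most likely CP1252 bytes that were mislabelled as Unicode code points somewhere upstream.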

How R prints Unicode strings depends on the type of console and the locale in use. The following example was run via the reprex package using code page 1252 on Windows. Even though the unprintable character is printed in the <U+> or \u style, the actual Unicode character does exist in the corresponding R string.

library(stringi)
test2 <- c("Tom<U+00E1><U+009A>", "Tom<U+00E1><U+0161>")
test2 <- gsub("<U\\+(....)>", "\\\\u\\1", test2)
unesc2 <- stri_unescape_unicode(test2)
unesc2
#> [1] "Tomá<U+009A>" "Tomáš"
nchar(unesc2)
#> [1] 5 5
cap2 <- capture.output(cat(unesc2, sep = "\n"))
cap2
#> [1] "Tomá<U+009A>" "Tomáš"
nchar(cap2)
#> [1] 12  5
which(nchar(cap2) > nchar(unesc2))
#> [1] 1
es2 <- encodeString(unesc2)
es2
#> [1] "Tomá\\u009a" "Tomáš"
nchar(es2)
#> [1] 10  5
which(nchar(es2) > nchar(unesc2))
#> [1] 1
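
The nchar() comparisons above depend on how the current console and locale print each character. A possible locale-independent alternative is to test the code points themselves. The helper name has_c1_control below is hypothetical (it is not part of stringi or base R), and the sketch only flags the C1 control range U+0080..U+009F, not every character that might be unprintable.

```r
# Hypothetical helper: TRUE for strings containing a C1 control character
# (U+0080..U+009F), checked on code points rather than on printed width.
has_c1_control <- function(x) {
  vapply(enc2utf8(x), function(s) {
    if (is.na(s)) return(FALSE)            # treat missing values as "no control character"
    any(utf8ToInt(s) %in% 0x80:0x9F)       # 0x80:0x9F is 128:159
  }, logical(1), USE.NAMES = FALSE)
}

# Same two strings as unesc2 above, written with \u escapes.
unesc <- c("Tom\u00e1\u009a", "Tom\u00e1\u0161")
flagged <- which(has_c1_control(unesc))
flagged
#> [1] 1
```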

I think capture.output() or encodeString(), combined with nchar(), can be used as above to detect strings containing bad characters, i.e., characters that are unprintable in the current locale. Then, if it seems that all cases of U+009A should actually be U+0161, fixing those is a simple job for gsub(), e.g., gsub("\u009a", "\u0161", unesc2), and so on.
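
If more than one such character turns up, the one-off gsub() call can be generalised. The sketch below assumes that every C1 code point in the data is really a Windows-1252 byte that was mislabelled as a Unicode code point. That is an assumption about the data, not something the question confirms. The names cp1252_high and remap_c1_as_cp1252 are hypothetical. Bytes that CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are left unchanged.

```r
# Unicode code points that CP1252 assigns to bytes 0x80..0x9F (NA = undefined).
cp1252_high <- c(
  0x20AC, NA,     0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, NA,     0x017D, NA,
  NA,     0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, NA,     0x017E, 0x0178
)

# Hypothetical helper: replace each C1 code point with its CP1252 meaning.
remap_c1_as_cp1252 <- function(x) {
  vapply(enc2utf8(x), function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)
    idx <- which(cp >= 0x80 & cp <= 0x9F)   # positions of C1 control characters
    repl <- cp1252_high[cp[idx] - 0x7F]     # 0x80 maps to table entry 1
    keep <- !is.na(repl)                    # skip bytes undefined in CP1252
    cp[idx[keep]] <- repl[keep]
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# Same two strings as unesc2 above, written with \u escapes.
fixed <- remap_c1_as_cp1252(c("Tom\u00e1\u009a", "Tom\u00e1\u0161"))
fixed[1] == fixed[2]
#> [1] TRUE
```

Applied to the output of stri_unescape_unicode(), this would turn both elements of unesc2 into "Tomáš". It should only be used after checking that the affected strings really do come from a CP1252 source.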
