stri_unescape_unicode() fails on some characters
I have a problem converting Unicode characters in R. I am following this approach, but stri_unescape_unicode from the stringi library fails to return the correct value in some cases. Let me show an example where the correct value should be the word Tomáš:
library(stringi)
test <- "Tom<U+00E1><U+009A>"
test <- gsub("<U\\+(....)>", "\\\\u\\1", test)
stri_unescape_unicode(test)
[1] "Tomá\u009a"
However, if š is represented by U+0161 rather than U+009A, everything works as expected:
test2 <- "Tom<U+00E1><U+0161>"
test2 <- gsub("<U\\+(....)>", "\\\\u\\1", test2)
stri_unescape_unicode(test2)
[1] "Tomáš"
Now, my problem is that I have a large character vector with numerous elements like test, and stri_unescape_unicode fails on some characters, like <U+009A> here. My question is:
How can I convert <U+009A> with stri_unescape_unicode or any other method?
How can I automatically replace the Unicode in cases where stri_unescape_unicode fails? That is, in my example, "Tom<U+00E1><U+009A>" should become "Tom<U+00E1><U+0161>"?

It appears that stri_unescape_unicode() has not failed. The character has been converted, but it is a control character (U+009A, "single character introducer") and is printed using its code. Garbage in, garbage out.
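One way to see this before unescaping is to read the code points straight out of the <U+XXXX> tokens and check whether any fall in the C1 control range, U+0080 to U+009F. The following base R sketch assumes four-digit tokens, as in the question; token_code_points and is_c1_control are hypothetical helper names, not part of stringi.

```r
# Hypothetical helper: return the code points named by the <U+XXXX>
# tokens found in a single string (assumes exactly four hex digits).
token_code_points <- function(x) {
  tokens <- regmatches(x, gregexpr("<U\\+[0-9A-Fa-f]{4}>", x))[[1]]
  # Characters 4 to 7 of "<U+00E1>" are the hex digits "00E1".
  strtoi(substr(tokens, 4, 7), base = 16L)
}

# Hypothetical helper: C1 control characters occupy U+0080..U+009F.
is_c1_control <- function(code_point) {
  code_point >= 0x80 & code_point <= 0x9F
}

cp <- token_code_points("Tom<U+00E1><U+009A>")
sprintf("U+%04X", cp)  # "U+00E1" "U+009A"
is_c1_control(cp)      # FALSE TRUE
```

Any token flagged here names a control character, so it will not display as a letter no matter which function does the unescaping.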
How R prints Unicode strings depends on the type of console and the locale used. The following example was run via the reprex package using code page 1252 on Windows. Even though the unprintable character is printed in the <U+> or \u style, the actual Unicode character does exist in the corresponding R string.
library(stringi)
test2 <- c("Tom<U+00E1><U+009A>", "Tom<U+00E1><U+0161>")
test2 <- gsub("<U\\+(....)>", "\\\\u\\1", test2)
unesc2 <- stri_unescape_unicode(test2)
unesc2
#> [1] "Tomá<U+009A>" "Tomáš"
nchar(unesc2)
#> [1] 5 5
cap2 <- capture.output(cat(unesc2, sep = "\n"))
cap2
#> [1] "Tomá<U+009A>" "Tomáš"
nchar(cap2)
#> [1] 12 5
which(nchar(cap2) > nchar(unesc2))
#> [1] 1
es2 <- encodeString(unesc2)
es2
#> [1] "Tomá\\u009a" "Tomáš"
nchar(es2)
#> [1] 10 5
which(nchar(es2) > nchar(unesc2))
#> [1] 1
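The check above depends on how the current locale prints each character. As a possible complement, the decoded strings can be tested directly for code points in the C1 control range, which does not depend on the console. This is a minimal base R sketch: has_c1_control is a hypothetical helper, and the test strings are built with intToUtf8() to stand in for the output of stri_unescape_unicode().

```r
# Hypothetical helper: TRUE for each string that contains at least one
# C1 control character (U+0080..U+009F), regardless of how it prints.
has_c1_control <- function(x) {
  vapply(
    enc2utf8(x),
    function(s) any(utf8ToInt(s) %in% 0x80:0x9F),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Stand-ins for the unescaped strings from the example.
strings <- c(
  intToUtf8(c(84, 111, 109, 225, 154)),  # "Tom", U+00E1, U+009A
  intToUtf8(c(84, 111, 109, 225, 353))   # "Tom", U+00E1, U+0161
)

has_c1_control(strings)         # TRUE FALSE
which(has_c1_control(strings))  # 1
```

Unlike the nchar() comparison, this only flags C1 controls, so characters that are merely unprintable in the current locale would not be reported.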
I think capture.output() or encodeString() combined with nchar() can be used as above to detect strings with bad characters, i.e., characters that are unprintable in the current locale. Then, if it seems that all cases of U+009A should actually be U+0161, fixing those is a simple job for gsub(), e.g., gsub("\u009a", "\u0161", unesc2), and so on.
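The gsub() repair can be extended to several characters with a lookup of stray code points and their intended replacements. The sketch below assumes the stray C1 characters are really Windows-1252 byte values (0x9A is š and 0x8A is Š in that code page). That origin is a guess about the data, not something the example proves, and replace_c1 is a hypothetical helper.

```r
# Assumed mapping: stray C1 control -> character it was probably meant to be.
from <- c("\u009a", "\u008a")
to   <- c("\u0161", "\u0160")  # s with caron, S with caron

# Hypothetical helper: apply each literal replacement in turn.
replace_c1 <- function(x, from, to) {
  stopifnot(length(from) == length(to))
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Stand-in for the unescaped vector: one broken string, one clean string.
broken <- c(intToUtf8(c(84, 111, 109, 225, 154)), "Tom")
repaired <- replace_c1(broken, from, to)

utf8ToInt(repaired[1])  # 84 111 109 225 353
```

Using fixed = TRUE treats each pattern as a literal character rather than a regular expression, which keeps the mapping easy to audit. Extend it only with pairs you have verified against the source data.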