R fails to encode Romanian characters with UTF-8 correctly

I'm working on a dataset of locations where some location names use local characters. Most characters display correctly, but I'm having issues with some Romanian characters, for example "ș".

I have tried changing the system locale of my 64-bit Windows 10 machine to use UTF-8 encoding, but that did not solve the issue.

A sample file can be found here for testing: https://drive.google.com/file/d/1T7QQQ7G_dA_rXD9Ewf51uuQ6CUkscjP_/view?usp=sharing

This line imports the data:

df <- read.delim("R_Encode_Issue.csv", header=TRUE, sep=",", encoding = "UTF-8", colClasses=c("character","character","character"))

> df
  region country         chapter
1 Europe Moldova Chi<U+0219>inau

This displays the location chapter as "Chiinau" (Stack Overflow can't even display this :D) both in the console and in the viewer.

If I convert the data frame to a tibble:

library(tibble)  # provides as_tibble()
df2 <- as_tibble(df)

> df2
# A tibble: 1 x 3
  region country chapter 
  <chr>  <chr>   <chr>   
1 Europe Moldova Chișinău

The console displays the location chapter as "Chișinău", but the viewer shows it as "Chiinau".
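One way to tell a display problem from a data problem is to look at the string's code points instead of its printed form. The following is a minimal sketch, not output from the sample file: `x` is a hypothetical stand-in for `df$chapter`, built with `\u` escapes so that the encoding of the script file itself does not matter.

```r
# Hypothetical stand-in for df$chapter, built with \u escapes so the
# script's own file encoding does not matter.
x <- "Chi\u0219in\u0103u"

# Declared encoding of the string (typically "UTF-8" for \u-escaped text).
enc <- Encoding(x)

# Unicode code points: 537 is U+0219 (s with comma below),
# 259 is U+0103 (a with breve).
cps <- utf8ToInt(x)

# Raw UTF-8 bytes: U+0219 is stored as c8 99, U+0103 as c4 83.
bytes <- charToRaw(enc2utf8(x))

print(enc)
print(cps)
print(bytes)
```

If the code points are intact, the import most likely worked and only the printing or viewer layer is substituting characters.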

I write the data to a .csv file:

write.csv(df2, file = "R_Encode_Out.csv",row.names=FALSE, na="", fileEncoding = "UTF-8") 

The location chapter is written as "Chiinau" in the output file.

R version:

platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          5.3                         
year           2019                        
month          03                          
day            11                          
svn rev        76217                       
language       R                           
version.string R version 3.5.3 (2019-03-11)
nickname       Great Truth     

RStudio version:

$mode
[1] "desktop"

$version
[1] ‘1.1.463’

Since I use UTF-8 as the encoding on both import and export, I expected the viewer, or at least the written file, to display the characters correctly. Instead, the characters are exported incorrectly.

Any insight into what I can do to correct this?

Try using different import and export functions than the base R ones. I got this to work using readr: the exported file is correct (although the viewer, it seems, still displays it as Chi<U+0219>inau). The exported file opens correctly in Notepad, and in Excel if I specify that it has UTF-8 encoding.

library(readr)
df <- read_csv("C:/Users/Andrew/Downloads/R_Encode_Issue.csv", locale = locale(encoding = "UTF-8"))

df
# A tibble: 1 x 3
  region country chapter 
  <chr>  <chr>   <chr>   
1 Europe Moldova Chișinău

write_csv(df, "C:/Users/Andrew/Desktop/R_Encode_Issue.csv")
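
If adding readr is not an option, a byte-oriented write in base R may also work. This is only a sketch under stated assumptions, not a verified fix for the exact setup in the question: `write_utf8_csv` is a hypothetical helper name, every field is quoted, and NA values end up as the quoted text NA. The idea is to convert each field to UTF-8 and write with `useBytes = TRUE`, so that R does not pass the text through the native code page on the way out.

```r
# Hypothetical helper: write a data frame as UTF-8 CSV without letting R
# translate the text through the native (non-UTF-8) encoding.
write_utf8_csv <- function(df, path) {
  # Quote a vector of fields, doubling embedded quotes, and force UTF-8.
  quote_field <- function(x) {
    x <- enc2utf8(as.character(x))
    paste0('"', gsub('"', '""', x, fixed = TRUE), '"')
  }
  header <- paste(quote_field(names(df)), collapse = ",")
  # Quote each column, then paste the columns together row by row.
  cols <- unname(lapply(df, quote_field))
  rows <- do.call(paste, c(cols, sep = ","))
  con <- file(path, open = "wb")
  on.exit(close(con))
  # useBytes = TRUE writes the UTF-8 bytes exactly as they are stored.
  writeLines(c(header, rows), con, useBytes = TRUE)
  invisible(path)
}

# Example data mirroring the question, built with \u escapes.
df <- data.frame(region = "Europe", country = "Moldova",
                 chapter = "Chi\u0219in\u0103u", stringsAsFactors = FALSE)
out <- tempfile(fileext = ".csv")
write_utf8_csv(df, out)

# Read the file back as UTF-8 and compare code points with the original.
back <- read.csv(out, encoding = "UTF-8", stringsAsFactors = FALSE)
print(identical(utf8ToInt(back$chapter), utf8ToInt(df$chapter)))
```

Comparing code points rather than printed text avoids being misled by a console or viewer that cannot render the characters.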
