Reading a UTF-8 text file (in Hebrew) shows gibberish in RStudio's console but prints fine in Rgui
I am trying to understand whether this is a bug in RStudio or whether I am missing something.
I am reading a csv file into R. When I print it to the console in RStudio I get gibberish (unless I look at a specific vector), while in Rgui it prints fine.
The code I run is:
Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
x # shows gibberish
x[,2]
colnames(x)
Here is the output from RStudio (gibberish):
> x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
> x
âéì..áùðéí. îéâãø
1 23.0 æëø
2 24.0 ð÷áä
3 23.0 ð÷áä
4 24.0 ð÷áä
5 25.0 æëø
6 18.0 æëø
7 26.0 æëø
8 21.5 ð÷áä
9 24.0 æëø
10 26.0 æëø
11 24.0 æëø
12 19.0 ð÷áä
13 19.0 ð÷áä
14 24.5 æëø
15 21.0 ð÷áä
> x[,2]
[1] זכר נקבה נקבה נקבה זכר זכר זכר נקבה זכר זכר זכר נקבה נקבה זכר נקבה
Levels: זכר נקבה
> colnames(x)
[1] "âéì..áùðéí." "îéâãø"
>
And here it is in Rgui (here it is fine):
> x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
> x # shows gibberish
גיל..בשנים. מיגדר
1 23.0 זכר
2 24.0 נקבה
3 23.0 נקבה
4 24.0 נקבה
5 25.0 זכר
6 18.0 זכר
7 26.0 זכר
8 21.5 נקבה
9 24.0 זכר
10 26.0 זכר
11 24.0 זכר
12 19.0 נקבה
13 19.0 נקבה
14 24.5 זכר
15 21.0 נקבה
> x[,2]
[1] זכר נקבה נקבה נקבה זכר זכר זכר נקבה זכר זכר זכר נקבה נקבה זכר נקבה
Levels: זכר נקבה
> colnames(x)
[1] "גיל..בשנים." "מיגדר"
>
In both sessions, my sessionInfo() is:
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255
[3] LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C
[5] LC_TIME=Hebrew_Israel.1255
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] installr_0.17.0
I'm using the latest RStudio version, 0.99.892.
Thanks.
This is a bug in RStudio, and not the only one. I see you have already received a general answer about the problems RStudio currently has with non-English locale support on Windows. As far as I know, this is not the first version with such problems. You may also run into some new problems that I think are related to Windows 10. Note that, since I am hitting that second type of problem as well, I use the English locale to print Hebrew.
I did some debugging on your problem and came up with a workaround, plus some new insights (I think) into where the problem lies. It could probably be debugged further to write a complete function that fixes it, but due to time (and hour) restrictions I have decided to stop here.
I created this data:
x <- data.frame("x"= c("דור","dor"))
As mentioned already, using the Hebrew locale I get gibberish as well:
Sys.setlocale("LC_ALL", "Hebrew")
[1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
"דור"
[1] "ãåø"
x
x
1 ãåø
2 dor
Using the English locale, I get this output:
Sys.setlocale("LC_ALL", "English")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
"דור"
[1] "דור"
x
x
1 <U+05D3><U+05D5><U+05E8>
2 dor
Note that output that is not a data.frame prints fine. The problem also occurs with the data.table class, while list and matrix print fine.
Checking both the print.data.frame and print.table methods reveals the main suspect: format.
Further investigation confirms these suspicions:
as.matrix(x)
x
[1,] "דור"
[2,] "dor"
format(as.matrix(x))
x
[1,] "<U+05D3><U+05D5><U+05E8>"
[2,] "dor "
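The comparison above can be repeated while also looking at the declared encoding of each string, which may help show where the text stops being treated as UTF-8. This is only a sketch: the Hebrew word is written with \u escapes so that the script itself does not depend on the file encoding, and stringsAsFactors = FALSE is my assumption so that Encoding() can be applied to the column directly. What format() reports will depend on the R version and locale, so no expected output is given for that step.

```r
# Rebuild the two-row example; "\u05d3\u05d5\u05e8" is the Hebrew word used above.
# stringsAsFactors = FALSE is an assumption made here, not part of the original code.
x <- data.frame(x = c("\u05d3\u05d5\u05e8", "dor"), stringsAsFactors = FALSE)

# Declared encoding of each element before any formatting takes place
enc_before <- Encoding(x$x)

# The two steps compared in the answer: plain matrix conversion, then format()
m <- as.matrix(x)
formatted <- format(m)

# Declared encodings after each step; the last row is the one to watch
enc_matrix <- Encoding(as.vector(m))
enc_formatted <- Encoding(as.vector(formatted))

print(rbind(before = enc_before, matrix = enc_matrix, formatted = enc_formatted))
```

If the last row differs from the first two, or the formatted strings contain <U+....> escapes, that points at format() rather than at reading the file.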
As such, in your case I suggest following this workflow:
Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
as.matrix(x)
âéì..áùðéí. îéâãø
[1,] "23.0" "זכר"
[2,] "24.0" "נקבה"
[3,] "23.0" "נקבה"
[4,] "24.0" "נקבה"
[5,] "25.0" "זכר"
[6,] "18.0" "זכר"
[7,] "26.0" "זכר"
[8,] "21.5" "נקבה"
[9,] "24.0" "זכר"
[10,] "26.0" "זכר"
[11,] "24.0" "זכר"
[12,] "19.0" "נקבה"
[13,] "19.0" "נקבה"
[14,] "24.5" "זכר"
[15,] "21.0" "נקבה"
Both locales, Hebrew and English, worked on my machine, but the column names (col.names) did not display correctly in either.
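For repeated use, the as.matrix() step could be wrapped in a small helper. The name print_as_matrix is hypothetical and not part of any package; it only packages the workaround described above, so it inherits the same limitation with column names.

```r
# Hypothetical helper: print a data.frame through as.matrix() so that printing
# does not go through format(), the step suspected above.
print_as_matrix <- function(df) {
  stopifnot(is.data.frame(df))
  m <- as.matrix(df)   # mixed column types are converted to a character matrix
  print(m)
  invisible(m)         # return the matrix without printing it a second time
}

# Usage with a small stand-in data.frame (ASCII only, for illustration)
demo <- data.frame(a = c("x", "y"), stringsAsFactors = FALSE)
m <- print_as_matrix(demo)
```

Because as.matrix() turns mixed columns into character, this is suitable for inspecting data at the console, not for further computation.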
To conclude, this is far from a complete solution; it is only a small, partial workaround for the printing (or rather, formatting) problem. It also sheds some more light on this Hebrew / non-English issue in RStudio, on which better solutions may be built. One example of a solution to a similar problem, writing Hebrew on Windows, can be seen in this SO thread.