
Reading a UTF-8 text file (in Hebrew) shows gibberish in RStudio's console but displays fine in Rgui

I am trying to understand whether this is a bug in RStudio or whether I am missing something.

I am reading a CSV file into R. When I print it to the console in RStudio I get gibberish (unless I look at a specific vector), while in Rgui the output is fine.

The code I will run is this:

Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")  
x # shows gibberish
x[,2]
colnames(x)

Here is the output from RStudio (gibberish):

> x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
> x
   âéì..áùðéí. îéâãø
1         23.0   æëø
2         24.0  ð÷áä
3         23.0  ð÷áä
4         24.0  ð÷áä
5         25.0   æëø
6         18.0   æëø
7         26.0   æëø
8         21.5  ð÷áä
9         24.0   æëø
10        26.0   æëø
11        24.0   æëø
12        19.0  ð÷áä
13        19.0  ð÷áä
14        24.5   æëø
15        21.0  ð÷áä
> x[,2]
 [1] זכר  נקבה נקבה נקבה זכר  זכר  זכר  נקבה זכר  זכר  זכר  נקבה נקבה זכר  נקבה
Levels: זכר נקבה
> colnames(x)
[1] "âéì..áùðéí." "îéâãø"      
> 

And here it is in Rgui (here it is fine):

>     x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")  
>     x # shows gibberish
   גיל..בשנים. מיגדר
1         23.0   זכר
2         24.0  נקבה
3         23.0  נקבה
4         24.0  נקבה
5         25.0   זכר
6         18.0   זכר
7         26.0   זכר
8         21.5  נקבה
9         24.0   זכר
10        26.0   זכר
11        24.0   זכר
12        19.0  נקבה
13        19.0  נקבה
14        24.5   זכר
15        21.0  נקבה
>     x[,2]
 [1] זכר  נקבה נקבה נקבה זכר  זכר  זכר  נקבה זכר  זכר  זכר  נקבה נקבה זכר  נקבה
Levels: זכר נקבה
>     colnames(x)
[1] "גיל..בשנים." "מיגדר"      
> 

In both sessions, my sessionInfo() is:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255   
[3] LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C                  
[5] LC_TIME=Hebrew_Israel.1255    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] installr_0.17.0

I'm using the latest RStudio version, 0.99.892.

Thanks.

This is a bug in RStudio, and not the only one. I see you have already received a general answer about the problems RStudio currently has with non-English locale support on Windows. As far as I know, this is not the first version with similar problems. You may also run into some newer problems that I think are related to Windows 10. Note that since I am affected by that second type of problem as well, I use the English locale to print Hebrew.

So I did some debugging on your problem and came up with a workaround and some new insights (I think) into where the problem lies. It could probably be debugged further to write a complete function that fixes it, but due to time (and the late hour) I decided to stop here.

I've created this data:

x <- data.frame("x"= c("דור","dor"))

As already mentioned, with the Hebrew locale I get gibberish as well:

Sys.setlocale("LC_ALL", "Hebrew")
[1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"

"דור"
[1] "ãåø"

x
   x
1 ãåø
2 dor
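
The garbled strings are consistent with Hebrew text encoded in CP1255 being displayed as if it were CP1252. A minimal sketch to check this with iconv(), assuming the platform's iconv supports both code pages (the variable names are only illustrative):

```r
# "\u05d3\u05d5\u05e8" is the Hebrew string "דור", written with escapes so the
# script does not depend on the encoding of the source file.
hebrew <- "\u05d3\u05d5\u05e8"

# Encode the string as CP1255 (the code page of the Hebrew_Israel.1255 locale)
# and keep the result as raw bytes.
bytes_1255 <- iconv(hebrew, from = "UTF-8", to = "CP1255", toRaw = TRUE)[[1]]
print(bytes_1255)

# Decode those same bytes as if they were CP1252 (the Western European code page).
misread <- iconv(list(bytes_1255), from = "CP1252", to = "UTF-8")
print(misread)
```

If misread matches the garbled string printed in the console, the bytes themselves are correct and only their interpretation at display time is wrong.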

With the English locale, I get this output:

Sys.setlocale("LC_ALL", "English")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

 "דור"
[1] "דור"

x
                         x
1 <U+05D3><U+05D5><U+05E8>
2                      dor

Note that non-data.frame output prints fine. The problem also occurs with the data.table class, while list and matrix objects print fine.

Checking both the print.data.frame and print.table methods reveals the main suspect: format.

Further investigation confirms this suspicion:

as.matrix(x)
     x    
[1,] "דור"
[2,] "dor"

format(as.matrix(x))
     x                         
[1,] "<U+05D3><U+05D5><U+05E8>"
[2,] "dor                     "

As such, in your case I suggest following this workflow:

Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")  
as.matrix(x) 
      âéì..áùðéí. îéâãø 
 [1,] "23.0"      "זכר" 
 [2,] "24.0"      "נקבה"
 [3,] "23.0"      "נקבה"
 [4,] "24.0"      "נקבה"
 [5,] "25.0"      "זכר" 
 [6,] "18.0"      "זכר" 
 [7,] "26.0"      "זכר" 
 [8,] "21.5"      "נקבה"
 [9,] "24.0"      "זכר" 
[10,] "26.0"      "זכר" 
[11,] "24.0"      "זכר" 
[12,] "19.0"      "נקבה"
[13,] "19.0"      "נקבה"
[14,] "24.5"      "זכר" 
[15,] "21.0"      "נקבה"
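
The as.matrix() step above can be wrapped in a small helper so the workaround is a single call. This is only a sketch: print_unformatted is a hypothetical name, and it assumes that printing a character matrix keeps bypassing the format() call, as observed above.

```r
# Hypothetical helper: print a data frame through as.matrix() instead of
# print.data.frame(), which goes through format().
print_unformatted <- function(df) {
  m <- as.matrix(df)        # factor columns become plain character strings
  print(m, quote = FALSE)   # print the character matrix without quotes
  invisible(m)              # return the matrix invisibly for further use
}

# Same toy data as in the answer; "\u05d3\u05d5\u05e8" is "דור".
df <- data.frame(x = c("\u05d3\u05d5\u05e8", "dor"))
m <- print_unformatted(df)
```

Note that numeric columns come out as strings in the matrix, so this is for display only.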

Both locales, Hebrew and English, worked on my machine, but the column names did not display correctly in either.
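
The dots in the garbled column names suggest they passed through make.names(), which read.csv() applies when check.names = TRUE (the default). One thing that may be worth trying is reading with check.names = FALSE so the header strings are kept exactly as read. This is an assumption, not a confirmed fix for RStudio. The sample lines below are made up to stand in for the real file.

```r
# Hypothetical two-line sample standing in for the CSV in the question.
# "\u05d2\u05d9\u05dc" is "גיל", "\u05de\u05d9\u05d2\u05d3\u05e8" is "מיגדר",
# and "\u05d6\u05db\u05e8" is "זכר".
csv_lines <- c("\u05d2\u05d9\u05dc,\u05de\u05d9\u05d2\u05d3\u05e8",
               "23.0,\u05d6\u05db\u05e8")

# Write the UTF-8 bytes to a temporary file without any re-encoding.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeLines(csv_lines, con, useBytes = TRUE)
close(con)

# check.names = FALSE skips make.names(), so the header is left untouched.
y <- read.csv(tmp, encoding = "UTF-8", check.names = FALSE,
              stringsAsFactors = FALSE)

print(Encoding(names(y)))  # encoding mark carried by the column names
print(names(y))
```

If the names carry the UTF-8 mark, they should at least take the same printing path as the x[,2] values, which did display correctly in the question.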

To conclude, this is far from a complete solution; it is just a small, partial workaround for the printing (or rather, the formatting) problem. It also sheds some more light on this Hebrew / non-English issue in RStudio, on which better solutions may be built. One example of a solution to a similar problem, writing Hebrew on Windows, can be seen in this SO thread.
