Cannot read unicode .csv into R

I have a .csv file that contains the following data:

"Ա","Բ"
1,10
2,20

I cannot read it into R in a way that displays the column names as they appear in the file.

d <- read.csv("./Data/1.csv", fileEncoding="UTF-8")
head(d)

This produces the following:

> d <- read.csv("./Data/1.csv", fileEncoding="UTF-8")
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection './Data/1.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on './Data/1.csv'
> head(d)
[1] X.
<0 rows> (or 0-length row.names)

Meanwhile, doing the same without specifying fileEncoding produces this:

> d <- read.csv("./Data/1.csv")
> head(d)
  Ô. Ô²
1  1 10
2  2 20

When I run the "file" utility to find out the encoding of the file, it says it is UTF-8:

Data\1.csv: UTF-8 Unicode text, with CRLF line terminators

I am using RStudio, Windows 7, R version 2.15.2, 32-bit.

Thanks in advance.

I wrote a longer answer on the same issue here: R on Windows: character encoding hell.

Quick answer: using the encoding parameter instead of fileEncoding should fix your first issue. You may not be able to read the characters in either the console or the table view in RStudio, but you will be able to use them in formulas.

d <- read.csv("./Data/1.csv", encoding="UTF-8")
head(d)
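As a minimal sketch of why this can help: according to the read.table documentation, fileEncoding re-encodes the file into the session's native encoding while reading, whereas encoding only marks the strings as UTF-8 without converting them. The temporary file below is a hypothetical stand-in for ./Data/1.csv, and check.names = FALSE is an addition not present in the answer; it stops read.csv from rewriting the header into syntactic names.

```r
# Hypothetical stand-in for ./Data/1.csv, written as raw UTF-8 bytes.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeLines(enc2utf8(c('"\u0531","\u0532"', "1,10", "2,20")), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the strings as UTF-8 instead of converting them.
# check.names = FALSE (an assumption, not part of the answer) keeps the
# header exactly as it appears in the file.
d <- read.csv(path, encoding = "UTF-8", check.names = FALSE)

names(d)            # the two Armenian letters, possibly shown as <U+0531> <U+0532>
Encoding(names(d))  # how R has marked the column names
```

How the names are printed still depends on the console and locale, so an escaped display does not by itself mean the data are wrong.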

Having saved your table into a UTF-8 file:

> test2 <- read.csv("test2.csv", header = FALSE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", encoding = "UTF-8")
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'test2.csv'

This is how it looks in the console and in the RStudio view:

> test2
        V1       V2
1 <U+0531> <U+0532>
2        1       10
3        2       20

Importantly, however, you can still manipulate this within R. In my case, Encoding() shows that the Ա typed into the script window is UTF-8 encoded, and grep correctly finds that character in your table.

> Encoding("Ա")
[1] "UTF-8"
> grep("Ա", as.character(test2[1,1]))
[1] 1
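One way to confirm that an escaped display such as <U+0531> is only a printing issue is to look at the underlying bytes and code point rather than the printed form. The following is a small sketch using base R functions only; the variable names are illustrative.

```r
# The Armenian capital letter Ayb, written with a Unicode escape so the
# script itself stays ASCII.
x <- "\u0531"

# Bytes of the UTF-8 representation and the Unicode code point.
charToRaw(enc2utf8(x))   # d4 b1
utf8ToInt(enc2utf8(x))   # 1329, i.e. 0x0531

# Literal (non-regex) matching works regardless of how the console prints.
hits <- grepl(x, c("\u0531", "\u0532"), fixed = TRUE)
hits                     # TRUE FALSE
```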

You may need to find encoding settings that work in your environment, or possibly change them. Unfortunately, I am not sure where that is done.

You might not be able to make it look pretty at every stage, but it is definitely possible to get it working in a Windows 7 environment too.
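As a starting point for checking those settings, the current session's character locale can be inspected from within R. This is only a sketch of how to look at the settings, not a recipe for changing them; what can be changed depends on the operating system and R version.

```r
# Character-handling locale of the current session.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports, among other things, whether the session is UTF-8.
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

message("LC_CTYPE: ", ctype)
message("UTF-8 session: ", is_utf8)
```

In a session that is not UTF-8, characters outside the native code page are typically the ones printed in escaped form.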

I tried two ways to replicate your problem.

I copied the characters above into RStudio and saved them to a csv with this code:

write.csv(c("Ա","Բ",
             1,10,
             2,20), "test.csv")

df <- read.csv("test.csv")

This worked fine.

Then I thought, maybe R is cheating when I save the CSV with R itself? So I just pasted the characters into a text file and saved it as a CSV. This approach doesn't have problems either.

Here's my session info:

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8       
[4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
[7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] party_1.0-9       modeltools_0.2-21 strucchange_1.4-7 sandwich_2.2-10   zoo_1.7-10       
[6] GGally_0.4.4      reshape_0.8.4     plyr_1.8          ggplot2_0.9.3.1  

loaded via a namespace (and not attached):
[1] coin_1.0-23        colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
[5] gtable_0.1.2       labeling_0.2       lattice_0.20-23    MASS_7.3-29       
[9] munsell_0.4.2      mvtnorm_0.9-9995   proto_0.3-10       RColorBrewer_1.0-5
[13] reshape2_1.2.2     scales_0.2.3       splines_3.0.1      stringr_0.6.2 

I had the same problem and found out that the file was corrupted.

I opened the file with OpenOffice and saved it back using the "UTF8" character set (you need to tick the "Edit filter settings" box), then imported it with read.csv() (no encoding or fileEncoding option), and it worked fine.
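Before re-saving a file, it may be worth checking from R whether its bytes are valid UTF-8 and whether it starts with a byte order mark. The sketch below builds a hypothetical sample file with a BOM and CRLF line endings; validUTF8() requires R 3.3.0 or later, so it would not be available in the R 2.15.2 setup from the question.

```r
# Hypothetical sample: UTF-8 BOM, then the table with CRLF line endings.
path <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xef, 0xbb, 0xbf))
body <- charToRaw(enc2utf8('"\u0531","\u0532"\r\n1,10\r\n2,20\r\n'))
writeBin(c(bom, body), path)

# Read the raw bytes and test for a leading BOM.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
has_bom <- length(bytes) >= 3 && identical(bytes[1:3], bom)

# Check every line for valid UTF-8 (validUTF8 needs R >= 3.3.0).
lines <- readLines(path, warn = FALSE)
all_valid <- all(validUTF8(lines))

has_bom
all_valid
```

If all_valid is FALSE, the file contains byte sequences that are not UTF-8, which would be consistent with a damaged or differently encoded file.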
