R: Unable to read a Unicode text file even when specifying the encoding

Question

I'm using R 3.1.1 on 32-bit Windows 7. I'm having a lot of problems reading some text files on which I want to perform textual analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, reports the file as "Unicode".)

The problem is that I can't seem to read the file even when specifying that encoding. (The characters belong to the standard Spanish Latin set, such as ñ, á and ó, and should be handled easily with CP1252 or anything like that.)

> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
> readLines("filename.txt")
 [1] "ÿþE" ""    ""    ""    ""   ...
> readLines("filename.txt",encoding="UTF-8")
 [1] "\xff\xfeE" ""          ""          ""          ""    ...
> readLines("filename.txt",encoding="UCS2LE")
 [1] "ÿþE" ""    ""    ""    ""    ""    ""     ...
> readLines("filename.txt",encoding="UCS2")
 [1] "ÿþE" ""    ""    ""    ""    ...

Any ideas?

Thanks!!

Edit: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail in the same way.

Answer 1

After reading the documentation more closely, I found the answer to my question.

The encoding parameter of readLines only describes the strings that are read; it does not convert the file's contents. The documentation says:

encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples. See also 'Details'.

The proper way to read a file with an uncommon encoding is, then, to declare the encoding on the connection:
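The difference between marking a string and re-encoding it can be seen in a small sketch. The variable names x and y and the single-character sample are illustrative assumptions, not part of the original question; Encoding(), iconv(), rawToChar() and charToRaw() are base R functions.

```r
# One byte, 0xF1, which is "ñ" in Latin-1 (and in CP1252).
x <- rawToChar(as.raw(0xF1))

# Marking only declares how the existing bytes should be interpreted;
# the bytes themselves are left untouched.
Encoding(x) <- "latin1"
charToRaw(x)

# Re-encoding actually rewrites the bytes, here into the two-byte
# UTF-8 sequence for "ñ".
y <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(y)
```

The encoding argument of readLines behaves like the Encoding(x) <- ... step: it labels the strings, which is why it cannot make sense of UTF-16 bytes.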

# Declare the encoding on the connection so the input is re-encoded while reading
con <- file("UnicodeFile.txt", encoding = "UCS-2LE")
filetext <- readLines(con)
close(con)
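To try this without the original file, here is a minimal, self-contained sketch that writes a small UTF-16LE file with a byte order mark and reads it back through a connection. The temporary file, its contents and the variable names are assumptions made for the demonstration. The leading bytes FF FE are the same ones that showed up as "ÿþ" in the failed attempts in the question.

```r
# Build a small UTF-16LE test file with a BOM (FF FE). Contents are hypothetical.
tmp <- tempfile(fileext = ".txt")
txt <- "hola\nEspa\u00f1a\n"
body <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), body), tmp)

# Inspect the first two bytes to confirm the BOM is there.
bom <- readBin(tmp, what = "raw", n = 2)

# Declare the encoding on the connection, not in readLines().
con <- file(tmp, encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

print(lines)
unlink(tmp)
```

The strings in lines are converted to the session's native encoding, so the accented characters display correctly only if the current locale can represent them.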

Answer 1: 7, accepted, 2014-10-14 13:09:56
