Read a UTF-8 text file with BOM

I have a text file with a byte order mark (U+FEFF) at the beginning. I am trying to read the file in R. Is it possible to avoid the byte order mark?

The function fread (from the data.table package) reads the file, but adds ļ»æ at the beginning of the first variable name:

> names(frame_pers)[1]
[1] "ļ»æreg_date"

The same happens with the read.csv function.

For now I have written a function that removes the BOM from the first column name, but I believe there should be a way to strip the BOM automatically.

remove.BOM <- function(x) setnames(x, 1, substring(names(x)[1], 4))

> names(frame_pers)[1]
[1] "ļ»æreg_date"
> remove.BOM(frame_pers)
> names(frame_pers)[1]
[1] "reg_date"
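The workaround above drops the first three characters unconditionally, so it would also damage a column name from a file that has no BOM. A minimal sketch of a guard, assuming base R only: has_utf8_bom is a hypothetical helper (not part of the original post) that checks whether a file starts with the three UTF-8 BOM bytes EF BB BF, and the two temporary files exist only for illustration.

```r
# Hypothetical helper (not from the original post): TRUE when the file
# starts with the three UTF-8 BOM bytes EF BB BF.
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Build two throwaway files to try it on: one with a BOM, one without.
csv_bytes <- charToRaw("reg_date,value\n2015-01-01,1\n")
bom_file <- tempfile(fileext = ".csv")
plain_file <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), csv_bytes), bom_file)
writeBin(csv_bytes, plain_file)

has_utf8_bom(bom_file)    # TRUE
has_utf8_bom(plain_file)  # FALSE
```

A check like this could decide whether remove.BOM needs to be called at all.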

I am using the native encoding for the R session:

> options("encoding" = "")
> options("encoding")
$encoding
[1] ""

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? ?file says:

As from R 3.0.0 the encoding '"UTF-8-BOM"' is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).
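To try this suggestion without the original data, here is a small sketch that writes a temporary CSV starting with the BOM bytes and reads it back. The file contents and the second column name (value) are made up for illustration; only reg_date comes from the question.

```r
# Create a small CSV whose first three bytes are the UTF-8 BOM (EF BB BF).
# The contents are illustrative, not the asker's actual data.
path <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw("reg_date,value\n2015-01-01,1\n")),
  path
)

# fileEncoding = "UTF-8-BOM" asks R to drop the BOM while decoding the file.
frame_pers <- read.csv(path, fileEncoding = "UTF-8-BOM")
names(frame_pers)[1]  # "reg_date"
```

Without the fileEncoding argument, what the first column name looks like depends on the platform and locale, which is why the garbled prefix differs between systems.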

For fread, this was handled in data.table between versions 1.9.6 and 1.9.8; update your data.table installation to fix this.

Once done, you can just use fread:

fread("file_name.csv")

I know it's been 8 years, but I just had this problem and came across this question, so this might help. An important detail (mentioned by hadley above) is that it needs to be fileEncoding="UTF-8-BOM", not just encoding="UTF-8-BOM". The encoding argument works for a few options but not UTF-8-BOM. Go figure. Found this out here: https://www.johndcook.com/blog/2019/09/07/excel-r-bom/
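The difference makes more sense once you see where each argument goes: in read.csv, fileEncoding is passed on to the file connection so the input is re-encoded, while encoding only marks the resulting strings and does not re-encode anything. A sketch that opens the connection directly, assuming a made-up temporary file; note that on file() itself the argument is named encoding.

```r
# Illustrative file with a leading UTF-8 BOM (contents are made up).
path <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw("reg_date,value\n2015-01-01,1\n")),
  path
)

# On a connection the argument is called `encoding`; this is what
# read.csv(..., fileEncoding = "UTF-8-BOM") sets up internally.
con <- file(path, open = "r", encoding = "UTF-8-BOM")
lines <- readLines(con)
close(con)

lines[1]  # "reg_date,value"
```

The same connection-level encoding should work with other readers that accept a connection, such as scan.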
