
Why is R reading UTF-8 header as text?

I saved an Excel table as text (*.txt). Unfortunately, Excel doesn't let me choose the encoding, so I have to open the file in Notepad (which opens it as ANSI) and save it as UTF-8. Then, when I read it in R:

data <- read.csv("my_file.txt",header=TRUE,sep="\t",encoding="UTF-8")

it shows the name of the first column beginning with "X.U.FEFF.". I know these are the bytes reserved to tell any program that the file is in UTF-8 format, so they shouldn't appear as text! Is this a bug? Or am I missing some option? Thanks in advance!

So I was going to give you instructions on how to manually open the file, check for the BOM, and discard it, but then I noticed this (in ?file):

As from R 3.0.0 the encoding "UTF-8-BOM" is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).

which means that if you have a sufficiently new R interpreter,

read.csv("my_file.txt", fileEncoding="UTF-8-BOM", ...other args...)

should do what you want.
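To try this without touching a real export, here is a minimal sketch. It builds a temporary tab-separated file that starts with the UTF-8 BOM bytes, confirms the BOM is there, and reads it back with fileEncoding="UTF-8-BOM". The file contents and the names path, has_bom and data are made up for illustration.

```r
# Build a small tab-separated file that starts with the UTF-8 BOM (EF BB BF).
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("id\tname\n1\ta\n2\tb\n"), con)
close(con)

# Check the first three bytes to confirm that a BOM is present.
has_bom <- identical(readBin(path, what = "raw", n = 3L),
                     as.raw(c(0xEF, 0xBB, 0xBF)))

# fileEncoding = "UTF-8-BOM" (R >= 3.0.0) drops the BOM while reading,
# so the first column name should come back clean.
data <- read.csv(path, header = TRUE, sep = "\t", fileEncoding = "UTF-8-BOM")
names(data)
```

Note that the argument is fileEncoding, not encoding: the former controls how the file is decoded as it is read, while the latter only marks strings that have already been read.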

Most of the arguments in read.csv are dummy args, including fileEncoding.

Use read.table instead:

 read.table("my_file.txt", header=TRUE, sep="\t", fileEncoding="UTF-8")

I had the same issue loading a CSV file using read.csv (with encoding="UTF-8-BOM"), read.table, or read_csv from the readr package. None of these attempts proved successful.

I definitely could not work with the BOM tag in place, because when subsetting my data (using either subset() or df[df$var=="value",]), the first row was not taken into account.

I finally found a workaround that made the BOM tag vanish. Using the read.csv function, I just defined a string vector for my column names in the argument col.names = ... . This works like a charm and I can subset my data without issues.
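A minimal sketch of this workaround, assuming a two-column tab-separated file. The file contents and the column names "id" and "name" are made up for illustration.

```r
# Build a small tab-separated file that starts with the UTF-8 BOM (EF BB BF).
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("id\tname\n1\ta\n2\tb\n"), con)
close(con)

# header = TRUE still consumes the first line of the file, but the names
# supplied in col.names replace whatever was read from it, BOM included.
df <- read.csv(path, header = TRUE, sep = "\t",
               col.names = c("id", "name"))

# Subsetting by the first column now works with the clean name.
first_row <- df[df$id == 1, ]
first_row
```

The vector passed to col.names has to list every column in the file, in order.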

I am using R version 3.5.0.

Possible solution from the comments:

Try it with the read.csv argument check.names=FALSE. Note that if you use this, you will not be able to directly reference columns with the $ notation unless you surround the name in quotes. For instance: yourdf$"first col".
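As a small illustration of the quoting rule, the sketch below reads a file whose header contains a space. The file contents and column names are made up, and the file is written without a BOM so that the example only shows how check.names=FALSE changes column access.

```r
# Write a tab-separated file whose first header contains a space.
path <- tempfile(fileext = ".txt")
writeLines(c("first col\tvalue", "a\t1", "b\t2"), path)

# check.names = FALSE keeps the header text exactly as it appears in the file
# instead of converting it to a syntactically valid R name.
yourdf <- read.csv(path, header = TRUE, sep = "\t", check.names = FALSE)
names(yourdf)

# A name that is not a valid R identifier has to be quoted after $,
# or accessed with [[ ]].
yourdf$"first col"
yourdf[["first col"]]
```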


 