Reading Rdata file with different encoding
I have an .RData file to read on my Linux (UTF-8) machine, but I know the file is in Latin1 because I created it myself on Windows. Unfortunately, I don't have access to the original files or a Windows machine, and I need to read the file on my Linux machine.
To read an Rdata file, the normal procedure is to run load("file.Rdata"). Functions such as read.csv have an encoding argument that you can use to solve this kind of issue, but load has no such thing. If I try load("file.Rdata", encoding = "latin1"), I just get this (expected) error:
Error in load("file.Rdata", encoding = "latin1") : unused argument (encoding = "latin1")
What else can I do? My files contain text variables with accented characters that get corrupted when opened in a UTF-8 environment.
Thanks to 42's comment, I've managed to write a function to recode the file:
fix.encoding <- function(df, originalEncoding = "latin1") {
  numCols <- ncol(df)
  for (col in 1:numCols) Encoding(df[, col]) <- originalEncoding
  return(df)
}
The meat here is the command Encoding(df[, col]) <- "latin1", which takes column col of dataframe df and marks it as latin1. Strictly speaking, this does not convert anything: it declares how the existing bytes should be interpreted, which is what lets R display them correctly in a UTF-8 session. Unfortunately, Encoding only accepts character vectors as input, not whole data frames, so I had to create a function to sweep all columns of a dataframe object and apply the transformation.
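To make the difference between declaring and converting an encoding concrete, here is a minimal sketch. It builds a latin1 string from raw bytes (an assumption made only so the example does not depend on how this page itself is encoded) and uses base R's enc2utf8 for the actual conversion step, which the answer above does not use.

```r
# "café" in latin1: the "é" is the single byte 0xE9.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
Encoding(x)               # "unknown": R has no encoding mark for these bytes

Encoding(x) <- "latin1"   # declare the encoding; the bytes are unchanged
charToRaw(x)              # 63 61 66 e9

y <- enc2utf8(x)          # convert the marked string to UTF-8
charToRaw(y)              # 63 61 66 c3 a9
Encoding(y)               # "UTF-8"
```

Marking the column is usually enough for display and comparison; the explicit conversion is only needed if the bytes themselves must be UTF-8, for example before writing them to a file.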
Of course, if your problem is in just a couple of columns, you're better off applying Encoding to those columns only instead of the whole dataframe (you can modify the function above to take a set of columns as input). Also, if you're facing the inverse problem, i.e. reading an R object created on Linux or Mac OS into Windows, you should use originalEncoding = "UTF-8".
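One way the suggested modification could look is sketched below. The name fix.encoding.cols and its cols argument are hypothetical, introduced here for illustration only; the sample data frame is built from raw latin1 bytes so the example is self-contained.

```r
# Hypothetical variant: `cols` is an assumed extra argument that selects
# which columns to mark. Non-character columns are skipped silently.
fix.encoding.cols <- function(df, cols = names(df), originalEncoding = "latin1") {
  for (col in cols) {
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- originalEncoding
    }
  }
  df
}

df <- data.frame(
  name = rawToChar(as.raw(c(0x4a, 0x6f, 0x73, 0xe9))),  # "José" as latin1 bytes
  city = "Lyon",
  n = 1L,
  stringsAsFactors = FALSE
)

fixed <- fix.encoding.cols(df, cols = "name")
Encoding(fixed$name)   # "latin1"
Encoding(fixed$city)   # "unknown" (pure ASCII strings are never marked)
```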
Following up on previous answers, this is a minor update that makes it work on factors and dplyr's tibbles. Thanks for the inspiration.
library(tibble)  # as_tibble() replaces the deprecated as_data_frame()

fix.encoding <- function(df, originalEncoding = "UTF-8") {
  df <- data.frame(df)  # plain data.frame, so df[, col] returns a vector
  for (col in seq_len(ncol(df))) {
    if (is.character(df[, col])) {
      Encoding(df[, col]) <- originalEncoding
    }
    if (is.factor(df[, col])) {
      # For factors, the text lives in the levels
      Encoding(levels(df[, col])) <- originalEncoding
    }
  }
  return(as_tibble(df))
}
Thank you for posting this. I took the liberty of modifying your function to handle a dataframe in which some columns are character and some are not. Otherwise, an error occurs:
> fix.encoding(adress)
Error in `Encoding<-`(`*tmp*`, value = "latin1") :
a character vector argument expected
So here is the modified function:
fix.encoding <- function(df, originalEncoding = "latin1") {
  numCols <- ncol(df)
  for (col in 1:numCols) {
    # Only character columns can carry an encoding mark
    if (is.character(df[, col])) {
      Encoding(df[, col]) <- originalEncoding
    }
  }
  return(df)
}
However, this will not change the encoding of the level names in a "factor" column. Luckily, I found this way to change all factors in your dataframe to character (which may not be the best approach, but in my case that's what I needed):
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)
Another option using dplyr's mutate_if:
library(dplyr)  # provides %>% and mutate_if

fix_encoding <- function(x) {
  Encoding(x) <- "latin1"
  return(x)
}
data <- data %>%
  mutate_if(is.character, fix_encoding)
And for factor variables that have to be recoded:
fix_encoding_factor <- function(x) {
  x <- as.character(x)
  Encoding(x) <- "latin1"
  x <- as.factor(x)
  return(x)
}
data <- data %>%
  mutate_if(is.factor, fix_encoding_factor)
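Tying these answers back to the original question about load, the pieces can be combined into a base R sketch that loads an .RData file into its own environment and marks every character vector, factor and data frame it finds. The helper names fix_object and load_with_encoding are hypothetical, and the sketch assumes the file holds only those kinds of objects (nested lists, for instance, are left untouched).

```r
# Hypothetical helper: mark a character vector, the levels of a factor,
# or every column of a data frame with the given encoding.
fix_object <- function(x, originalEncoding = "latin1") {
  if (is.character(x)) {
    Encoding(x) <- originalEncoding
  } else if (is.factor(x)) {
    Encoding(levels(x)) <- originalEncoding
  } else if (is.data.frame(x)) {
    x[] <- lapply(x, fix_object, originalEncoding = originalEncoding)
  }
  x
}

# Hypothetical wrapper around load(): returns an environment holding the
# fixed objects instead of writing them into the workspace.
load_with_encoding <- function(path, originalEncoding = "latin1") {
  env <- new.env()
  load(path, envir = env)
  for (name in ls(env)) {
    assign(name, fix_object(get(name, envir = env), originalEncoding), envir = env)
  }
  env
}

# Usage: write a small test file containing unmarked latin1 bytes, then read it back.
accented <- rawToChar(as.raw(c(0x4a, 0x6f, 0x73, 0xe9)))  # "José" as latin1 bytes
people <- data.frame(name = accented,
                     group = factor(accented, levels = accented),
                     n = 1L,
                     stringsAsFactors = FALSE)
path <- tempfile(fileext = ".RData")
save(people, file = path)

loaded <- load_with_encoding(path)
Encoding(loaded$people$name)            # "latin1"
Encoding(levels(loaded$people$group))   # "latin1"
```

Returning an environment keeps the loaded objects from overwriting anything in the current session; individual objects can then be pulled out with loaded$people or similar.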