Reading Rdata file with different encoding

I have an .RData file to read on my Linux (UTF-8) machine, and I know the file is in Latin1 because I created it myself on Windows. Unfortunately, I don't have access to the original files or to a Windows machine, so I need to read the file on my Linux machine.

To read an Rdata file, the normal procedure is to run load("file.Rdata"). Functions such as read.csv have an encoding argument that you can use to solve this kind of issue, but load has no such argument. If I try load("file.Rdata", encoding = "latin1"), I just get this (expected) error:

Error in load("file.Rdata", encoding = "latin1") : unused argument (encoding = "latin1")

What else can I do? My files contain text variables with accented characters that get corrupted when opened in a UTF-8 environment.

Thanks to 42's comment, I've managed to write a function to recode the file:

fix.encoding <- function(df, originalEncoding = "latin1") {
  numCols <- ncol(df)
  for (col in 1:numCols) Encoding(df[, col]) <- originalEncoding
  return(df)
}

The meat here is the command Encoding(df[, col]) <- "latin1", which takes column col of dataframe df and declares its strings to be Latin-1 encoded. Strictly speaking this does not convert anything: the bytes stay the same and R is simply told how to interpret them. Unfortunately, Encoding only accepts a character vector as input, so I had to create a function to sweep all columns of a dataframe object and apply the change to each one.
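To make the effect of that assignment concrete, here is a minimal sketch on a single made-up string (the sample bytes are an assumption for illustration, not data from the question). It compares the raw bytes after marking the string with Encoding and after actually converting it with enc2utf8:

```r
# "café" written as Latin-1 bytes: 0xE9 is "é" in Latin-1 (hypothetical sample)
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Declare the encoding; this only sets a mark on the string
Encoding(x) <- "latin1"
bytes_after_mark <- charToRaw(x)      # still 63 61 66 e9

# enc2utf8() performs a real conversion, using the mark set above
y <- enc2utf8(x)
bytes_after_convert <- charToRaw(y)   # 63 61 66 c3 a9
```

Once the mark is correct, R can translate the string for display on a UTF-8 system, which is why setting it is usually enough to make the accents show up properly.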

Of course, if your problem is in just a couple of columns, you're better off applying Encoding to those columns only instead of the whole dataframe (you can modify the function above to take a set of columns as input). Also, if you're facing the inverse problem, i.e. reading an R object created on Linux or Mac OS into Windows, you should use originalEncoding = "UTF-8".
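One possible shape for the column-restricted variant mentioned above is sketched here. The name fix.encoding.cols and its cols argument are hypothetical, not part of the original answer; the function also skips non-character columns instead of failing on them.

```r
# Hypothetical variant of fix.encoding that only touches the requested columns.
# `cols` may hold column names or column indices.
fix.encoding.cols <- function(df, cols, originalEncoding = "latin1") {
  for (col in cols) {
    # Encoding<- only accepts character vectors, so skip everything else
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- originalEncoding
    }
  }
  df
}

# Usage, assuming a data frame `df` with text columns "name" and "city":
# df <- fix.encoding.cols(df, c("name", "city"))
```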

Following up on the previous answers, this is a minor update that makes the function work on factors and on dplyr's tibbles. Thanks for the inspiration.

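fix.encoding <- function(df, originalEncoding = "UTF-8") {
  # Work on a plain data.frame so that column extraction behaves the same for tibbles
  df <- as.data.frame(df)
  for (col in seq_len(ncol(df))) {
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- originalEncoding
    }
    if (is.factor(df[[col]])) {
      # For factors, the text lives in the levels
      Encoding(levels(df[[col]])) <- originalEncoding
    }
  }
  # as_tibble() replaces the deprecated as_data_frame(); needs the tibble package
  return(tibble::as_tibble(df))
}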

Thank you for posting this. I took the liberty of modifying your function to handle a dataframe in which some columns are character and some are not. Otherwise, an error occurs:

> fix.encoding(adress)
Error in `Encoding<-`(`*tmp*`, value = "latin1") :
 a character vector argument expected

So here is the modified function:

fix.encoding <- function(df, originalEncoding = "latin1") {
    for (col in seq_len(ncol(df))) {
        # Only character columns can carry an encoding mark
        if (is.character(df[[col]])) {
            Encoding(df[[col]]) <- originalEncoding
        }
    }
    return(df)
}

However, this will not change the encoding of the level names in a factor column. Luckily, I found the following snippet, which converts all factors in your dataframe to character (this may not be the best approach, but in my case it was what I needed):

i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)

Another option, using dplyr's mutate_if:

library(dplyr)  # provides %>% and mutate_if

fix_encoding <- function(x) {
  Encoding(x) <- "latin1"
  return(x)
}
data <- data %>%
  mutate_if(is.character, fix_encoding)

And for factor variables that have to be recoded:

fix_encoding_factor <- function(x) {
  x <- as.character(x)
  Encoding(x) <- "latin1"
  x <- as.factor(x)
  return(x)
}
data <- data %>% 
  mutate_if(is.factor,fix_encoding_factor) 
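One thing to keep in mind with the as.character / as.factor round trip is that as.factor rebuilds the levels in sorted order and drops unused ones, so any custom level order is lost. A base R sketch that avoids this by marking the levels in place, and that also handles character columns, could look like the following. The name fix_encoding_any is made up for illustration and this is not the answerer's code.

```r
# Hypothetical helper: mark character vectors and factor levels as latin1,
# and return every other type unchanged.
fix_encoding_any <- function(x, originalEncoding = "latin1") {
  if (is.character(x)) {
    Encoding(x) <- originalEncoding
  } else if (is.factor(x)) {
    # Marking the levels keeps the level order and the integer codes as they were
    Encoding(levels(x)) <- originalEncoding
  }
  x
}

# Usage, assuming `data` is a data frame loaded from the .RData file:
# data[] <- lapply(data, fix_encoding_any)
```

The data[] <- lapply(...) form assigns the columns back into the existing dataframe, so it stays a dataframe rather than becoming a list.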
