Reading Rdata file with different encoding

Question

I have an .RData file to read on my Linux (UTF-8) machine, but I know the file is in Latin1 because I've created them myself on Windows. Unfortunately, I don't have access to the original files or a Windows machine and I need to read those files on my Linux machine.

To read an Rdata file, the normal procedure is to run load("file.Rdata") . Functions such as read.csv have an encoding argument that you can use to solve those kind of issues, but load has no such thing. If I try load("file.Rdata", encoding = latin1) , I just get this (expected) error:

Error in load("file.Rdata", encoding = "latin1") : unused argument (encoding = "latin1")

What else can I do? My files are loaded with text variables containing accents that get corrupted when opened in an UTF-8 environment.

Answer 1

Thanks to 42's comment, I've managed to write a function to recode the file:

fix.encoding <- function(df, originalEncoding = "latin1") {
  numCols <- ncol(df)
  for (col in 1:numCols) Encoding(df[, col]) <- originalEncoding
  return(df)
}

The meat here is the command Encoding(df[, col]) <- "latin1" , which takes column col of dataframe df and converts it to latin1 format. Unfortunately, Encoding only takes column objects as input, so I had to create a function to sweep all columns of a dataframe object and apply the transformation.

Of course, if your problem is in just a couple of columns, you're better off just applying the Encoding to those columns instead of the whole dataframe (you can modify the function above to take a set of columns as input). Also, if you're facing the inverse problem, ie reading an R object created in Linux or Mac OS into Windows, you should use originalEncoding = "UTF-8" .

Answer 2

following up on previous answers, this is a minor update which makes it work on factors and dplyr's tibble. Thanks for inspiration.

fix.encoding <- function(df, originalEncoding = "UTF-8") {
numCols <- ncol(df)
df <- data.frame(df)
for (col in 1:numCols)
{
        if(class(df[, col]) == "character"){
                Encoding(df[, col]) <- originalEncoding
        }

        if(class(df[, col]) == "factor"){
                        Encoding(levels(df[, col])) <- originalEncoding
}
}
return(as_data_frame(df))
}

Answer 3

Thank you for posting this. I took the liberty to modify your function in case you have a dataframe with some columns as character and some as non-character. Otherwise, an error occurs:

> fix.encoding(adress)
Error in `Encoding<-`(`*tmp*`, value = "latin1") :
 a character vector argument expected

So here is the modified function:

fix.encoding <- function(df, originalEncoding = "latin1") {
    numCols <- ncol(df)
    for (col in 1:numCols)
            if(class(df[, col]) == "character"){
                    Encoding(df[, col]) <- originalEncoding
            }
    return(df)
}

However, this will not change the encoding of level's names in a "factor" column. Luckily, I found this to change all factors in your dataframe to character (which may be not the best approach, but in my case that's what I needed):

i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)

Answer 4

Another option using dplyr's mutate_if :

fix_encoding <- function(x) {
  Encoding(x) <- "latin1"
  return(x)
}
data <- data %>% 
  mutate_if(is.character,fix_encoding)

And for factor variables that have to be recoded:

fix_encoding_factor <- function(x) {
  x <- as.character(x)
  Encoding(x) <- "latin1"
  x <- as.factor(x)
  return(x)
}
data <- data %>% 
  mutate_if(is.factor,fix_encoding_factor)

Reading Rdata file with different encoding

Question

4 answers

solution1
7 ACCPTED 2015-12-02 15:27:53

solution2
2 2016-11-08 02:28:59

solution3
1 2016-10-21 07:38:50

solution4
0 2020-08-27 12:52:23

Reading Rdata file with different encoding

Question

4 answers

solution1 7 ACCPTED 2015-12-02 15:27:53

solution2 2 2016-11-08 02:28:59

solution3 1 2016-10-21 07:38:50

solution4 0 2020-08-27 12:52:23

solution1
7 ACCPTED 2015-12-02 15:27:53

solution2
2 2016-11-08 02:28:59

solution3
1 2016-10-21 07:38:50

solution4
0 2020-08-27 12:52:23