
Removing special characters from factor and converting to numeric in R

I need to convert a messy factor into a numeric. The sample data looks like this:

x <- structure(c(4L, 5L, 1L, 6L, 6L, 2L, 3L), 
    .Label = c("", "106", "39", "8", "80", "chyb\x92 foto"), class = "factor")

My desired output would be:

x
[1]   8  80  NA  NA  NA 106  39
class(x)
"numeric"

However, the first line of my intended code produces a warning, and the text is not replaced with NA.

x[grepl("[a-z]", x) | x==""] <- NA
x <- as.numeric(levels(x))[x]

Warning messages:
1: In grepl("[a-z]", x) : input string 4 is invalid in this locale
2: In grepl("[a-z]", x) : input string 5 is invalid in this locale

The second line then runs correctly and gives the desired output, with NAs introduced by coercion. Why does grepl fail to recognise letters in some factor levels, and how is as.numeric able to pick them out and replace them with NA?

The factor-to-numeric conversion was taken from this question. However, the fact that it works does not explain why it works.

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] cs_CZ.UTF-8/cs_CZ.UTF-8/cs_CZ.UTF-8/C/cs_CZ.UTF-8/cs_CZ.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.3.0

We can just do

as.numeric(as.character(x))
#[1]   8  80  NA  NA  NA 106  39

If we use grepl, we can match only the strings that are numeric from start ( ^ ) to end ( $ ), negate the result ( ! ), and assign NA to everything else. As 'x' is a factor, we then convert it to numeric with as.numeric(as.character(x)).

x[!grepl("^[0-9.]+$", x)] <- NA
as.numeric(as.character(x))
#[1]   8  80  NA  NA  NA 106  39

It seems I found the solution. Thanks to akrun, Cath and Tensibai for pointing me towards Encoding. My levels(x) were encoded as "unknown", and grepl did find the values containing text once it was instructed to match byte by byte:

grepl("[a-z]", x, useBytes = TRUE)
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
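As a minimal sketch, byte-wise matching can be combined with the anchored numeric pattern from the answer above, so that the invalid byte in one level no longer triggers the locale warning. The names is_number and result are only illustrative, and the factor is rebuilt with levels = instead of .Label purely for readability.

```r
# Sample factor from the question; "\x92" is a single byte that is not valid UTF-8.
x <- structure(c(4L, 5L, 1L, 6L, 6L, 2L, 3L),
               levels = c("", "106", "39", "8", "80", "chyb\x92 foto"),
               class = "factor")

# Match byte by byte, so the validity of the strings in the current
# locale is never checked.
is_number <- grepl("^[0-9.]+$", x, useBytes = TRUE)

# Everything that is not purely digits and dots becomes NA.
x[!is_number] <- NA

# The remaining labels parse cleanly, so no coercion warning is expected.
result <- as.numeric(as.character(x))
print(result)
```

Matching on bytes is enough here because the pattern only contains ASCII characters; a pattern with accented letters would need the encoding to be sorted out first.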

Tensibai's suggestion to specify the encoding gives the same result with grepl:

levels(x) <- enc2utf8(levels(x))
grepl("[a-z]", x, useBytes = FALSE)
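If the encoding is in doubt, it may help to check the levels explicitly before matching. The sketch below uses validUTF8() (available since R 3.3.0) to find the offending level and iconv() to re-encode it. Reading the bytes as latin1 is purely an assumption for illustration: the thread does not state the real source encoding, and latin1 is chosen only because it maps every byte to some character. The variable names are hypothetical.

```r
# Levels from the question; the last one contains the raw byte 0x92.
lev <- c("", "106", "39", "8", "80", "chyb\x92 foto")

# Which levels are valid UTF-8? Only the last one should fail.
valid_before <- validUTF8(lev)
print(valid_before)

# ASSUMPTION: treat the bytes as latin1 and convert them to UTF-8.
# The true source encoding is unknown, so the resulting character
# may not be the one the data originally contained.
fixed_lev <- iconv(lev, from = "latin1", to = "UTF-8")
valid_after <- validUTF8(fixed_lev)

# With valid input, grepl no longer needs useBytes = TRUE.
has_letters <- grepl("[a-z]", fixed_lev)
print(has_letters)
```

Once the levels are valid, they could be written back with levels(x) <- fixed_lev before any further pattern matching.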

Unlike grepl, which has to deal with accented characters and various encodings, as.numeric only checks whether each value can be interpreted as a number. No text can, regardless of its encoding.

Using as.numeric(levels(x))[x] for factor conversion may therefore be safe to use by itself, without checking for problematic values first.
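To illustrate, the two conversion idioms mentioned in this thread can be compared directly on the sample data. This is a sketch only: suppressWarnings() is added here to hide the expected "NAs introduced by coercion" warning, and the variable names are illustrative.

```r
# Sample factor from the question, including the level with the invalid byte.
x <- structure(c(4L, 5L, 1L, 6L, 6L, 2L, 3L),
               levels = c("", "106", "39", "8", "80", "chyb\x92 foto"),
               class = "factor")

# Idiom 1: turn every element into its label, then parse each label.
via_character <- suppressWarnings(as.numeric(as.character(x)))

# Idiom 2: parse each level once, then index the result by the factor codes.
via_levels <- suppressWarnings(as.numeric(levels(x))[x])

# Both idioms should agree; non-numeric labels, including "", become NA.
print(via_levels)
print(identical(via_character, via_levels))
```

The second idiom parses each distinct level only once, which can matter for long factors with few levels.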


 