I need to convert a messy factor into a numeric. The sample data looks like this:
x <- structure(c(4L, 5L, 1L, 6L, 6L, 2L, 3L),
.Label = c("", "106", "39", "8", "80", "chyb\x92 foto"), class = "factor")
My desired output would be:
x
[1] 8 80 NA NA NA 106 39
class(x)
"numeric"
However, the first line of my intended code results in a warning, and the text is not replaced with NAs.
x[grepl("[a-z]", x) | x==""] <- NA
x <- as.numeric(levels(x))[x]
Warning messages:
1: In grepl("[a-z]", x) : input string 4 is invalid in this locale
2: In grepl("[a-z]", x) : input string 5 is invalid in this locale
The second line then runs correctly and provides the correct output, with NAs introduced by coercion. Why does grepl fail to recognise letters in some factor levels, and how can as.numeric pick them out and replace them with NAs?
The factor to numeric conversion was taken from this question. However, the fact that it works does not explain why it works.
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)
locale:
[1] cs_CZ.UTF-8/cs_CZ.UTF-8/cs_CZ.UTF-8/C/cs_CZ.UTF-8/cs_CZ.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.0
We can just do
as.numeric(as.character(x))
#[1] 8 80 NA NA NA 106 39
If we are using grepl, we make sure that we match only strings that are numeric from start (^) to end ($), negate (!) the result, and assign NA to those values. As 'x' is a factor, we can then convert it to numeric with as.numeric(as.character(x)).
x[!grepl("^[0-9.]+$", x)] <- NA
as.numeric(as.character(x))
#[1] 8 80 NA NA NA 106 39
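As a minimal sketch of the anchored-pattern approach, the example below uses a hypothetical plain-ASCII factor `f` (not the data from the question), so it does not trigger the locale warning. Note that "^[0-9.]+$" would also accept strings such as "1.2.3", which as.numeric would still turn into NA with a warning.

```r
# Hypothetical sample data (plain ASCII, not the factor from the question)
f <- factor(c("8", "80", "", "n/a", "106", "3.9"))

# TRUE only where the whole string consists of digits and dots
is_num <- grepl("^[0-9.]+$", f)   # grepl() coerces the factor to character

f[!is_num] <- NA                  # blank out everything that is not numeric
out <- as.numeric(as.character(f))
out
```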
It seems I found the solution. Thanks to akrun, Cath and Tensibai for pointing me towards Encoding. My levels(x) were encoded as "unknown", for which grepl found values with text when it was instructed to read bytes:
grepl("[a-z]", x, useBytes = TRUE)
[1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
Tensibai's suggestion to specify the encoding gives the same result with grepl.
levels(x) <- enc2utf8(levels(x))
grepl("[a-z]", x, useBytes = FALSE)
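One way to see which strings are the problematic ones is to inspect them directly. The sketch below works on a plain character vector `lev` (a name introduced here for illustration) holding the same values as levels(x), and assumes R 3.3.0 or later, where validUTF8 is available. The original encoding of the \x92 byte is not known from the post, so no conversion from a specific encoding is attempted.

```r
# Same strings as levels(x); 'lev' is an illustrative name
lev <- c("", "106", "39", "8", "80", "chyb\x92 foto")

enc   <- Encoding(lev)     # declared encoding of each string
valid <- validUTF8(lev)    # FALSE where the bytes are not valid UTF-8

# Byte-wise matching sidesteps the locale check that triggers the warning
has_letters <- grepl("[a-z]", lev, useBytes = TRUE)

data.frame(enc, valid, has_letters)
```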
Unlike grepl, which has to deal with accented characters and various encodings, as.numeric simply checks whether each value can be interpreted as a number. Text, regardless of its encoding, cannot, so it becomes NA.
Using as.numeric(levels(x))[x] for factor conversion might be a safe method to use by itself, without the need to check for problematic values first.
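To illustrate why this idiom works, the sketch below splits it into two steps: the levels are converted once, and the factor's integer codes then pick the matching value for each element. The names `lev_num` and `res` are introduced here for illustration, and suppressWarnings is only used to silence the expected "NAs introduced by coercion" message.

```r
x <- structure(c(4L, 5L, 1L, 6L, 6L, 2L, 3L),
               levels = c("", "106", "39", "8", "80", "chyb\x92 foto"),
               class = "factor")

# Step 1: convert each level once; non-numeric levels become NA
lev_num <- suppressWarnings(as.numeric(levels(x)))

# Step 2: indexing by the factor uses its integer codes
res <- lev_num[x]
res

# Same result as going through character first
same <- identical(res, suppressWarnings(as.numeric(as.character(x))))
```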