Numeric values in R and dealing with missing values

Using the example dataframe:

df <- structure(list(
  KY27PHY1 = c("4", "5", "5", "4", "-", "4", "2","3", "5", "-", "4", "3", "3", "5", "5"),
  KY27PHY2 = c("4", "4","4", "4", "-", "5", "2", "3", "5", "-", "5", "3", "3", "5", "5"),
  KY27PHY3 = c("5", "4", "4", "4", "-", "5", "1", "4", "5","-", "4", "3", "3", "5", "5")),
                .Names = c("KY27PHY1", "KY27PHY2","KY27PHY3"),
                row.names = 197:211,
                class = "data.frame")

I have been using the following code to convert the values to numeric:

df$KY27PHY1<-as.numeric(df$KY27PHY1)
df$KY27PHY2<-as.numeric(df$KY27PHY2)
df$KY27PHY3<-as.numeric(df$KY27PHY3)

Since I have missing values in the df dataframe, I always get the warning message:

Warning message:
NAs introduced by coercion 

I presume this isn't a problem, but I would like some advice on how to improve the code so that I don't get this message.

Also, how can I convert all the columns (specified by name) in one go?

Many thanks in advance.

I see two possibilities:

  1. The unlikely one is that you built your data.frame in R. In that case, just change your code to create integer vectors in the first place, or replace "-" with NA so that the as.numeric conversion won't complain.

  2. The more likely one is that your data.frame came from outside R and you read it with one of the read.table or read.csv functions. In that case, just add na.strings = "-" to your call and R will treat each "-" as NA. Also, if there are no other unexpected values in these columns, the type.convert function called inside these functions will automatically detect that the columns contain only integers and store them as such.
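Both possibilities can be sketched in base R. The snippet below is only an illustration: df_small and csv_text are made-up stand-ins (not the asker's data or file), with column names borrowed from the question.

```r
# Possibility 1: the data frame was built in R with "-" as a placeholder.
df_small <- data.frame(
  KY27PHY1 = c("4", "5", "-"),
  KY27PHY2 = c("4", "-", "2"),
  stringsAsFactors = FALSE
)
df_small[df_small == "-"] <- NA                  # mark placeholders as missing
df_small[] <- lapply(df_small, as.numeric)       # converts without a coercion warning

# Possibility 2: the data came from a file; declare "-" as NA at import time.
csv_text <- "KY27PHY1,KY27PHY2\n4,4\n5,-\n-,2"   # hypothetical file contents
df_read <- read.csv(text = csv_text, na.strings = "-")
str(df_read)                                     # columns are detected as integer
```

With a real file, the text = csv_text argument would be replaced by the file path.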

data.table is super fast; consider using it whenever you work with data.frames. For your question, that would be:

library(data.table)
dt = as.data.table(df)
dt[,lapply(.SD,as.numeric)]
    KY27PHY1 KY27PHY2 KY27PHY3
 1:        4        4        5
 2:        5        4        4
 3:        5        4        4
 4:        4        4        4
 5:       NA       NA       NA
 6:        4        5        5
 7:        2        2        1
 8:        3        3        4
 9:        5        5        5
10:       NA       NA       NA
11:        4        5        4
12:        3        3        3
13:        3        3        3
14:        5        5        5
15:        5        5        5

Of course, you still get some warnings, because "-" cannot be converted to a number.
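If you want to confirm that "-" is the only value behind the warning before ignoring it, one possible check in base R is to convert quietly and then list the inputs that turned into NA. The vector x below is a made-up stand-in for one column of the question's data.

```r
# Stand-in for a single character column such as df$KY27PHY1.
x <- c("4", "5", "-", "3", "-")

# Convert without printing the warning, then find the non-missing
# inputs that could not be parsed as numbers.
converted <- suppressWarnings(as.numeric(x))
bad <- unique(x[is.na(converted) & !is.na(x)])
print(bad)
```

If bad contains anything other than the expected placeholder, the data probably needs a closer look before the NAs are accepted.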

You can use sapply to do them all at once, but the result is a matrix, so you have to wrap it in as.data.frame to convert it back. The warnings are just there to tell you that some values in your original data could not be parsed as numbers and were replaced by NA. In your case, those values were "-". To keep the warnings from printing, use suppressWarnings:

suppressWarnings(as.data.frame(sapply(df,as.numeric)))
   KY27PHY1 KY27PHY2 KY27PHY3
1         4        4        5
2         5        4        4
3         5        4        4
4         4        4        4
5        NA       NA       NA
6         4        5        5
7         2        2        1
8         3        3        4
9         5        5        5
10       NA       NA       NA
11        4        5        4
12        3        3        3
13        3        3        3
14        5        5        5
15        5        5        5
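Note that the output above has lost the original row names (197 to 211), because sapply returns a plain matrix. A possible alternative, sketched here on a made-up df_small with a hypothetical extra column named other, is to assign the result of lapply back into the columns selected by name. This keeps the data frame, its row names, and any untouched columns.

```r
# Small stand-in data frame; "other" is a hypothetical column to leave alone.
df_small <- data.frame(
  KY27PHY1 = c("4", "-", "2"),
  KY27PHY2 = c("5", "3", "-"),
  other    = c("a", "b", "c"),
  row.names = 197:199,
  stringsAsFactors = FALSE
)

cols <- c("KY27PHY1", "KY27PHY2")   # columns to convert, chosen by name

# Convert only the named columns in place; warnings about "-" are suppressed.
df_small[cols] <- suppressWarnings(lapply(df_small[cols], as.numeric))
str(df_small)
```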

Some time back I wrote a small function that turns certain values in a data.frame into NA and then uses type.convert on the result, as if you had used read.table with na.strings specified.

Here's the function:

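makemeNA <- function(mydf, NAStrings, fixed = TRUE) {
  if (!isTRUE(fixed)) {
    # Treat NAStrings as a regular expression: blank out every match,
    # then let "" act as the NA marker below.
    mydf <- data.frame(lapply(mydf, function(x) gsub(NAStrings, "", x)))
    NAStrings <- ""
  }
  # Re-detect each column's type, as read.table() would with na.strings set.
  # as.is = FALSE keeps the original behaviour (strings become factors).
  mydf <- data.frame(lapply(mydf, function(x) type.convert(
    as.character(x), na.strings = NAStrings, as.is = FALSE)))
  mydf
}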

Here it is in use:

makemeNA(df, "-")
#    KY27PHY1 KY27PHY2 KY27PHY3
# 1         4        4        5
# 2         5        4        4
# 3         5        4        4
# 4         4        4        4
# 5        NA       NA       NA
# 6         4        5        5
# 7         2        2        1
# 8         3        3        4
# 9         5        5        5
# 10       NA       NA       NA
# 11        4        5        4
# 12        3        3        3
# 13        3        3        3
# 14        5        5        5
# 15        5        5        5

You can see from the structure (str) that we now have numeric output.

str(makemeNA(df, "-"))
# 'data.frame':  15 obs. of  3 variables:
#  $ KY27PHY1: int  4 5 5 4 NA 4 2 3 5 NA ...
#  $ KY27PHY2: int  4 4 4 4 NA 5 2 3 5 NA ...
#  $ KY27PHY3: int  5 4 4 4 NA 5 1 4 5 NA ...

As with na.strings, the NAStrings argument in makemeNA is plural, so it accepts several values. Here we turn both the dash and the value "1" into NA.

str(makemeNA(df, c("-", 1)))
# 'data.frame':  15 obs. of  3 variables:
#  $ KY27PHY1: int  4 5 5 4 NA 4 2 3 5 NA ...
#  $ KY27PHY2: int  4 4 4 4 NA 5 2 3 5 NA ...
#  $ KY27PHY3: int  5 4 4 4 NA 5 NA 4 5 NA ...

You can also use regular expressions to set values as NA, as below:

df1 <- data.frame(A = c(1, 2, "-", "not applicable", 5),
                 B = c("not available", 1, 2, 3, 4),
                 C = c("-", letters[1:4]))

Make any values with "not" or "-" into NA:

makemeNA(df1, "not.*|-", fixed = FALSE)
#    A  B    C
# 1  1 NA <NA>
# 2  2  1    a
# 3 NA  2    b
# 4 NA  3    c
# 5  5  4    d
str(makemeNA(df1, "not.*|-", fixed = FALSE))
# 'data.frame':  5 obs. of  3 variables:
#  $ A: int  1 2 NA NA 5
#  $ B: int  NA 1 2 3 4
#  $ C: Factor w/ 4 levels "a","b","c","d": NA 1 2 3 4
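For the fixed-string case, a shorter route may be available, assuming a reasonably recent R version in which type.convert accepts a whole data frame. The sketch below uses a made-up df_small; note that as.is = TRUE keeps strings as character rather than converting them to factors, so the result differs from the factor output shown above.

```r
# Stand-in data frame: one numeric-looking column, one text column.
df_small <- data.frame(
  A = c("1", "2", "-"),
  B = c("-", "x", "y"),
  stringsAsFactors = FALSE
)

# Treat "-" as NA and re-detect each column's type in one call.
converted <- type.convert(df_small, na.strings = "-", as.is = TRUE)
str(converted)
```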

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact yoyou2525@163.com.

 