简体   繁体   中英

R: summarise multiple column (numeric, character) and remove NAs

I have a data.frame with many columns (~50). Some of them are character, some are numeric and 3 of them I use for grouping.

I need to:

  • remove NAs from numeric columns
  • calculate the mean of each of the numeric columns
  • extract the first element of the character columns

Let's say, we're using modified iris data as below:

data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris[1,] <- NA ## introducing NAs

I have ~50 columns in total, numeric and character mixed together. I've been trying something like:

giris <- group_by(iris, Species, year)
cls <- unlist(sapply(giris, class)) ## find out classes
action <- ifelse(cls == "numeric", "mean", "first")
action <- paste(action)
summarise_each(giris, action)

What I get is means for all columns in a group followed by columns with the first values in respective group. And NAs are not handled... Which is not exactly what I seek...

Help anyone?

You could try this with an if / else in the funs of summarise_each :

iris %>% 
  group_by(Species, year) %>% 
  summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))

Since you have some NA's also in grouping columns, you could add a filter statement:

iris %>% 
  filter(!is.na(Species) & !is.na(year)) %>% 
  group_by(Species, year) %>% 
  summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))
#Source: local data frame [6 x 7]
#Groups: Species [?]
#
#     Species  year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#      (fctr) (dbl)        (dbl)       (dbl)        (dbl)       (dbl) (chr)
#1     setosa  2000        5.025    3.479167       1.4625       0.250   red
#2     setosa  3000        4.984    3.376000       1.4640       0.244   red
#3 versicolor  2000        6.012    2.776000       4.3120       1.344 green
#4 versicolor  3000        5.860    2.764000       4.2080       1.308 green
#5  virginica  2000        6.576    2.928000       5.6400       2.044  blue
#6  virginica  3000        6.600    3.020000       5.4640       2.008  blue

To avoid potential NA's in the color column (or any non-numeric columns), you could modify it to first(na.omit(.)) .


You could also try data.table :

library(data.table)
setDT(iris)
iris[!is.na(Species) & !is.na(year), lapply(.SD, function(x) {
     if(is.numeric(x)) mean(x, na.rm = TRUE) else x[!is.na(x)][1L]}), 
     by = list(Species, year)]
#      Species year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#1:     setosa 2000        5.025    3.479167       1.4625       0.250   red
#2:     setosa 3000        4.984    3.376000       1.4640       0.244   red
#3: versicolor 2000        6.012    2.776000       4.3120       1.344 green
#4: versicolor 3000        5.860    2.764000       4.2080       1.308 green
#5:  virginica 2000        6.576    2.928000       5.6400       2.044  blue
#6:  virginica 3000        6.600    3.020000       5.4640       2.008  blue

I give it a try:

1. For the first point you mention, I would do something like the following (which isn't necessary for the second point):

na.omit(iris[ , which(sapply(iris, class) == "numeric")])

To separate the columns bei either numeric or character , I use the following:

iris[ , which(sapply(iris, class) == "numeric")]
iris[ , which(sapply(iris, class) == "character")]

2. The second task I combine the above line with colMeans :

colMeans(iris[ , which(sapply(iris, class) == "numeric")], na.rm = TRUE)

3. To extract the first element of the character columns, you could simply do:

iris[1, which(sapply(iris, class) == "character")]

In the mentioned case for the iris data, the first row is completly NA, even the character columns, so I would iterate to find the first-non-NA-row

k <- 1
while(any(is.na(FirstCharacterElement <- iris[k, which(sapply(iris, class) == "character")]))){
  k <- k + 1
}

Be careful about the class factor (which break up the code in the case of the iris data where column Species is of class factor and you maybe expect it to be a character column. You can check this with sapply(iris, class) and change it with eg

iris$Species <- as.character(iris$Species) #or with similar column names

When you read in the data you can mention the parameter stringsAsFactors = FALSE of functions read.table , read.csv or similar.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM