
Summarise multiple columns with multiple functions using base R and dplyr

The data looks something like this:

> head(r)
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1

I want to apply several functions to each column. What I currently have is this function:

analysis = function(df){
  # Start with an empty data frame, then append one row of statistics per column
  measurements = data.frame(attribute = character(),
                            mean = double(),
                            median = double(),
                            variance = double(),
                            IQR = double())
  for (i in seq_len(ncol(df))){
    names = colnames(df)[i]
    temp = data.frame(attribute = names,
                      mean = mean(df[,i]),
                      median = median(df[,i]),
                      variance = var(df[,i]),
                      IQR = IQR(df[,i]))
    measurements = rbind(measurements, temp)
  }
  return (measurements)
}

It works well and gives the output I want:

  attribute         mean      median     variance          IQR
1      area 7187.7291667 7487.000000 7.203045e+06 3564.2500000
2      peri 2682.2119375 2536.195000 2.049654e+06 2574.6150000
3     shape    0.2181104    0.198862 6.971657e-03    0.1004083
4      perm  415.4500000  130.500000 1.916848e+05  701.0500000

However, my supervisor said it is not efficient and does not follow the R way of thinking. I also tried summarise_each() and summarise_all(r, funs(mean, median, var, IQR)), but they don't give what I want and the output doesn't look nice.

What are some other ways to achieve that output using only base R or dplyr?

I suspect your supervisor's comment about 'R'-style thinking was about the for loop. Almost any for loop you write can be replaced by one of the apply family of functions (e.g. apply, sapply, lapply).

They make it easier to run functions on vectors/data.frames/lists/etc.

Everything you could do using apply functions could be replicated in for loops (usually with similar performance) so using for loops isn't actually a cardinal sin. Why use apply functions? Well... once you learn them you get more succinct code which returns the results of running your functions on your data. Before long, you'll find this sort of code very intuitive, and even more readable than for loops.
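
As a small illustration of that trade-off, the sketch below computes column means twice, once with a for loop and once with vapply(). The data frame toy and its values are made up for this example and are not from the question.

```r
# Hypothetical toy data, only for illustration
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# for loop: pre-allocate the result, then fill it column by column
means_loop <- numeric(ncol(toy))
names(means_loop) <- colnames(toy)
for (i in seq_along(toy)) {
  means_loop[i] <- mean(toy[[i]])
}

# vapply: same computation in one call;
# FUN.VALUE declares that each column yields a single number
means_apply <- vapply(toy, mean, FUN.VALUE = numeric(1))

# Both approaches should agree
all.equal(means_loop, means_apply)
```

The loop needs explicit bookkeeping for the result vector and its names, while vapply() handles both.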

Base R

df <- data.frame(
  area = c(4990, 7002, 7558, 7352, 7943),
  peri = c(2791.9, 3892.6, 3930.66, 3869.32, 3948.54),
  shape = c(.0903296, .148622, .183312, .117063, .122417),
  perm = c(6.3, 6.3, 6.3, 6.3, 17.1)
)

sapply(df, function(x) c(mean=mean(x), median=median(x), var=var(x), IQR=IQR(x)))
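
sapply() returns a matrix with one column per variable and one row per statistic. To get the layout from the question (one row per variable, with an attribute column), one option is to transpose the matrix and convert it to a data frame. A minimal sketch, with the five-row df repeated so the block runs on its own; the names stats and res are only illustrative:

```r
df <- data.frame(
  area = c(4990, 7002, 7558, 7352, 7943),
  peri = c(2791.9, 3892.6, 3930.66, 3869.32, 3948.54),
  shape = c(.0903296, .148622, .183312, .117063, .122417),
  perm = c(6.3, 6.3, 6.3, 6.3, 17.1)
)

# Matrix: statistics in rows, variables in columns
stats <- sapply(df, function(x) c(mean = mean(x), median = median(x),
                                  variance = var(x), IQR = IQR(x)))

# Transpose so each variable becomes a row, then add the variable name
# as a regular column; row.names = NULL resets row names to 1, 2, ...
res <- data.frame(attribute = colnames(stats), t(stats), row.names = NULL)
res
```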

Your results can also be achieved using base::Map:

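f <- function(x) {
  # summary() returns Min., 1st Qu., Median, Mean, 3rd Qu. and Max.
  desc = base::summary(x)
  c(
    Mean = unname(desc['Mean']),
    Median = unname(desc['Median']),
    # Sample variance computed by hand; equivalent to var(x)
    Variance = base::sum((x - desc['Mean'])^2) / (length(x) - 1),
    IQR = unname(desc['3rd Qu.'] - desc['1st Qu.'])
  )
}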

t(as.data.frame(base::Map(f, df)))
#               Mean       Median     Variance          IQR
# area  7137.3333333 7455.0000000 1.241980e+06 757.25000000
# peri  3740.5283333 3911.6300000 2.183447e+05  68.93000000
# shape    0.1381314    0.1355195 1.192633e-03   0.04403775
# perm     9.9000000    6.3000000 3.110400e+01   8.10000000

Apologies

Data (six rows, used for the output above):

df <- data.frame(
  area = c(4990, 7002, 7558, 7352, 7943, 7979),
  peri = c(2791.9, 3892.6, 3930.66, 3869.32, 3948.54, 4010.15),
  shape = c(.0903296, .148622, .183312, .117063, .122417, .167045),
  perm = c(6.3, 6.3, 6.3, 6.3, 17.1, 17.1)
)
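
Since the question also asks about dplyr, here is a possible sketch that uses only dplyr functions. It assumes dplyr 1.0.0 or later for across(); funs() from the question's attempt has been deprecated since dplyr 0.8.0. The names res and wide are illustrative, and the six-row df is repeated so the block runs on its own.

```r
library(dplyr)

df <- data.frame(
  area = c(4990, 7002, 7558, 7352, 7943, 7979),
  peri = c(2791.9, 3892.6, 3930.66, 3869.32, 3948.54, 4010.15),
  shape = c(.0903296, .148622, .183312, .117063, .122417, .167045),
  perm = c(6.3, 6.3, 6.3, 6.3, 17.1, 17.1)
)

# Long layout (one row per column of df): build a one-row data frame
# per column, then stack them; .id stores the list names (column names)
res <- bind_rows(
  lapply(df, function(x) {
    data.frame(mean = mean(x), median = median(x),
               variance = var(x), IQR = IQR(x))
  }),
  .id = "attribute"
)
res

# Wide layout: a single row with columns such as area_mean, area_median, ...
wide <- summarise(df, across(everything(),
                             list(mean = mean, median = median,
                                  variance = var, IQR = IQR)))
wide
```

The long layout matches the table in the question; the wide layout is what across() gives directly and would need reshaping to get there.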

Hope that's useful.


 