
Grouped function across multiple columns

I'm trying to find the minimum value of each of several columns, grouped by a factor, and then subtract that group minimum from the values in the original data frame. Say I have this data:

testdata <-  data.frame(
  category=factor(rep(c("a","j"),each=6,times=8)), 
  num1=(sample(0:15, 96, replace=TRUE)) + 5, 
  num2=(seq(1:96))
)

I am looking to find the minimum value of columns num1 and num2 within each 'category' (a and j). In real life, my factor variable is more complex and I have a large number of numeric variables.

The best I could do is something like this:

test2 <- by(testdata, testdata[,"category"], function(x){
  y <- as.data.frame(apply(x[, c(2:3)], 2, min))
})

And bringing it back together:

test3 <- do.call(rbind, lapply(test2, data.frame, stringsAsFactors=FALSE))

Which seems to work, but I'm a little stuck on how to subtract that minimum value by group. A rough idea of what I want to accomplish with sqldf:

testdata4 <- sqldf("select a.category, 
                   a.num1-b.num1 as num1, 
                   a.num2-b.num2 as num2 
                   from testdata a left join test3 b 
                   on a.category = b.category")

However, I don't want to have to specify each new variable by hand. Any thoughts?

Using tidyverse:

library(tidyverse)
# Use set.seed(x) before generating data for future Q's to allow easy checks
#   of the desired output
set.seed(123)

testdata <-  data.frame(
    category=factor(rep(c("a","j"),each=6,times=8)), 
    num1=(sample(0:15, 96, replace=TRUE)) + 5, 
    num2=(seq(1:96))
)

# Generate those same minimums (note that you don't have to do this, just
# showing that you get the same results as your original code)
testdata %>%
    group_by(category) %>%
    summarize(num1 = min(num1), num2 = min(num2))

# Subtract them from the actual data
testdata %>%
    group_by(category) %>%
    mutate(num1_normed = num1 - min(num1),
           num2_normed = num2 - min(num2))

Or if you have lots of columns and want to automatically apply this to all of them:

# Applies the function to all columns except 'category', the group_by column
testdata %>%
    group_by(category) %>%
    mutate_all(function(x) { x - min(x)})
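If some of the non-grouping columns are not numeric, x - min(x) will fail on them. A minimal sketch of one way around that, assuming dplyr 1.0.0 or later (where across() is available and mutate_all() is documented as superseded); the toy data frame and its label column are hypothetical and only there for illustration:

```r
library(dplyr)

# Hypothetical toy data: one grouping column, two numeric columns,
# and one character column that should be left untouched
toy <- data.frame(
  category = c("a", "a", "j", "j"),
  num1     = c(7, 5, 12, 10),
  num2     = c(1, 2, 3, 4),
  label    = c("w", "x", "y", "z")
)

# Subtract the per-group minimum from every numeric column only;
# the grouping column is excluded from across() automatically
normed <- toy %>%
  group_by(category) %>%
  mutate(across(where(is.numeric), ~ .x - min(.x))) %>%
  ungroup()

normed
```

Here num1 becomes 2, 0, 2, 0 and num2 becomes 0, 1, 0, 1, while label is passed through unchanged.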

Here are some approaches using only base R. The ave approach maintains the order of rows.

1) by: Use by as in the question, but with sweep:

Sweep <- function(x) cbind(x[1], sweep(x[-1], 2, apply(x[-1], 2, min), "-"))
do.call("rbind", by(testdata, testdata[[1]], Sweep))
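To make the role of sweep in the Sweep helper clearer, here is a small sketch on a hypothetical 3 x 2 matrix (the values are made up for illustration). With MARGIN = 2, sweep subtracts the i-th statistic from every element of the i-th column:

```r
# Hypothetical 3 x 2 matrix, filled column by column
m <- matrix(c(5, 7, 9,
              10, 40, 20),
            ncol = 2,
            dimnames = list(NULL, c("num1", "num2")))

# One minimum per column
col_mins <- apply(m, 2, min)

# Subtract each column's minimum from that column
shifted <- sweep(m, 2, col_mins, "-")
shifted
```

Each column of shifted now has a minimum of zero. Inside by, the same operation is applied to one group's rows at a time.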

2) ave: Use lapply to apply ave to every column except the first, with x - min(x) as the function. This gives a list of columns L. Then, since ave maintains row order, the second line replaces the original columns with their modified versions.

L <- lapply(testdata[-1], function(x) ave(x, testdata[[1]], FUN = function(x) x - min(x)))
replace(testdata, -1, L)
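As a rough check of the row-order remark, the sketch below runs both base R approaches on the example data and compares them. The names res_by, res_ave and sorted_ave are introduced here for illustration only:

```r
set.seed(123)
testdata <- data.frame(
  category = factor(rep(c("a", "j"), each = 6, times = 8)),
  num1 = sample(0:15, 96, replace = TRUE) + 5,
  num2 = 1:96
)

# Approach 1: by + sweep (rows come back grouped by category)
Sweep <- function(x) cbind(x[1], sweep(x[-1], 2, apply(x[-1], 2, min), "-"))
res_by <- do.call("rbind", by(testdata, testdata[[1]], Sweep))

# Approach 2: ave (rows stay in their original positions)
L <- lapply(testdata[-1], function(x) ave(x, testdata[[1]], FUN = function(x) x - min(x)))
res_ave <- replace(testdata, -1, L)

# The ave result keeps the original category sequence; the by result does not
keeps_order_ave <- identical(as.character(res_ave$category), as.character(testdata$category))
keeps_order_by  <- identical(as.character(res_by$category), as.character(testdata$category))

# After sorting the ave result by category, the two results hold the same values
sorted_ave <- res_ave[order(res_ave$category), ]
same_values <- all(mapply(function(u, v) all(u == v), res_by, sorted_ave))
```

With this data, keeps_order_ave is TRUE and keeps_order_by is FALSE, while same_values is TRUE, so the two approaches differ only in row order (and in row names).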
