简体   繁体   中英

Find max/min for column with pattern in R

I am having issues developing a data.table that gives me max/min based on multiple columns that share a name pattern.

This is a simplified table:

int <- seq(as.POSIXct("2016-04-08"), as.POSIXct("2016-04-10"), by="6 h")
df <- data.frame(date = int, x_01 = runif(9), x_02 = runif(9), x_10 = runif(9), b_31 = runif(9))
df$date <- format(as.POSIXct(df$date), format = "%Y-%m-%dM")

I know how to obtain summary statistics by applying the following code:

sum <- setDT(df)[, list(x_01min=min(x_01), x_01max=max(x_01)), by=list(date)]

My goal is to get summary statistics for all columns with the pattern "x_" I have attempted nesting for loops and using lapply with grep but cannot seems to get the desired results. The code below should show what I am trying to get at.

sum <- setDT(df)[, list(x_01min=min(x_01), x_01max=max(x_01),
                    x_02min=min(x_02), x_02max=max(x_02),
                    x_10min=min(x_10), x_10max=max(x_10)), by=list(date)]

Ideally column names for the summary table should incorporate names from the original table. My actual data set consists of multiple data frames with different number of columns matching the pattern. As I gather more data new variables will be added so it is important to be able to execute the function based on the colname pattern.

Your help is appreciated!

library(data.table);
setDT(df); ## ensure df is a data.table

cns <- grep(value=T,'^x_',names(df));
df[,do.call(c,lapply(cns,function(cn) { x <- get(cn); setNames(nm=paste0(cn,c('min','max')),.(min(x),max(x))); })),.(date)];
##           date   x_01min   x_01max    x_02min   x_02max    x_10min    x_10max
## 1: 2016-04-08M 0.2655087 0.9082078 0.06178627 0.6870228 0.21214252 0.93470523
## 2: 2016-04-09M 0.2016819 0.9446753 0.38410372 0.7698414 0.12555510 0.65167377
## 3: 2016-04-10M 0.6291140 0.6291140 0.99190609 0.9919061 0.01339033 0.01339033

First, the target column names are derived by calling grep() with the value=T argument. These names are stored in cns in the global environment.

Then, the data.table is indexed, grouping on date .

For each group, lapply() is executed over the cns vector, taking the current column name as parameter cn .

Within the lambda, the column vector is retrieved and stored in a local variable x by calling get() on cn , which works because a data.table's columns are always visible to the j argument expression.

Finally, the summary statistics are computed in a list using .() , and their names are set using setNames() which allows us to dynamically compute them from cn with paste0() .

The result of the lapply() call will be a list of lists, but because we need to produce a single non-nested list for the group aggregation result, we have to run it through do.call(c,...) to unnest the nested lists. An alternative here would be unlist(recursive=F,...) . Both alternatives preserve the names of the nested lists, which is what we want.


Benchmarking

library(data.table);
library(microbenchmark);

bgoldst <- function(df) { cns <- grep(value=T,'^x_',names(df)); df[,do.call(c,lapply(cns,function(cn) { x <- get(cn); setNames(nm=paste0(cn,c('min','max')),.(min(x),max(x))); })),.(date)]; };
kunal <- function(df) { indices <- grep('x_',colnames(df)); col_names <- colnames(df)[indices]; query_min <- paste0(col_names,'min=min(',col_names,')'); query_max <- paste0(col_names,'max=max(',col_names,')'); query_1 <- paste(c(query_min,query_max),collapse=','); eval(parse(text=paste0('df[,.(',query_1,'),by=date]'))); };
psidom <- function(df) { cols <- names(df)[grepl('x_',names(df))]; newCols <- paste0(rep(cols,each=2),c('max','min')); sumFun <- function(col) list(max(col),min(col)); df[,c(newCols):=unlist(lapply(.SD,sumFun),recursive=F),.(date),.SDcols=cols]; unique(df[,.SD,.SDcols=c('date',newCols)]); };

set.seed(1L);
int <- seq(as.POSIXct('2016-04-08'),as.POSIXct('2016-04-10'),by='6 h');
df <- data.frame(date=int,x_01=runif(9L),x_02=runif(9L),x_10=runif(9L),b_31=runif(9L));
df$date <- format(as.POSIXct(df$date),format='%Y-%m-%dM');
setDT(df);

expected <- bgoldst(copy(df)); co <- names(expected);
identical(expected,kunal(copy(df))[,co,with=F]);
## [1] TRUE
identical(expected,psidom(copy(df))[,co,with=F]);
## [1] TRUE

microbenchmark(bgoldst(copy(df)),kunal(copy(df)),psidom(copy(df)));
## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max neval
##  bgoldst(copy(df)) 1.397569 1.445893 1.522512 1.490369 1.538908 2.749805   100
##    kunal(copy(df)) 1.318453 1.362287 1.483356 1.403555 1.443968 4.733684   100
##   psidom(copy(df)) 1.451881 1.532920 1.625494 1.573120 1.624010 3.097487   100

set.seed(1L);
NR <- 500L; NC <- 100L;
df <- data.frame(
    date=seq(as.POSIXct('2016-04-08'),by='6 h',len=NR),
    setNames(nm=paste0('x_',seq_len(NC)),as.data.frame(replicate(NC,runif(NR)))),
    b_31=runif(NR)
);
df$date <- format(as.POSIXct(df$date),format='%Y-%m-%dM');
setDT(df);

expected <- bgoldst(copy(df)); co <- names(expected);
identical(expected,kunal(copy(df))[,co,with=F]);
## [1] TRUE
identical(expected,psidom(copy(df))[,co,with=F]);
## [1] TRUE

microbenchmark(bgoldst(copy(df)),kunal(copy(df)),psidom(copy(df)));
## Unit: milliseconds
##               expr      min        lq      mean    median        uq       max neval
##  bgoldst(copy(df)) 94.75322 100.94627 106.61343 102.37655 105.89292 164.58885   100
##    kunal(copy(df)) 21.38946  23.04383  24.60639  24.20192  25.18723  69.29593   100
##   psidom(copy(df)) 45.32431  48.76798  50.63476  49.60532  51.03667  92.41567   100

You may try this code:

## building the data.table
int <- seq(as.POSIXct("2016-04-08"), as.POSIXct("2016-04-10"), by="6 h")
df <- data.frame(date = int, x_01 = runif(9), x_02 = runif(9), x_10 = runif(9), b_31 = runif(9))
df$date <- format(as.POSIXct(df$date), format = "%Y-%m-%dM")

## actual work begins here
library(data.table)
setDT(df)

indices <- grep("x_",colnames(df))

col_names <- colnames(df)[indices]

query_min <- paste0(col_names,'min=min(',col_names,')')

query_max <- paste0(col_names,'max=max(',col_names,')')

query_1 <- paste(c(query_min,query_max),collapse=',')

eval(parse(text=paste0("df[,.(",query_1,"),by=date]")))

##          date    x_01min     x_02min   x_10min   x_01max     x_02max   x_10max
##1: 2016-04-08M 0.07527176 0.026276086 0.3315467 0.9404001 0.906662120 0.7069425
##2: 2016-04-09M 0.34796983 0.065390319 0.2437374 0.8130796 0.739978420 0.6760062
##3: 2016-04-10M 0.45671821 0.003374905 0.7245515 0.4567182 0.003374905 0.7245515
cols <- names(df)[grepl("x_", names(df))]
newCols <- paste0(rep(cols, each = 2), c("max", "min"))
sumFun <- function(col) list(max(col), min(col))
setDT(df)[, c(newCols) := unlist(lapply(.SD, sumFun), recursive = F), .(date), .SDcols = cols]
sum <- unique(df[, .SD, .SDcols = c("date", newCols)])
> sum
          date   x_01max   x_01min    x_02max     x_02min   x_10max   x_10min
1: 2016-04-08M 0.8770486 0.1828969 0.99869872 0.159936264 0.8983131 0.3767007
2: 2016-04-09M 0.6475017 0.1429131 0.57890510 0.007439883 0.9242098 0.1077233
3: 2016-04-10M 0.9176341 0.9176341 0.05900942 0.059009423 0.2717861 0.2717861

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM