
R: Subsetting over all data frames inside a list

I'm new to R and Stack Overflow. I'm trying to work with a list of data frames and have the following problem (I hope this is a good reproducible example). Assume I have a list of 3 data frames with 4 columns each (my real code contains 10 data frames with 20 columns):

df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)

For each data frame I have a different subsetting condition. For example:

# Subsetting each data frame individually, outside of the list
# (rows with k = 12 down to k = 1, and only the column h_1)
df1_s <- df1[which(df1$k <= 12 & df1$k > 0), "h_1"]
df2_s <- df2[which(df2$k <= 4 & df2$k > 0), "h_3"]
df3_s <- df3[which(df3$k <= 12 & df3$k > 0), "h_2"]

How can I subset the three data frames in the list in the most efficient way? I think something with lapply, with the subsetting bounds stored in a vector, would be a good approach, but I have no idea how to do it or how to subset within lists.

I hope you can help me. Before posting, I tried to find a solution in other posts that deal with subsetting data frames in lists, but those didn't work for my code.

Here's an mapply approach (same idea as the other answer):

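# function: takes a data frame and a vector = [column name, upper, lower]
# note: c() coerces the numeric bounds to character, so convert them back
rook <- function(df, par) {
  upper <- as.numeric(par[2])
  lower <- as.numeric(par[3])
  # filter rows on k, then return the requested column
  df[df$k <= upper & df$k > lower, par[1]]
}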

# list of parameters
par_list <- list(
  c('h_1', 12, 0),
  c('h_3', 4 , 0),
  c('h_2', 12, 0)
)

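# call mapply; SIMPLIFY = FALSE always returns a list,
# even when all subsets happen to have the same length
mapply(rook, df_list, par_list, SIMPLIFY = FALSE)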
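
The question mentions lapply with the bounds stored in a vector. A minimal sketch of that variant, assuming the parameters are kept in a small data frame (the name params and its columns are made up for illustration), iterates over positions with seq_along:

```r
set.seed(123)
df_list <- list(
  df1 = data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21)),
  df2 = data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21)),
  df3 = data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
)

# hypothetical parameter table: one row per data frame in df_list
params <- data.frame(
  col   = c("h_1", "h_3", "h_2"),
  upper = c(12, 4, 12),
  lower = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# iterate over positions so the i-th row of params matches the i-th data frame
res <- lapply(seq_along(df_list), function(i) {
  df <- df_list[[i]]
  keep <- df$k <= params$upper[i] & df$k > params$lower[i]
  df[keep, params$col[i]]
})
names(res) <- names(df_list)

lengths(res)  # 12, 4 and 12 values for df1, df2 and df3
```

Keeping the bounds numeric in their own columns avoids the character coercion that happens when names and numbers are combined with c().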

Here's a solution using base R. As @www mentioned, the idea is to use an apply-type function (mapply, or pmap from purrr) to pass several arguments to a function element by element. This solution also uses the eval(parse()) construct for flexible subsetting. See e.g. the discussion here: http://r.789695.n4.nabble.com/using-a-condition-given-as-string-in-subset-function-how-td1676426.html .

subset_fun <- function(data, criteria, columns) {
  subset(data, eval(parse(text = criteria)), columns)
}

criterion <- list("k <= 12 & k > 0", "k <= 4 & k > 0", "k <= 12 & k > 0")
cols <- list("h_1", "h_3", "h_2")

out <- mapply(subset_fun, df_list, criterion, cols)
str(out)
# List of 3
#  $ df1.h_1: num [1:12] -0.0589 1.0677 0.2122 1.4109 -0.6367 ...
#  $ df2.h_3: num [1:4] -0.826 -1.506 -1.551 0.862
#  $ df3.h_2: num [1:12] 0.8948 0.0305 0.9131 -0.0219 0.2252 ...
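
If the conditions do not have to arrive as strings, a possible alternative to eval(parse()) is to pass each condition as a function of the data frame. The sketch below is an assumption about how that could look, not part of the original answer; subset_pred, conds and out2 are illustrative names.

```r
set.seed(123)
df_list <- list(
  df1 = data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21)),
  df2 = data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21)),
  df3 = data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
)

# each condition is a function returning one logical value per row
conds <- list(
  function(d) d$k <= 12 & d$k > 0,
  function(d) d$k <= 4  & d$k > 0,
  function(d) d$k <= 12 & d$k > 0
)
cols <- list("h_1", "h_3", "h_2")

# hypothetical helper: keep rows where cond() is TRUE, return one column
subset_pred <- function(data, cond, column) {
  data[cond(data), column]
}

# Map never simplifies, so the result is a list named after df_list
out2 <- Map(subset_pred, df_list, conds, cols)
str(out2)
```

The trade-off is that the conditions can no longer be read from a text file or typed in by a user, which is the case the string-based version handles.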

We can use the pmap function from the purrr package. The key is to define a function that takes the k bounds and the column name as arguments, organize these arguments into a list, and then call pmap.

library(tidyverse)

# Define a function 
subset_fun <- function(dat, k1, k2, col){
  dat2 <- dat %>%
    filter(k <= k1, k > k2) %>%
    pull(col)
  return(dat2)
}

# Define lists for the function arguments
par <- list(dat = df_list,                   # List of data frames
            k1 = list(12, 4, 12),            # The first number 
            k2 = list(0, 0, 0),              # The second number
            col = list("h_1", "h_3", "h_2")) # The column name

# Apply the subset_fun
df_list2 <- pmap(par, subset_fun)
df_list2
# $df1
# [1] -0.6868529 -0.4456620  1.2240818  0.3598138  0.4007715  0.1106827 -0.5558411  1.7869131
# [9]  0.4978505 -1.9666172  0.7013559 -0.4727914
# 
# $df2
# [1] -0.9474746 -0.4905574 -0.2560922  1.8438620
# 
# $df3
# [1] -0.2803953  0.5629895 -0.3724388  0.9769734 -0.3745809  1.0527115 -1.0491770 -1.2601552
# [9]  3.2410399 -0.4168576  0.2982276  0.6365697

DATA

set.seed(123)

df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)

Consider Map, the wrapper around mapply that always returns a list, here a list of data frames. Because you subset a single column, the result would otherwise come back as a vector, so cast it back with data.frame and use setNames to restore the column name.

Here, mapply or Map, a sibling of lapply, is chosen because you want to iterate element-wise across several objects of equal length. mapply takes any number of arguments (four here), and their lengths must be equal or multiples of one another:

low_limits <- c(0, 0, 0)
high_limits <- c(12, 4, 12)
h_cols <- c("h_1", "h_2", "h_3")

subset_fct <- function(df, lo, hi, col)  
               setNames(data.frame(df[which(df$k > lo & df$k <= hi), col]), col)

new_df_list <- Map(subset_fct, df_list, low_limits, high_limits, h_cols)

# EQUIVALENT CALL
new_df_list <- mapply(subset_fct, df_list, low_limits, 
                      high_limits, h_cols, SIMPLIFY = FALSE)

Output (uses set.seed(456) at top to reproduce random numbers)

new_df_list

# $df1
#           h_1
# 1   1.0073523
# 2   0.5732347
# 3  -0.9158105
# 4   1.3110974
# 5   0.9887263
# 6   1.6539287
# 7  -1.4408052
# 8   1.9473564
# 9   1.7369362
# 10  0.3874833
# 11  2.2800340
# 12  1.5378833

# $df2
#           h_2
# 1  0.11815133
# 2  0.86990262
# 3 -0.09193621
# 4  0.06889879

# $df3
#           h_3
# 1  -1.4122604
# 2  -0.9997605
# 3  -2.3107388
# 4   0.9386188
# 5  -1.3881885
# 6  -0.6116866
# 7   0.3184948
# 8  -0.2354058
# 9   1.0750520
# 10 -0.1007956
# 11  1.0701526
# 12  1.0358389
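
As a small illustration of the length rule mentioned above, Map and mapply recycle shorter arguments, so a bound shared by every data frame can be given once. This is a sketch reusing the answer's subset_fct with the lower limit passed as a single value; the name recycled is made up for illustration.

```r
set.seed(456)
df_list <- list(
  df1 = data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21)),
  df2 = data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21)),
  df3 = data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
)

subset_fct <- function(df, lo, hi, col)
  setNames(data.frame(df[which(df$k > lo & df$k <= hi), col]), col)

# lo = 0 has length 1 and is recycled for all three data frames
recycled <- Map(subset_fct, df_list, 0, c(12, 4, 12), c("h_1", "h_2", "h_3"))

sapply(recycled, nrow)  # 12, 4 and 12 rows for df1, df2 and df3
```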
