R: Subsetting all data frames within a list

Question

I'm new to R and Stack Overflow. I'm trying to work with a list of data frames and have the following problem (I hope this is a good reproducible example). Assume I have a list of 3 data frames with 4 columns each (my real code contains 10 data frames with 20 columns):

df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)

For each data frame I have a different condition for subsetting:

For example:

# If I were to subset them individually, outside of the list:

# Take only the rows from k = 12 down to k = 1, and only the column h_1
df1_s <- df1[which(df1$k <= 12 & df1$k > 0), "h_1"]
df2_s <- df2[which(df2$k <= 4 & df2$k > 0), "h_3"]
df3_s <- df3[which(df3$k <= 12 & df3$k > 0), "h_2"]

How can I subset the three data frames in the list in the most efficient way? I think something with lapply, with the subsetting numbers stored in a vector, would be a good approach, but I have no idea how to do it or how to subset within lists.

I hope you can help me. Before posting, I tried to find a solution in other posts that deal with subsetting data frames in lists, but those didn't work for my code.

Answer 1

Here's an mapply approach (same idea as the other answer):

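# function: takes a data frame and a parameter list = (column name, upper, lower)
# rows are selected on k (upper >= k > lower), then the named column is returned
rook <- function(df, par) {
  keep <- df$k <= par[[2]] & df$k > par[[3]]
  df[keep, par[[1]]]
}

# list of parameters; list() rather than c() keeps the limits numeric
par_list <- list(
  list('h_1', 12, 0),
  list('h_3', 4 , 0),
  list('h_2', 12, 0)
)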

# call mapply
mapply(rook, df_list, par_list)
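The question mentions lapply with the subsetting numbers kept in a vector. A minimal sketch of that variant follows, assuming the same df_list layout as in the question. The helper make_df, the lookup table params, and the seed are hypothetical names and values introduced only for illustration.

```r
set.seed(1)  # arbitrary seed, assumed here only for reproducibility

# Hypothetical helper that builds one data frame shaped like those in the question
make_df <- function() {
  data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
}
df_list <- list(df1 = make_df(), df2 = make_df(), df3 = make_df())

# One row of parameters per data frame (hypothetical lookup table)
params <- data.frame(
  col   = c("h_1", "h_3", "h_2"),
  upper = c(12, 4, 12),
  lower = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Iterate over positions so each data frame is paired with its own parameters
res <- lapply(seq_along(df_list), function(i) {
  df   <- df_list[[i]]
  keep <- df$k <= params$upper[i] & df$k > params$lower[i]
  df[keep, params$col[i]]
})
names(res) <- names(df_list)

lengths(res)
```

Keeping the parameters in a table makes it easy to extend to 10 data frames: add rows rather than more arguments.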

Answer 2

Here's a solution using base R. As @www mentioned, the idea is to use an apply-type function (mapply, or pmap from purrr) to apply multiple arguments to a function in sequence. This solution also makes use of the eval-parse construct to do flexible subsetting. See, e.g., the discussion here: http://r.789695.n4.nabble.com/using-a-condition-given-as-string-in-subset-function-how-td1676426.html .

subset_fun <- function(data, criteria, columns) {
  subset(data, eval(parse(text = criteria)), columns)
}

criterion <- list("k <= 12 & k > 0", "k <= 4 & k > 0", "k <= 12 & k > 0")
cols <- list("h_1", "h_3", "h_2")

out <- mapply(subset_fun, df_list, criterion, cols)
str(out)
# List of 3
#  $ df1.h_1: num [1:12] -0.0589 1.0677 0.2122 1.4109 -0.6367 ...
#  $ df2.h_3: num [1:4] -0.826 -1.506 -1.551 0.862
#  $ df3.h_2: num [1:12] 0.8948 0.0305 0.9131 -0.0219 0.2252 ...
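If building conditions from strings feels fragile, one possible alternative is to pass each condition as a function instead of text. This is only a sketch under that assumption, not part of the answer above; make_df, criteria, and subset_by are hypothetical names, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, assumed here only for reproducibility

# Hypothetical helper that builds one data frame shaped like those in the question
make_df <- function() {
  data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
}
df_list <- list(df1 = make_df(), df2 = make_df(), df3 = make_df())

# Each criterion is a function of the data frame that returns a logical vector
criteria <- list(
  function(d) d$k <= 12 & d$k > 0,
  function(d) d$k <= 4  & d$k > 0,
  function(d) d$k <= 12 & d$k > 0
)
cols <- c("h_1", "h_3", "h_2")

# Apply the row filter, then pick the requested column
subset_by <- function(data, keep_fun, column) data[keep_fun(data), column]

# Map always returns a list, named after the first list argument
out <- Map(subset_by, df_list, criteria, cols)
str(out)
```

Functions keep the flexibility of arbitrary conditions while avoiding parsing text at run time.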

Answer 3

We can use the pmap function from the purrr package. The key is to define a function that takes the k limits and the column name as arguments, organize these arguments in a list, and then call pmap.

library(tidyverse)

# Define a function 
subset_fun <- function(dat, k1, k2, col){
  dat2 <- dat %>%
    filter(k <= k1, k > k2) %>%
    pull(col)
  return(dat2)
}

# Define lists for the function arguments
par <- list(dat = df_list,                   # List of data frames
            k1 = list(12, 4, 12),            # The first number 
            k2 = list(0, 0, 0),              # The second number
            col = list("h_1", "h_3", "h_2")) # The column name

# Apply the subset_fun
df_list2 <- pmap(par, subset_fun)
df_list2
# $df1
# [1] -0.6868529 -0.4456620  1.2240818  0.3598138  0.4007715  0.1106827 -0.5558411  1.7869131
# [9]  0.4978505 -1.9666172  0.7013559 -0.4727914
# 
# $df2
# [1] -0.9474746 -0.4905574 -0.2560922  1.8438620
# 
# $df3
# [1] -0.2803953  0.5629895 -0.3724388  0.9769734 -0.3745809  1.0527115 -1.0491770 -1.2601552
# [9]  3.2410399 -0.4168576  0.2982276  0.6365697

DATA

set.seed(123)

df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)

Answer 4

Consider Map, the wrapper to mapply that always returns a list, here a list of data frames. Because you subset a single column, the result would otherwise come back as a vector, so cast it back with data.frame and use setNames to rename the column.

Here, mapply or Map, a sibling of lapply, is chosen because you want to iterate element-wise across several objects of equal length. mapply takes any number of arguments, four here, and expects their lengths to be equal or the longest length to be a multiple of the shorter ones (shorter arguments are recycled):

low_limits <- c(0, 0, 0)
high_limits <- c(12, 4, 12)
h_cols <- c("h_1", "h_2", "h_3")

subset_fct <- function(df, lo, hi, col)  
               setNames(data.frame(df[which(df$k > lo & df$k <= hi), col]), col)

new_df_list <- Map(subset_fct, df_list, low_limits, high_limits, h_cols)

# EQUIVALENT CALL
new_df_list <- mapply(subset_fct, df_list, low_limits, 
                      high_limits, h_cols, SIMPLIFY = FALSE)

Output (uses set.seed(456) at top to reproduce random numbers)

new_df_list

# $df1
#           h_1
# 1   1.0073523
# 2   0.5732347
# 3  -0.9158105
# 4   1.3110974
# 5   0.9887263
# 6   1.6539287
# 7  -1.4408052
# 8   1.9473564
# 9   1.7369362
# 10  0.3874833
# 11  2.2800340
# 12  1.5378833

# $df2
#           h_2
# 1  0.11815133
# 2  0.86990262
# 3 -0.09193621
# 4  0.06889879

# $df3
#           h_3
# 1  -1.4122604
# 2  -0.9997605
# 3  -2.3107388
# 4   0.9386188
# 5  -1.3881885
# 6  -0.6116866
# 7   0.3184948
# 8  -0.2354058
# 9   1.0750520
# 10 -0.1007956
# 11  1.0701526
# 12  1.0358389
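The recycling rule mentioned above can be seen in a small sketch: since every lower limit is 0, a single value can be passed and Map reuses it for each data frame. The helper make_df and the object names recycled and explicit are hypothetical, and this block is not claimed to reproduce the exact numbers printed above.

```r
set.seed(456)  # seed assumed for illustration

# Hypothetical helper that builds one data frame shaped like those in the question
make_df <- function() {
  data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
}
df_list <- list(df1 = make_df(), df2 = make_df(), df3 = make_df())

# Same subsetting function as in the answer: filter on k, keep one named column
subset_fct <- function(df, lo, hi, col)
  setNames(data.frame(df[which(df$k > lo & df$k <= hi), col]), col)

h_cols <- c("h_1", "h_2", "h_3")

# Lower limit given once; Map recycles it across the three data frames
recycled <- Map(subset_fct, df_list, 0, c(12, 4, 12), h_cols)

# Lower limit spelled out for every data frame
explicit <- Map(subset_fct, df_list, c(0, 0, 0), c(12, 4, 12), h_cols)

identical(recycled, explicit)
sapply(recycled, nrow)
```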

4 answers

Answer 1
2 2018-04-07 20:42:51

Answer 2
2 2018-04-07 20:43:06

Answer 3
1 2018-04-07 20:37:11

Answer 4
1 2018-04-07 20:53:17
