
R: Subsetting over all data frames inside a list

I'm new to R and Stack Overflow. I'm trying to work with a list of data frames and have the following problem (I hope this is a good reproducible example). Assume I have a list of 3 data frames with 4 columns each (my real code contains 10 data frames with 20 columns):

df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)

For each data frame I have a different subsetting condition:

For example:

# If I subset them one at a time, outside of the list

# Take only the rows k = 12 to k = 1, and only the column h_1
df1_s <- df1[which(df1$k <= 12 & df1$k > 0), "h_1"]
df2_s <- df2[which(df2$k <= 4 & df2$k > 0), "h_3"]
df3_s <- df3[which(df3$k <= 12 & df3$k > 0), "h_2"]

How can I subset the three data frames in the list in the most efficient way? I think something with lapply, with the subsetting numbers stored in a vector, would be a good approach, but I have no idea how to do it or how to subset within lists.

I hope you can help me. Before posting, I tried to find a solution in other posts that deal with subsetting data frames in lists, but they don't work for my code.

Here's an mapply approach (same idea as the other answer):

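# function: takes a data frame and a list = (column name, upper, lower)
rook <- function(df, par) {
  # keep the rows where lower < k <= upper
  keep <- df$k <= par[[2]] & df$k > par[[3]]
  df[keep, par[[1]]]
}

# list of parameters; each set is a list(), because c() would coerce
# the numbers to character and break the comparisons
par_list <- list(
  list('h_1', 12, 0),
  list('h_3', 4 , 0),
  list('h_2', 12, 0)
)

# call mapply; SIMPLIFY = FALSE always returns a list
mapply(rook, df_list, par_list, SIMPLIFY = FALSE)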
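
The question mentions lapply with the subsetting numbers stored in a vector. A minimal sketch of that idea, assuming the parameters are kept in a small lookup table whose rows are in the same order as df_list (the names make_df and params are made up for this illustration):

```r
set.seed(123)
# Hypothetical helper that builds one data frame shaped like those in the question
make_df <- function() {
  data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
}
df_list <- list(df1 = make_df(), df2 = make_df(), df3 = make_df())

# Hypothetical parameter table: one row per data frame, same order as df_list
params <- data.frame(
  col   = c("h_1", "h_3", "h_2"),
  upper = c(12, 4, 12),
  lower = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Walk the positions 1..3 and subset each data frame with its own row of params
res <- lapply(seq_along(df_list), function(i) {
  df <- df_list[[i]]
  keep <- df$k <= params$upper[i] & df$k > params$lower[i]
  df[keep, params$col[i]]
})
names(res) <- names(df_list)

lengths(res)
# df1 df2 df3
#  12   4  12
```

Iterating over positions rather than over the data frames themselves is what lets a single lapply reach both the data and its matching parameters.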

Here's a solution using base R. As @www mentioned, the idea is to use an apply-type function (mapply, or pmap from purrr) to pass multiple arguments to a function in sequence. This solution also uses the eval-parse construct for flexible subsetting. See e.g. the discussion at http://r.789695.n4.nabble.com/using-a-condition-given-as-string-in-subset-function-how-td1676426.html .

subset_fun <- function(data, criteria, columns) {
  subset(data, eval(parse(text = criteria)), columns)
}

criterion <- list("k <= 12 & k > 0", "k <= 4 & k > 0", "k <= 12 & k > 0")
cols <- list("h_1", "h_3", "h_2")

out <- mapply(subset_fun, df_list, criterion, cols)
str(out)
# List of 3
#  $ df1.h_1: num [1:12] -0.0589 1.0677 0.2122 1.4109 -0.6367 ...
#  $ df2.h_3: num [1:4] -0.826 -1.506 -1.551 0.862
#  $ df3.h_2: num [1:12] 0.8948 0.0305 0.9131 -0.0219 0.2252 ...
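
If building conditions from strings feels fragile, one possible alternative is to pass each condition as a function instead. This is only a sketch, not part of the answer above; subset_by and conditions are hypothetical names, and each condition is assumed to take a data frame and return a logical vector:

```r
set.seed(123)
df1 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df2 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df3 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df_list <- list(df1 = df1, df2 = df2, df3 = df3)

# Hypothetical helper: keep_rows is a function(data frame) -> logical vector
subset_by <- function(data, keep_rows, column) {
  data[keep_rows(data), column]
}

# One row condition per data frame, written as ordinary R functions
conditions <- list(
  function(d) d$k <= 12 & d$k > 0,
  function(d) d$k <= 4  & d$k > 0,
  function(d) d$k <= 12 & d$k > 0
)
cols <- list("h_1", "h_3", "h_2")

# Map always returns a list and keeps the names of df_list
out <- Map(subset_by, df_list, conditions, cols)
str(out)
```

The conditions stay as flexible as the string version, but they are checked by the parser when the script is read rather than when each string is evaluated.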

We can use the pmap function from the purrr package. The key is to define a function that takes the k limits and the column name as arguments, organize a list with these arguments, and then use pmap.

library(tidyverse)

# Define a function 
subset_fun <- function(dat, k1, k2, col){
  dat2 <- dat %>%
    filter(k <= k1, k > k2) %>%
    pull(col)
  return(dat2)
}

# Define lists for the function arguments
par <- list(dat = df_list,                   # List of data frames
            k1 = list(12, 4, 12),            # The first number 
            k2 = list(0, 0, 0),              # The second number
            col = list("h_1", "h_3", "h_2")) # The column name

# Apply the subset_fun
df_list2 <- pmap(par, subset_fun)
df_list2
# $df1
# [1] -0.6868529 -0.4456620  1.2240818  0.3598138  0.4007715  0.1106827 -0.5558411  1.7869131
# [9]  0.4978505 -1.9666172  0.7013559 -0.4727914
# 
# $df2
# [1] -0.9474746 -0.4905574 -0.2560922  1.8438620
# 
# $df3
# [1] -0.2803953  0.5629895 -0.3724388  0.9769734 -0.3745809  1.0527115 -1.0491770 -1.2601552
# [9]  3.2410399 -0.4168576  0.2982276  0.6365697

DATA

set.seed(123)

df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)

Consider Map, the wrapper to mapply that returns a list, here a list of data frames. Because you subset a single column, which would otherwise come back as a vector, cast the result back with data.frame and use setNames to rename the column.

Here, mapply or Map, siblings of lapply, is chosen because you want to iterate element-wise across several objects of equal length. mapply takes any number of arguments, here four, and requires their lengths to be equal or multiples of one another (shorter arguments are recycled):

low_limits <- c(0, 0, 0)
high_limits <- c(12, 4, 12)
h_cols <- c("h_1", "h_2", "h_3")

subset_fct <- function(df, lo, hi, col)  
               setNames(data.frame(df[which(df$k > lo & df$k <= hi), col]), col)

new_df_list <- Map(subset_fct, df_list, low_limits, high_limits, h_cols)

# EQUIVALENT CALL
new_df_list <- mapply(subset_fct, df_list, low_limits, 
                      high_limits, h_cols, SIMPLIFY = FALSE)
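
Two details of the call above can be sketched in a variant. It assumes the same data layout as the question; subset_keep_df is a hypothetical name. First, drop = FALSE is another way to keep a single selected column as a data frame. Second, an argument of length one is recycled across all three data frames:

```r
set.seed(456)
df1 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df2 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df3 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df_list <- list(df1 = df1, df2 = df2, df3 = df3)

# drop = FALSE keeps the single selected column as a data frame, name included
subset_keep_df <- function(df, lo, hi, col) {
  df[df$k > lo & df$k <= hi, col, drop = FALSE]
}

# lo is a single value and is recycled; hi and col have one entry per data frame
new_df_list <- Map(subset_keep_df, df_list, 0, c(12, 4, 12), c("h_1", "h_2", "h_3"))

# Rows and columns of each result
sapply(new_df_list, dim)
```

Unlike the setNames(data.frame(...)) version, this variant keeps the original row names of the selected rows instead of renumbering them from 1.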

Output (uses set.seed(456) at the top to reproduce the random numbers)

new_df_list

# $df1
#           h_1
# 1   1.0073523
# 2   0.5732347
# 3  -0.9158105
# 4   1.3110974
# 5   0.9887263
# 6   1.6539287
# 7  -1.4408052
# 8   1.9473564
# 9   1.7369362
# 10  0.3874833
# 11  2.2800340
# 12  1.5378833

# $df2
#           h_2
# 1  0.11815133
# 2  0.86990262
# 3 -0.09193621
# 4  0.06889879

# $df3
#           h_3
# 1  -1.4122604
# 2  -0.9997605
# 3  -2.3107388
# 4   0.9386188
# 5  -1.3881885
# 6  -0.6116866
# 7   0.3184948
# 8  -0.2354058
# 9   1.0750520
# 10 -0.1007956
# 11  1.0701526
# 12  1.0358389
