Efficient way to apply a function over a list of dataframes

Question

I have a list of dataframes in R. I need to apply a function to each dataframe (in this case, removing special characters) and get back a list of dataframes.

Using lapply and as.data.frame, the following works fine and delivers exactly what I need:

my_df =data.frame(names = seq(1,10), chars = c("abcabc!!", "abcabc234234!!"))
my_list = list(my_df, my_df, my_df)

#str(my_list)
List of 3
 $ :'data.frame':   10 obs. of  2 variables: ...

new_list <- lapply(my_list, function(y) as.data.frame(lapply(y, function(x) gsub("[^[:alnum:][:space:]']", "", x))))

# str(new_list)
List of 3
 $ :'data.frame':   10 obs. of  2 variables:
  ..$ names: Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
  ..$ chars: Factor w/ 2 levels "abcabc","abcabc234234": 1 2 1 2 1 2 1 2 1 2
 $ :'data.frame':   10 obs. of  2 variables:
  ..$ names: Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
  ..$ chars: Factor w/ 2 levels "abcabc","abcabc234234": 1 2 1 2 1 2 1 2 1 2
 $ :'data.frame':   10 obs. of  2 variables:
  ..$ names: Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
  ..$ chars: Factor w/ 2 levels "abcabc","abcabc234234": 1 2 1 2 1 2 1 2 1 2

But I am wondering if there is a more efficient way that doesn't require a nested lapply. Perhaps a different apply-family function that returns the elements as a dataframe?

Answer 1

We don't need a nested lapply; a single lapply with transform does it:

lapply(my_list, transform, chars = gsub("[^[:alnum:][:space:]']", "", chars))

The pattern can be made more compact as "[^[:alnum:] ']", although this version keeps only literal spaces, whereas [:space:] also covers tabs and newlines.
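As a quick check of the single-lapply approach, the sketch below rebuilds the question's sample data and applies transform to each dataframe. It assumes the column to clean is called chars, as in the question; the name cleaned is introduced here only for illustration, and stringsAsFactors = FALSE is set so the result does not depend on the R version.

```r
# Sample data from the question: ten rows, two columns
my_df <- data.frame(names = seq(1, 10),
                    chars = c("abcabc!!", "abcabc234234!!"),
                    stringsAsFactors = FALSE)
my_list <- list(my_df, my_df, my_df)

# One lapply; transform() evaluates gsub() inside each dataframe
cleaned <- lapply(my_list, transform,
                  chars = gsub("[^[:alnum:][:space:]']", "", chars))

str(cleaned[[1]])
```

Because transform only touches the columns named in the call, the names column keeps its numeric type instead of being passed through gsub.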

Answer 2

While @akrun is right that your second lapply call is unnecessary in this example, I think that answer does not solve the general case, where many columns might be relevant and it is not known in advance which ones.
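One possible way to handle that general case in base R is to detect the character columns at run time and clean only those. This is a minimal sketch, not part of the original answers; clean_char_cols, example_df, example_list and cleaned_list are hypothetical names used for illustration.

```r
# Remove everything except letters, digits, whitespace and apostrophes
f <- function(x) gsub("[^[:alnum:][:space:]']", "", x)

# Hypothetical helper: clean only the character columns of one dataframe
clean_char_cols <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], f)
  df
}

example_df <- data.frame(id = 1:3,
                         a = c("x!y", "it's", "ok"),
                         b = c("#1", "2%", "3"),
                         stringsAsFactors = FALSE)
example_list <- list(example_df, example_df)

cleaned_list <- lapply(example_list, clean_char_cols)
str(cleaned_list[[1]])
```

The sketch assumes the text columns are stored as character vectors; factor columns would need a different test, such as is.factor, and a conversion step.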

What is inefficient here is the conversion back with as.data.frame, not the inner lapply call. The lapply call itself is almost as fast as applying the function to a single vector or matrix of the same size.
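If the as.data.frame step is the concern, one base R alternative is to assign the cleaned columns back into the existing dataframe with y[] <- ..., which keeps the object a dataframe without rebuilding it from a list. This is a sketch that was not in the original answer; whether it is actually faster depends on the data, so it is worth timing on your own input. The names df, df_list, via_as_df and via_assign are illustrative.

```r
f <- function(x) gsub("[^[:alnum:][:space:]']", "", x)

df <- data.frame(a = c("a!b", "c?d"), b = c("e#f", "g h"),
                 stringsAsFactors = FALSE)
df_list <- list(df, df)

# Nested form from the question: rebuilds each dataframe from a list
via_as_df <- lapply(df_list, function(y)
  as.data.frame(lapply(y, f), stringsAsFactors = FALSE))

# Alternative: y[] <- ... replaces the columns but keeps y a dataframe
via_assign <- lapply(df_list, function(y) {
  y[] <- lapply(y, f)
  y
})

# Compare the two results
all.equal(via_as_df, via_assign)
```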

If you really want to be more time-efficient here, I would suggest using data.table. I've made the example a bit larger so we can time it.

library(data.table)

f <- function(x) gsub("[^[:alnum:][:space:]']", "", x)

my_df <- as.data.frame(matrix(paste0(sample(c(letters, '!'), size = 1000000, replace = TRUE),
                                     sample(c(letters, '!'), size = 1000000, replace = TRUE)),
                              ncol = 250), stringsAsFactors = FALSE)
my_list = list(my_df, my_df, my_df)

system.time(lapply(my_list, function(y) as.data.frame(lapply(y, f))))
# 2.256 seconds

my_dt <- as.data.table(my_df)
my_list2 = list(my_dt, my_dt, my_dt)

system.time(lapply(my_list2, function(y) y[,lapply(.SD,f)]))
# 1.180 seconds
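When only some columns hold text, the data.table call can be restricted with .SDcols and updated by reference with :=. The following is a sketch under that assumption rather than the answerer's own code; dt, dt_list, chr_cols and cleaned are hypothetical names, and copy() is used so the list elements are independent tables.

```r
library(data.table)

f <- function(x) gsub("[^[:alnum:][:space:]']", "", x)

dt <- data.table(id = 1:3,
                 a = c("x!y", "it's", "ok"),
                 b = c("#1", "2%", "3"))
dt_list <- list(copy(dt), copy(dt))

cleaned <- lapply(dt_list, function(y) {
  # Pick the character columns of this table
  chr_cols <- names(y)[vapply(y, is.character, logical(1))]
  # := updates those columns by reference, so y is not copied
  y[, (chr_cols) := lapply(.SD, f), .SDcols = chr_cols]
  y
})

str(cleaned[[1]])
```

Note that := also modifies the tables stored in dt_list, since data.table updates by reference; take a copy() first if the originals must stay unchanged.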

2 answers

Answer 1
4, accepted, 2016-12-20 11:28:58

Answer 2
1, 2016-12-20 12:22:08