R: for-loop solution to deleting columns from multiple data frames

My question is probably quite simple, but I think my code could definitely be improved. Right now it's two for-loops, and I'm sure there's a way to do what I need in a single loop, but for the life of me I can't see what it is.

Having searched Stack Overflow, I found this excellent answer from Ananda, where he was able to extract and keep columns within a range using lapply and for-loop methods. The structure of my data gets in the way, however, as I want to be able to pick specific columns to delete. My data structure looks like this:

1   AAAT_1  1   GROUP   ****    1   -13.70  0
2   AAAT_2  51  GROUP   ****    1   -9.21   0
3   AAAT_3  101 GROUP   ****    1   -7.60   0
4   AAAT_4  151 GROUP   ****    1   -6.28   0

It's an extract from some docking software, and the only columns I want to keep are 2 (e.g. AAAT_1) and 7 (e.g. -13.70). The code I've used to do it is two for-loops:

for (i in 1:length(temp)) {
  assign(temp[i], get(temp[i])[2:7])
}

...to keep the data from columns 2-7, followed by:

for (i in 1:length(temp)) {
  assign(temp[i], get(temp[i])[-2:-5])
}

...to delete the rest of the columns I didn't need, where temp is just a list of the names of the data frames the loop is acting on.

So, as you can see, it's just two loops doing similar actions. Surely there's a way to pick specific columns to keep/delete and do it all in one loop/lapply statement? Trying things like [2,7] in the get statement doesn't work; it appears to keep only column 7 and turns each data frame into 'Values' instead. I'm not sure what's going on, so any insight there would be wonderful but, either way, if anyone can turn this two-loop solution into one, it would be really appreciated. I definitely feel like I'm missing something really simple/obvious.

Cheers.

EDIT: I have taken into account the vectorised solutions from below and now do the following instead. The names of the raw imported data start with stuff like F0001, F0002, etc., hence the pattern used to make the initial list.

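# Gather every data frame whose name matches F<digits> into a named list
lst <- mget(ls(pattern = '^F\\d+'))

# Keep only columns V2 and V7 of each data frame
lst <- lapply(lst, "[", TRUE, c("V2", "V7"))

# Write each data frame back to the global environment under its name in temp
lst <- lapply(seq_along(lst),
              function(i, x) {assign(paste0(temp[i]), x[[i]], envir = .GlobalEnv)},
              x = lst)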

I know loops get a bad rap in R; they were a natural solution to me as a C++ programmer but meh, this was far quicker. Initially, the only downside from the other example was that the assign command pasted a letter onto each of the created tables in sequence 1,2,3,...,n, when the list of raw imported data files wasn't entirely in numerical order (i.e. 1,2,3,5,6,10,... etc.), so this didn't preserve that order. So I had to use a list of the files (our old friend temp) to name them correctly. It's a minor thing, and the code isn't much shorter than two loops, but it's most certainly faster.

So, in short, the above three lines add all the imported raw data to a list, keep only the columns I need, then split the list up into separate data frames whilst preserving the correct names. Cheers for the help!
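As a possible simplification of the last step: mget() returns a list named after the objects it fetched, so list2env() can write the trimmed data frames back under their original names without needing temp. A minimal sketch, assuming two toy data frames F0001 and F0003 (hypothetical stand-ins for the imported files) held in a scratch environment env; in the setup described above, env would be .GlobalEnv:

```r
# Scratch environment standing in for the workspace (assumption for this demo)
env <- new.env()

# Toy stand-ins for the imported files (made-up data, default V1..V8 names)
make_df <- function(ids) {
  data.frame(V1 = seq_along(ids), V2 = ids, V3 = 1, V4 = "GROUP",
             V5 = "****", V6 = 1, V7 = -seq_along(ids), V8 = 0,
             stringsAsFactors = FALSE)
}
env$F0001 <- make_df(c("AAAT_1", "AAAT_2"))
env$F0003 <- make_df(c("BBBT_1", "BBBT_2", "BBBT_3"))

# Collect the data frames into a list; names(lst) are the object names
lst <- mget(ls(envir = env, pattern = "^F\\d+"), envir = env)

# Keep only the two columns of interest in every data frame
lst <- lapply(lst, "[", TRUE, c("V2", "V7"))

# Write each element back under its own name, overwriting the originals
list2env(lst, envir = env)
```

Because the names travel with the list, gaps in the numbering (F0001, F0003, ...) are preserved automatically.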

If you have a data frame, you index rows and columns with

data.frame[row, column]

So, data.frame[2,7] will give you the value of the 2nd row in the 7th column. I guess you were looking for

temp <- temp[, c(2,7)]

or, if temp is a list of data frames

temp <- lapply(temp, function(x) x[, c(2,7)])

So, if you want to use a vector of numbers as column or row indices, create this vector with c(...). If I understand your example right, you don't need any loop command if you use lapply.
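To make the difference concrete, here is a small sketch on made-up data (the data frame df and the object names in temp are hypothetical). df[2, 7] asks for row 2 of column 7 and returns a single value, which is presumably why the data frames turned into 'Values', while df[, c(2, 7)] keeps columns 2 and 7 as a data frame. If the data frames are separate objects whose names are stored in temp, as in the question, the same index also collapses the two loops into one:

```r
# Made-up data shaped like the docking output in the question
df <- data.frame(V1 = 1:4,
                 V2 = paste0("AAAT_", 1:4),
                 V3 = c(1, 51, 101, 151),
                 V4 = "GROUP",
                 V5 = "****",
                 V6 = 1,
                 V7 = c(-13.70, -9.21, -7.60, -6.28),
                 V8 = 0,
                 stringsAsFactors = FALSE)

df[2, 7]         # a single value: row 2, column 7
df[, c(2, 7)]    # a data frame holding columns 2 and 7

# A list of data frames: apply the same column index to each element
dfs <- list(a = df, b = df)
dfs <- lapply(dfs, function(x) x[, c(2, 7)])

# Separate objects named in a character vector: one loop instead of two
F0001 <- df
F0002 <- df
temp <- c("F0001", "F0002")   # hypothetical object names
for (i in seq_along(temp)) {
  assign(temp[i], get(temp[i])[, c(2, 7)])
}
```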

A for loop? Maybe I'm missing something, but why not use the solution proposed by @Daniel, or a dplyr approach like this?

data
  V1     V2  V3    V4   V5 V6     V7 V8
1  1 AAAT_1   1 GROUP ****  1 -13.70  0
2  2 AAAT_2  51 GROUP ****  1  -9.21  0
3  3 AAAT_3 101 GROUP ****  1  -7.60  0
4  4 AAAT_4 151 GROUP ****  1  -6.28  0

and here is the code:

library(dplyr)
data <- select(data, V2, V7)
data
      V2     V7
1 AAAT_1 -13.70
2 AAAT_2  -9.21
3 AAAT_3  -7.60
4 AAAT_4  -6.28
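With several data frames, the same select() call can be mapped over a list. A minimal sketch, assuming dplyr is installed and using a made-up list dfs (the names F0001 and F0002 are placeholders):

```r
library(dplyr)

# Made-up list of two data frames with default V1..V8 column names
make_df <- function(n) {
  data.frame(V1 = seq_len(n), V2 = paste0("AAAT_", seq_len(n)), V3 = 1,
             V4 = "GROUP", V5 = "****", V6 = 1, V7 = -seq_len(n), V8 = 0)
}
dfs <- list(F0001 = make_df(4), F0002 = make_df(2))

# Keep V2 and V7 in every data frame of the list
dfs <- lapply(dfs, function(x) select(x, V2, V7))
```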

