从多个数据框（.tab）中删除特定列，然后将它们合并到R中

Question

我在一个名为file1.tab，file2.tab，..... file24.tab的文件夹中有24个“ .tab”文件。 每个文件都是一个具有4列和50,000行的数据帧：该文件看起来像附加的图像-

第一列在所有24个文件中都是相同的，但是列2,3和4在24个文件中的每个文件中具有不同的值。 对我而言，每个数据框的第3列和第4列都无关紧要。 我可以通过以下步骤分别摆脱每个数据框中的列：

filenames <- Sys.gob("*.tab")  #reads all the 24 file names
dataframe1 <- read.tab(filenames[1]) 
dataframe1 <- dataframe1[, -c(3,4)] #removes 3rd and 4th column of dataframe

但是，当我必须对24个（或更多）相似的文件分别重复上述操作时，这变得非常忙碌。 有没有一种方法可以执行上述操作，即用一个代码从所有24个文件中删除第3列和第4列？

第二部分 ：

从24个文件中删除第3列和第4列后，我想创建一个具有25列的新数据框，这样第一列是Column1（在所有文件中都相同），随后的列是每个文件。

对于两个数据帧df1和df2，我使用：

merge(df1,df2,1,1)

并创建一个新的数据框。 对24个修改后的数据帧单独进行合并操作将非常繁琐。 请你帮助我好吗？

PS-我试图找到任何类似问题的答案（如果之前曾问过），但找不到。 因此，如果将其标记为重复项，请在已回答问题的地方放置一个链接，这将非常有益。 我刚刚开始学习R，并且没有任何经验。

此致Kshitij

Answer 1

首先让我们列出一个假文件列表

fakefile <- 'a\tb\tc\td
1\t2\t3\t4'

# In your case instead oof the string it would be the name of the file,
# and therefore it would not have the `text` argument
str(read.table(text = fakefile, header = TRUE))


## 'data.frame':    1 obs. of  4 variables:
##  $ a: int 1
##  $ b: int 2
##  $ c: int 3
##  $ d: int 4

# This list would be analogous to your `filenames` list
fakefile_list <- rep(fakefile, 20)
str(fakefile_list)


##  chr [1:20] "a\tb\tc\td\n1\t2\t3\t4" "a\tb\tc\td\n1\t2\t3\t4" ...

原则上，所有解决方案将具有与列表相同的基础工作，然后具有合并概念（尽管合并可能在此处和此处有所不同）。

解决方案1-如果可以依靠第1列的顺序

如果您可以依靠列的顺序，那么您实际上不需要读取每个文件的第1列和第4列，而只需读取第4列并绑定它们。

# Reading column 1 once....
col1 <- read.table(text = fakefile_list[1], header = TRUE)[,1]

# Reading cols 4 in all files

# We first make a function that does our tasks (reading and removing cols)

reader_fun <- function(x) {
    read.table(text = x, header = TRUE)[,4] 
}

# Then we use lapply to use that function on each elment of our list

cols4 <- lapply(fakefile_list, FUN = reader_fun)
str(cols4)


## List of 20
##  $ : int 4
##  $ : int 4
##  $ : int 4
##  $ : int 4


# Then we use do.call and cbind to merge all of them as a matrix
cols4_mat <- do.call(cbind, cols4)

# And finally add column 1 to it
data.frame(col1, cols4_mat)

##   col1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1    1  4  4  4  4  4  4  4  4  4   4   4   4   4   4   4   4   4   4   4
##   X20
## 1   4

解决方案2-如果您不能依靠订单

实施起来比较容易，但是在大多数情况下要慢得多

# In your case it would be like this ...
# lapply(fakefile_list, FUN = function(x) read.table(x)[, c(1,4)], header = TRUE)

# But since im passing text and not file names ...
my_contents <- lapply(fakefile_list, FUN = function(x, ...) read.table(text = x, ...)[, c(1,4)], header = TRUE)

# And now we use full join and Reduce to merge everything
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)


##   a d.x d.y d.x.x d.y.y d.x.x.x d.y.y.y d.x.x.x.x d.y.y.y.y d.x.x.x.x.x
## 1 1   4   4     4     4       4       4         4         4           4
##   d.y.y.y.y.y d.x.x.x.x.x.x d.y.y.y.y.y.y d.x.x.x.x.x.x.x d.y.y.y.y.y.y.y
## 1           4             4             4               4               4
##   d.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x
## 1                 4                 4                   4
##   d.y.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y.y.y
## 1                   4                     4                     4


# you will need to modify the column names btw ...

奖金-最简洁的解决方案...

根据数据集的大小，您可能希望一开始就忽略多余的列（而不是先读取然后删除它们）。 您可以使用data.table包中的fread为您完成此操作。

reader_function <- function(x) {
    data.table::fread(x, select = c(1,4))
}

my_contents <- lapply(fakefile_list, FUN = reader_function)
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)

Answer 2

尽管上述塞巴斯蒂安的答案非常完美，但我本人却想出了另一种使用for循环解决上述问题的方法。 因此，我共享该解决方案，以防其他人有类似问题并且使用此方法感到自在。

首先，我将工作目录设置为包含文件的文件夹。 这是使用setwd（）命令完成的。

setwd("/absolute path to the folder containing files/") #set working directory to the folder containing files

现在，我定义文件的路径，以便可以列出文件。

path <- "/absolute path to the folder containing files/" #define the path to the folder

我创建了我感兴趣的文件名列表。

filenames<- dir(path, "*.tab") #List the files in the folder

现在，通过以下代码，使用第一个文件的第1列和第2列创建一个新文件

out_file<- read.table(filenames[1])[,c(1:2)] #create an output file with column1 and column2 of the first file

我编写了一个for循环，现在仅读取文件2到24的第二列，并将来自每个文件的第二列添加到上面定义的out_file中。

for(i in 2:length(filenames)){ #iterates from the second file as the first 2 columns of the first file has already been assigned to out_file
file<-read.table(filenames[i], header=FALSE, stringsAsFactors= FALSE) #reads files 
out_file<- cbind(out_file, file[,2]) #adds second column of each file
}

上面的代码实际执行的操作是遍历每个文件，提取列2并将其添加到out_file中，从而创建我感兴趣的文件。

从多个数据框（.tab）中删除特定列，然后将它们合并到R中

问题描述

2 个解决方案

解决方案1
0 已采纳 2018-06-15 14:29:54

解决方案2
0 2018-06-25 11:49:29

从多个数据框（.tab）中删除特定列，然后将它们合并到R中

问题描述

2 个解决方案

解决方案1 0 已采纳 2018-06-15 14:29:54

解决方案2 0 2018-06-25 11:49:29

解决方案1
0 已采纳 2018-06-15 14:29:54

解决方案2
0 2018-06-25 11:49:29