R: Remove rows from a data frame based on a file name column

Question

I'm reading a large CSV file into a data frame; the file is itself a combination of several CSVs. The first column in the data frame is the file name. The file name always ends with a 5-digit number and ".csv". The number of occurrences of each file name varies. For example:

Source File
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00002.csv
xxx_00002.csv
xxx_00002.csv
xxx_00002.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
...

How would I go about removing the rows associated with the last n occurrences of each file name (say, the last 2)? I'd like to end up with:

Source File
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00002.csv
xxx_00002.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
...

Answer 1

Using dplyr:

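 library(dplyr)
 n_to_remove <- 2
 # df is the data frame; SourceFile is its file-name column (adjust to your names)
 # within each file's group, keep rows 1 .. (group size - n_to_remove)
 filtered <- group_by(df, SourceFile) %>% slice(1:(n()-n_to_remove))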

group_by makes sure that the slice operation is applied to each group separately. n() is also a dplyr function; it returns the number of rows in the current group. Note that this breaks if one of the CSVs has no more than n_to_remove rows: 1:(n() - n_to_remove) then counts downward (for example, 1:0 is c(1, 0)) instead of producing an empty range.
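
One way to sidestep the short-group problem noted above is to filter on the row position instead of building an index range. This is only a sketch, not part of the original answer: the sample data is made up, and the column name Source_File is an assumption borrowed from the other answer in this thread.

```r
library(dplyr)

# Hypothetical sample data; the last file has fewer rows than n_to_remove.
df <- data.frame(
  Source_File = c(rep("xxx_00001.csv", 5), rep("xxx_00002.csv", 4), "xxx_00004.csv"),
  stringsAsFactors = FALSE
)
n_to_remove <- 2

filtered <- df %>%
  group_by(Source_File) %>%
  # keep a row only if it is not among the last n_to_remove rows of its group;
  # in a group with n_to_remove rows or fewer the condition is FALSE for every row
  filter(row_number() <= n() - n_to_remove) %>%
  ungroup()
```

With this data, the five-row file keeps three rows, the four-row file keeps two, and the single-row file is dropped entirely rather than raising an error.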

Answer 2

We can use ave from base R:

n <- 2
df1[with(df1, !ave(seq_along(Source_File), Source_File, 
             FUN = function(x) x %in% tail(x,n))), , drop=FALSE]
#     Source_File
#1  xxx_00001.csv
#2  xxx_00001.csv
#3  xxx_00001.csv
#6  xxx_00002.csv
#7  xxx_00002.csv
#10 xxx_00003.csv
#11 xxx_00003.csv
#12 xxx_00003.csv
#13 xxx_00003.csv
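
To see what the ave call is doing, it can help to pull the pieces apart. The sketch below uses a small made-up vector of file names (not df1) so the intermediate values are easy to read; the variable names are illustrative only.

```r
# Hypothetical input, smaller than df1.
files <- c("a.csv", "a.csv", "a.csv", "b.csv", "b.csv", "b.csv", "b.csv")
n <- 2

# ave() applies FUN to the row indices of each group and returns the results
# in the original row order. Because the results are written back into the
# integer index vector, the flags come back as 0/1 rather than TRUE/FALSE.
is_last_n <- ave(seq_along(files), files,
                 FUN = function(idx) idx %in% tail(idx, n))

keep <- !is_last_n          # `!` turns 0/1 into TRUE/FALSE (0 becomes TRUE)
kept_files <- files[keep]   # one "a.csv" and two "b.csv" remain
```

A group with n rows or fewer is flagged in full, so all of its rows are removed.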

Or with data.table:

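library(data.table)
# keep.rownames=TRUE adds an "rn" column, so .SD is not empty for this one-column data;
# head(.SD, -n) drops the last n rows of each Source_File group, then rn is removed again
setDT(df1, keep.rownames=TRUE)[, head(.SD, -n) ,.(Source_File)][, rn:=NULL][]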
#     Source_File
#1: xxx_00001.csv
#2: xxx_00001.csv
#3: xxx_00001.csv
#4: xxx_00002.csv
#5: xxx_00002.csv
#6: xxx_00003.csv
#7: xxx_00003.csv
#8: xxx_00003.csv
#9: xxx_00003.csv
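
The data.table version relies on head() with a negative count, which drops the last n elements and returns an empty result when there are n elements or fewer. As a rough base R illustration of the same idea (the data frame df_small and the helper names are made up for this sketch):

```r
# Hypothetical data; the last file has a single row.
df_small <- data.frame(
  Source_File = c(rep("xxx_00001.csv", 5), rep("xxx_00002.csv", 4), "xxx_00009.csv"),
  stringsAsFactors = FALSE
)
n <- 2

# Row indices per file, each trimmed with head(., -n). A file with n rows
# or fewer contributes integer(0), so it drops out of the result.
rows_by_file <- split(seq_len(nrow(df_small)), df_small$Source_File)
keep_idx <- sort(unlist(lapply(rows_by_file, head, -n), use.names = FALSE))

trimmed <- df_small[keep_idx, , drop = FALSE]
```

Sorting the kept indices preserves the original row order even if rows from different files were interleaved.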

data

df1 <- structure(list(Source_File = c("xxx_00001.csv", "xxx_00001.csv", 
"xxx_00001.csv", "xxx_00001.csv", "xxx_00001.csv", "xxx_00002.csv", 
"xxx_00002.csv", "xxx_00002.csv", "xxx_00002.csv", "xxx_00003.csv", 
"xxx_00003.csv", "xxx_00003.csv", "xxx_00003.csv", "xxx_00003.csv", 
"xxx_00003.csv")), .Names = "Source_File", class = "data.frame", 
row.names = c(NA, -15L))

Answer 1
1, accepted, 2016-04-12 18:18:05

Answer 2
0, 2016-04-13 02:18:43
