R: Remove rows from data frame based on file name column
I've got a large CSV file, itself a combination of several CSVs, that I'm reading into a data frame. The first column in the data frame is the file name. The file name always ends with a 5-digit number and ".csv". The number of occurrences of each file name varies.
Example:
Source File
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00002.csv
xxx_00002.csv
xxx_00002.csv
xxx_00002.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
...
How would I go about removing the rows associated with the last n occurrences of each file name (say, the last 2)? I'd like to end up with:
Source File
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00002.csv
xxx_00002.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
...
Using dplyr:
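library(dplyr)
n_to_remove <- 2  # number of trailing rows to drop for each file name
# Within each file-name group, keep rows 1 through (group size - n_to_remove)
filtered <- group_by(df, SourceFile) %>% slice(1:(n()-n_to_remove))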
group_by makes sure that the slice operation occurs for each group separately. n() is also a function from dplyr; it returns the number of rows inside the current group. Note that this will fail if one of the CSVs has fewer rows than n_to_remove, and it is unreliable whenever a group has n_to_remove rows or fewer, because 1:(n() - n_to_remove) then counts downward (for example, 1:0 is c(1, 0)) rather than producing an empty range.
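One way to sidestep the short-group problem is to filter on the row position instead of building an index range. The following is only a sketch, not the original answer's method: the sample data and the column name Source_File are assumptions made for illustration, and the third file deliberately has fewer rows than n_to_remove.

```r
library(dplyr)

# Hypothetical sample data; the last file has only one row
df1 <- data.frame(
  Source_File = c(rep("xxx_00001.csv", 5),
                  rep("xxx_00002.csv", 4),
                  "xxx_00003.csv"),
  stringsAsFactors = FALSE
)
n_to_remove <- 2

filtered <- df1 %>%
  group_by(Source_File) %>%
  # Keep a row only if it is not among the last n_to_remove rows of its group;
  # for groups with n_to_remove rows or fewer the condition is FALSE everywhere
  filter(row_number() <= n() - n_to_remove) %>%
  ungroup()
```

With this comparison, a file that has n_to_remove rows or fewer is dropped entirely rather than raising an error or keeping a stray row.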
We can use ave from base R:
n <- 2
df1[with(df1, !ave(seq_along(Source_File), Source_File,
FUN = function(x) x %in% tail(x,n))), , drop=FALSE]
# Source_File
#1 xxx_00001.csv
#2 xxx_00001.csv
#3 xxx_00001.csv
#6 xxx_00002.csv
#7 xxx_00002.csv
#10 xxx_00003.csv
#11 xxx_00003.csv
#12 xxx_00003.csv
#13 xxx_00003.csv
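The one-liner above is fairly dense, so here is a more explicit base R sketch of the same idea, under the assumption that the data frame is df1 with a single Source_File column as in the question. The helper names idx, pos, size and kept are made up for illustration.

```r
# Sample data mirroring the question (5, 4 and 6 rows per file)
df1 <- data.frame(
  Source_File = c(rep("xxx_00001.csv", 5),
                  rep("xxx_00002.csv", 4),
                  rep("xxx_00003.csv", 6)),
  stringsAsFactors = FALSE
)
n <- 2

idx  <- seq_along(df1$Source_File)
# Position of each row within its file-name group: 1, 2, 3, ...
pos  <- ave(idx, df1$Source_File, FUN = seq_along)
# Number of rows in the group each row belongs to
size <- ave(idx, df1$Source_File, FUN = length)

# Keep rows that are not among the last n of their group
kept <- df1[pos <= size - n, , drop = FALSE]
```

Here drop = FALSE keeps the result a data frame even though it has only one column, as in the original call.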
Or with data.table:
library(data.table)
setDT(df1, keep.rownames=TRUE)[, head(.SD, -n) ,.(Source_File)][, rn:=NULL][]
# Source_File
#1: xxx_00001.csv
#2: xxx_00001.csv
#3: xxx_00001.csv
#4: xxx_00002.csv
#5: xxx_00002.csv
#6: xxx_00003.csv
#7: xxx_00003.csv
#8: xxx_00003.csv
#9: xxx_00003.csv
Data:
df1 <- structure(list(Source_File = c("xxx_00001.csv", "xxx_00001.csv",
"xxx_00001.csv", "xxx_00001.csv", "xxx_00001.csv", "xxx_00002.csv",
"xxx_00002.csv", "xxx_00002.csv", "xxx_00002.csv", "xxx_00003.csv",
"xxx_00003.csv", "xxx_00003.csv", "xxx_00003.csv", "xxx_00003.csv",
"xxx_00003.csv")), .Names = "Source_File", class = "data.frame",
row.names = c(NA, -15L))