R: Remove rows from data frame based on file name column

I've got a large CSV file that I'm reading into a data frame; the file is itself a combination of several CSVs. The first column in the data frame is the file name. The file name always ends with a 5-digit number and ".csv". The number of occurrences of each file name varies. Example:

Source File
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00002.csv
xxx_00002.csv
xxx_00002.csv
xxx_00002.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
...

How would I go about removing the rows associated with the last n occurrences of each file name (say, the last 2)? I'd like to end up with:

Source File
xxx_00001.csv
xxx_00001.csv
xxx_00001.csv
xxx_00002.csv
xxx_00002.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
xxx_00003.csv
...

Using dplyr:

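 library(dplyr)
 n_to_remove <- 2
 # df is your data frame and SourceFile its file name column
 # (df1 and Source_File in the sample data below)
 filtered <- group_by(df, SourceFile) %>% slice(1:(n()-n_to_remove))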

group_by makes sure that the slice operation is applied to each group separately. n() is also a dplyr function; it returns the number of rows in the current group. Note that this breaks down if any of the CSVs has n_to_remove rows or fewer: 1:(n() - n_to_remove) then counts downward instead of being empty (1:0 is c(1, 0)), so the group is not emptied as intended, and once the end point is negative slice() stops with an error because positive and negative indices are mixed.
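
If some files may have n_to_remove rows or fewer, one possible workaround is to filter on the row position instead of building an index range. This is only a sketch, not part of the original answer: it swaps slice() for filter() with row_number(), and the sample data frame and its value column are made up for illustration.

```r
library(dplyr)

# Hypothetical sample data: file "b.csv" has fewer rows than n_to_remove
df <- data.frame(
  SourceFile = c("a.csv", "a.csv", "a.csv", "a.csv", "b.csv"),
  value      = 1:5
)
n_to_remove <- 2

filtered <- df %>%
  group_by(SourceFile) %>%
  # keep a row only if it is not among the last n_to_remove rows of its group;
  # for small groups the condition is simply FALSE for every row
  filter(row_number() <= n() - n_to_remove) %>%
  ungroup()
```

With this data, the first two rows of a.csv are kept and b.csv disappears entirely, without an error.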

We can use ave from base R:

n <- 2
df1[with(df1, !ave(seq_along(Source_File), Source_File, 
             FUN = function(x) x %in% tail(x,n))), , drop=FALSE]
#     Source_File
#1  xxx_00001.csv
#2  xxx_00001.csv
#3  xxx_00001.csv
#6  xxx_00002.csv
#7  xxx_00002.csv
#10 xxx_00003.csv
#11 xxx_00003.csv
#12 xxx_00003.csv
#13 xxx_00003.csv
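
The one-liner above is fairly dense, so here is a more spelled-out sketch of the same idea in base R. It is a variant rather than the answer's exact code: it uses ave twice, once to get each row's position within its file and once to get the size of each file's group. The sample data and the value column are assumptions made for illustration.

```r
# Hypothetical sample data: "b.csv" has fewer rows than n
df <- data.frame(
  Source_File = c("a.csv", "a.csv", "a.csv", "b.csv"),
  value       = 1:4
)
n <- 2

idx  <- seq_along(df$Source_File)
# position of each row within its file (1, 2, 3, ...)
pos  <- ave(idx, df$Source_File, FUN = seq_along)
# number of rows belonging to the same file, repeated for every row
size <- ave(idx, df$Source_File, FUN = length)

# keep a row unless it is among the last n rows of its file
result <- df[pos <= size - n, , drop = FALSE]
```

Because the comparison is done per row, a file with n rows or fewer just contributes no rows to the result.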

Or with data.table:

library(data.table)
setDT(df1, keep.rownames=TRUE)[, head(.SD, -n) ,.(Source_File)][, rn:=NULL][]
#     Source_File
#1: xxx_00001.csv
#2: xxx_00001.csv
#3: xxx_00001.csv
#4: xxx_00002.csv
#5: xxx_00002.csv
#6: xxx_00003.csv
#7: xxx_00003.csv
#8: xxx_00003.csv
#9: xxx_00003.csv
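
Note that setDT converts df1 to a data.table in place. If the original data frame should stay untouched, a possible alternative is to work on a copy and select row numbers with .I. The following is a sketch under that assumption, not the answer's own code; the sample data and the value column are hypothetical.

```r
library(data.table)

# Hypothetical sample data: "b.csv" has fewer rows than n
df <- data.frame(
  Source_File = c("a.csv", "a.csv", "a.csv", "b.csv"),
  value       = 1:4
)
n <- 2

# as.data.table() makes a copy, so df itself remains a plain data frame
dt <- as.data.table(df)

# .I holds the row numbers of the current group;
# head(.I, -n) drops the last n of them (or all, if the group is small)
keep   <- dt[, head(.I, -n), by = Source_File]$V1
result <- dt[keep]
```

Selecting by row number also works when the file name is the only column, which is why no row-name helper column is needed here.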

data

df1 <- structure(list(Source_File = c("xxx_00001.csv", "xxx_00001.csv", 
"xxx_00001.csv", "xxx_00001.csv", "xxx_00001.csv", "xxx_00002.csv", 
"xxx_00002.csv", "xxx_00002.csv", "xxx_00002.csv", "xxx_00003.csv", 
"xxx_00003.csv", "xxx_00003.csv", "xxx_00003.csv", "xxx_00003.csv", 
"xxx_00003.csv")), .Names = "Source_File", class = "data.frame", 
row.names = c(NA, -15L))
