Remove consecutive duplicates from dataframe
I have a data frame from which I want to remove consecutive duplicate rows (in base R). I know rle may be helpful here, but I can't think of how to use it. The example output below should illustrate what I'm asking for.
Generate sample data:
set.seed(12)
samps <- sample(1:5, 20, replace = TRUE)
dat <- data.frame(v1=LETTERS[samps], v2=month.abb[samps])
dat[10, 2] <- "Mar"
Sample data:
v1 v2
1 A Jan
2 E May
3 E May
4 B Feb
5 A Jan
6 A Jan
7 A Jan
8 D Apr
9 A Jan
10 A Mar
11 B Feb
12 E May
13 B Feb
14 B Feb
15 B Feb
16 C Mar
17 C Mar
18 C Mar
19 D Apr
20 A Jan
Desired outcome:
v1 v2
1 A Jan
3 E May
4 B Feb
7 A Jan
8 D Apr
10 A Mar
11 B Feb
12 E May
15 B Feb
18 C Mar
19 D Apr
20 A Jan
Here's a way, not with rle, but a way nonetheless:
dat[with(dat, c(TRUE, diff(as.numeric(interaction(v1, v2))) != 0)), ]
This assumes you're using factor columns, as your sample data implies.
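The same idea can be written without going through factor codes, by comparing each row with the one before it. This is only a minimal sketch: toy is a hypothetical stand-in for dat, not data from the question, and it assumes there are no NA values in the columns.

```r
# Toy data (hypothetical, for illustration only)
toy <- data.frame(v1 = c("A", "E", "E", "B", "A", "A"),
                  v2 = c("Jan", "May", "May", "Feb", "Jan", "Mar"),
                  stringsAsFactors = FALSE)

n <- nrow(toy)

# TRUE where a row differs from the previous row in at least one column;
# the first row has no predecessor, so it is always kept.
changed <- c(TRUE, rowSums(toy[-1, ] != toy[-n, ]) > 0)

first_of_run <- toy[changed, ]                 # keeps the first row of each run
last_of_run  <- toy[c(changed[-1], TRUE), ]    # keeps the last row of each run
```

Shifting the logical vector by one position is what switches between keeping the first and the last row of each run.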
Here is a fast solution using filter:
dat[(filter(dat,c(-1,1))!= 0)[,1],]
v1 v2
1 A Jan
3 E May
4 B Feb
7 A Jan
8 D Apr
10 A Mar
11 B Feb
12 E May
15 B Feb
18 C Mar
19 D Apr
NA <NA> <NA>
You need to add the last row of the original data to the result.
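One way to do that, sketched under the assumption that filter here is stats::filter and that a single column is being checked: treat the trailing NA as TRUE, since the last row always ends a run. The vector x is a hypothetical stand-in for the numeric codes of one column.

```r
# Hypothetical numeric codes of a single column (made up for illustration)
x <- c(1, 5, 5, 2, 1, 1, 1, 4)

# With the two-sided filter c(-1, 1), element i is x[i] - x[i + 1];
# the final element has no successor and comes back as NA.
d <- as.numeric(stats::filter(x, c(-1, 1)))

# A position ends a run when the difference is non-zero;
# the last position (NA) always ends a run.
keep <- is.na(d) | d != 0
which(keep)
```

Calling stats::filter explicitly also avoids picking up a different filter function if a package that defines one is attached.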
Using rle, I came up with this:
ind <- cumsum(rle(as.character(dat$v1))$lengths)
dat[ind, ]
ind holds the position of the last row in each run of consecutive entries. Note that only v1 is checked here.
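To make the indexing concrete, here is a small sketch of what rle returns and how cumsum turns run lengths into row positions. The vector x is made up for illustration.

```r
x <- c("A", "E", "E", "B", "A", "A", "A", "D")

r <- rle(x)
r$lengths   # length of each run of identical values
r$values    # the value repeated in each run

last_idx  <- cumsum(r$lengths)          # position of the last element of each run
first_idx <- last_idx - r$lengths + 1   # position of the first element of each run
```

Indexing with first_idx instead of last_idx keeps the first row of each run rather than the last.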
EDIT:
A simple solution to Matthew's comment would be:
dat[15, 2] <- "May"
dat[cumsum(rle(paste0(dat$v1, dat$v2))$lengths), ]
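One caveat with pasting columns together: without a separator, two different rows can collapse to the same key. A hedged sketch of the same approach with a separator follows; the toy data and the choice of "\r" are assumptions, and any string that cannot occur in the data would do.

```r
# Toy data (hypothetical) where plain concatenation is ambiguous
toy <- data.frame(v1 = c("a", "ab", "ab", "c"),
                  v2 = c("bc", "c", "c", "d"),
                  stringsAsFactors = FALSE)

# Without a separator, rows 1 and 2 both become "abc" and look like duplicates
key_plain <- paste0(toy$v1, toy$v2)

# With a separator the two rows stay distinct
key_sep <- paste(toy$v1, toy$v2, sep = "\r")

result <- toy[cumsum(rle(key_sep)$lengths), ]
result
```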