Removing duplicated observations in R with restriction

I have a dataset that contains duplicates of the ident variable. I need to select only one observation for each ident, and it needs to be the newest one, i.e. the resulting data should contain, for each ident, the observation whose 'year' is the highest in the initial data set.

I believe a general case would look like this:

1. ident   value   year
 2. A       1       19X1
 3. A       2       19X2
 4. B       4       19X2
 5. B       2       19X1
 6. B       1       19X3
 7. C       1       19X4
 8. C       2       19X1

(I could not format it as a proper table here, so please disregard the numbered list on the left.)

The only difference is that I have several hundred thousand observations.

The order of the resulting data set is not important to me.

Using the dplyr library, you can do something like this:

library(dplyr)
df %>% group_by(ident) %>% arrange(desc(year)) %>% slice(1)

The output will be as follows:

Source: local data frame [3 x 4]
Groups: ident [3]

    X1. ident value  year
  (dbl) (chr) (int) (chr)
1     3     A     2  19X2
2     6     B     1  19X3
3     7     C     1  19X4

This assumes year is stored in a format where sorting in descending order goes from newest to oldest.

NOTE: the X1. column is a result of your input data above. I just read it in as is.
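
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It makes two assumptions that are not in the question: the years 1991 to 1994 are made-up numeric stand-ins for the 19X1 to 19X4 placeholders, and dplyr 1.0.0 or later is installed, so that slice_max() is available as an alternative to arrange() followed by slice(1).

```r
library(dplyr)

# Hypothetical example data; the years are made-up stand-ins for 19X1..19X4
df <- data.frame(
  ident = c("A", "A", "B", "B", "B", "C", "C"),
  value = c(1, 2, 4, 2, 1, 1, 2),
  year  = c(1991, 1992, 1992, 1991, 1993, 1994, 1991)
)

newest <- df %>%
  group_by(ident) %>%
  # Keep the row with the highest year in each group;
  # with_ties = FALSE guarantees exactly one row per ident
  slice_max(year, n = 1, with_ties = FALSE) %>%
  ungroup()

print(newest)
```

Setting with_ties = FALSE matters if an ident can have two rows with the same newest year; without it, both rows would be kept.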

Try

df <- do.call(rbind, lapply(split(df, df$ident), 
                            function(x) x[which.max(x$year), ]))
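
Since the question mentions several hundred thousand observations, a vectorised base R variant may also be worth trying, because it avoids building one small data frame per ident. This is only a sketch under assumptions: the data frame below is a made-up stand-in for the example in the question, and year is assumed to be numeric (which.max() above also expects numeric values).

```r
# Hypothetical example data; the years are made-up stand-ins for 19X1..19X4
df <- data.frame(
  ident = c("A", "A", "B", "B", "B", "C", "C"),
  value = c(1, 2, 4, 2, 1, 1, 2),
  year  = c(1991, 1992, 1992, 1991, 1993, 1994, 1991)
)

# Sort by ident, then by year descending, so each ident's newest row comes first
sorted <- df[order(df$ident, -df$year), ]

# duplicated() flags every repeat of an ident after its first occurrence,
# so negating it keeps only the newest row per ident
newest <- sorted[!duplicated(sorted$ident), ]

print(newest)
```

If two rows of the same ident share the newest year, this keeps whichever of them sorts first, so exactly one row per ident is returned.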
