Removing duplicate observations in R with a restriction

Question

I have a dataset which contains duplicates of the ident variable. I need to select only one observation for each ident, and it needs to be the newest one, i.e. the resulting data should contain, for each ident, the observation whose 'year' is the highest in the initial data set.

I believe a general case would look like this:

1. ident   value   year
 2. A       1       19X1
 3. A       2       19X2
 4. B       4       19X2
 5. B       2       19X1
 6. B       1       19X3
 7. C       1       19X4
 8. C       2       19X1

(I could not lay it out as a proper table here, so please disregard the numbered list on the left.)

The difference is that my real data has several hundred thousand observations.

The order of the resulting data set is not important to me.

Answer 1

Using the dplyr library you can do something like this:

library(dplyr)
df %>% group_by(ident) %>% arrange(desc(year)) %>% slice(1)

The output will be as follows:

Source: local data frame [3 x 4]
Groups: ident [3]

    X1. ident value  year
  (dbl) (chr) (int) (chr)
1     3     A     2  19X2
2     6     B     1  19X3
3     7     C     1  19X4

This assumes year is stored in a format where sorting in descending order goes from newest to oldest.

NOTE: the X1. column is a result of your input data above. I just read it in as is.
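If your dplyr is version 1.0.0 or later, the same idea can probably be written without the explicit sort by using slice_max(). This is only a sketch: the numeric years below are made up to stand in for the 19X1 to 19X4 placeholders, and the name `latest` is just for illustration.

```r
library(dplyr)

# Hypothetical data: the 19X1..19X4 placeholders are replaced by numeric years
df <- data.frame(
  ident = c("A", "A", "B", "B", "B", "C", "C"),
  value = c(1, 2, 4, 2, 1, 1, 2),
  year  = c(1991, 1992, 1992, 1991, 1993, 1994, 1991),
  stringsAsFactors = FALSE
)

latest <- df %>%
  group_by(ident) %>%
  # keep the single row with the highest year in each group;
  # with_ties = FALSE guarantees one row even if the top year repeats
  slice_max(year, n = 1, with_ties = FALSE) %>%
  ungroup()
```

With with_ties = FALSE you get exactly one row per ident even when two rows share the highest year, which matches the "only one observation" requirement in the question.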

Answer 2

Try

df <- do.call(rbind, lapply(split(df, df$ident), 
                            function(x) x[which.max(x$year), ]))
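Since the question mentions several hundred thousand rows, a variant that avoids splitting the data into one piece per ident may be worth trying. The following is a minimal base R sketch, assuming year is stored as a number (the years here are invented stand-ins for the placeholders in the question, and `ord` and `latest` are illustrative names).

```r
# Hypothetical data: the 19X1..19X4 placeholders are replaced by numeric years
df <- data.frame(
  ident = c("A", "A", "B", "B", "B", "C", "C"),
  value = c(1, 2, 4, 2, 1, 1, 2),
  year  = c(1991, 1992, 1992, 1991, 1993, 1994, 1991),
  stringsAsFactors = FALSE
)

# Sort by ident, then by year descending, so the newest row of each ident comes first
ord <- df[order(df$ident, -df$year), ]

# duplicated() flags every occurrence of an ident after the first one;
# dropping those keeps only the newest row per ident
latest <- ord[!duplicated(ord$ident), ]
```

Like which.max(), the negation in order() needs a numeric year, so a character column would have to be converted first.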
