
How to remove duplicates in R

I have a very large data set that looks like the one below:

df <- data.frame(school=c("a", "a", "a", "b","b","c","c","c"), year=c(3,3,1,4,2,4,3,1), GPA=c(4,4,4,3,3,3,2,2))

school year GPA
  a    3   4
  a    3   4
  a    1   4
  b    4   3
  b    2   3
  c    4   3
  c    3   2
  c    1   2

and I want it to look like this:

school year GPA
 a    3   4
 a    3   4
 b    4   3
 c    4   3

So basically, for each school I want the student(s) in the top year, regardless of GPA.

I have tried:

new_df <- df[!duplicated(paste(df[,1],df[,2])),]

but this gives me the unique combinations of school and year, while the one below keeps only the first row for each school:

new_df2 <- df[!duplicated(df$school),]

Using the plyr library:

library(plyr)
> ddply(df,.(school),function(x){x[x$year==max(x$year),]})
  school year GPA
1      a    3   4
2      a    3   4
3      b    4   3
4      c    4   3

or in base R:

test<-lapply(split(df,df$school),function(x){x[x$year==max(x$year),]})
out<-do.call(rbind,test)
> out
    school year GPA
a.1      a    3   4
a.2      a    3   4
b        b    4   3
c        c    4   3

Explanation: split splits the data frame into a list of data frames, one per school.

dat<-split(df,df$school)

> dat
$a
  school year GPA
1      a    3   4
2      a    3   4
3      a    1   4

$b
  school year GPA
4      b    4   3
5      b    2   3

$c
  school year GPA
6      c    4   3
7      c    3   2
8      c    1   2

For each school we want the rows from the top year.

dum.fun<-function(x){x[x$year==max(x$year),]}

> dum.fun(dat$a)
  school year GPA
1      a    3   4
2      a    3   4
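
One edge case worth noting: if year contains missing values, max(x$year) is NA and the comparison returns rows full of NA. A minimal sketch of a more defensive helper follows; the name top_year and the sample group grp are made up for illustration and are not part of the original answer.

```r
# Hypothetical variant of dum.fun that ignores missing years.
top_year <- function(x) {
  # max(..., na.rm = TRUE) skips NA when finding the top year;
  # which() drops the NA comparisons instead of returning NA rows.
  x[which(x$year == max(x$year, na.rm = TRUE)), ]
}

# Made-up group with one missing year
grp <- data.frame(school = "a", year = c(3, NA, 1), GPA = c(4, 4, 4))
top_year(grp)
```

A group whose years are all missing makes max() warn and return -Inf, so this helper returns zero rows for it.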

lapply applies a function to each element of a list and returns a list:

> lapply(split(df,df$school),function(x){x[x$year==max(x$year),]})
$a
  school year GPA
1      a    3   4
2      a    3   4

$b
  school year GPA
4      b    4   3

$c
  school year GPA
6      c    4   3

This is what we want, but in list form. We need to bind the members of the list together. We do this with do.call, which calls rbind once with all the members of the list as its arguments.
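
The binding step can be sketched on its own, assuming the same df as in the question. The variable names top and out are only illustrative, and resetting the row names at the end is an optional extra that the answer itself does not do.

```r
df <- data.frame(school = c("a", "a", "a", "b", "b", "c", "c", "c"),
                 year   = c(3, 3, 1, 4, 2, 4, 3, 1),
                 GPA    = c(4, 4, 4, 3, 3, 3, 2, 2))

# One data frame per school, each holding only the top-year rows
top <- lapply(split(df, df$school), function(x) x[x$year == max(x$year), ])

# do.call passes the list elements as arguments, so this is equivalent to
# writing rbind(a = top$a, b = top$b, c = top$c) by hand
out <- do.call(rbind, top)

# Optional: replace row names such as "a.1" with plain 1, 2, 3, ...
rownames(out) <- NULL
out
```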

I'm a fan of the by function (see ?by) for this kind of thing. df is split into groups on the basis of df$school, and then the rows of each school where year equals max(year) are returned.

> by(df,df$school,function(x) x[x$year==max(x$year),])
df$school: a
  school year GPA
1      a    3   4
2      a    3   4
------------------------------------------------------------ 
df$school: b
  school year GPA
4      b    4   3
------------------------------------------------------------ 
df$school: c
  school year GPA
6      c    4   3

do.call(rbind, ...) just joins up the results for each school that are returned from by.

do.call(rbind,by(df,df$school,function(x) x[x$year==max(x$year),]))

    school year GPA
a.1      a    3   4
a.2      a    3   4
b        b    4   3
c        c    4   3
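
For comparison, the same filter can probably be written without splitting and re-binding at all, using base R's ave to compute each school's maximum year alongside the original rows. This is a sketch of an alternative technique, not something the answers above propose; keep and res are illustrative names.

```r
df <- data.frame(school = c("a", "a", "a", "b", "b", "c", "c", "c"),
                 year   = c(3, 3, 1, 4, 2, 4, 3, 1),
                 GPA    = c(4, 4, 4, 3, 3, 3, 2, 2))

# ave() returns, for every row, the max year within that row's school
keep <- df$year == ave(df$year, df$school, FUN = max)

# Logical indexing keeps the original row order and row names
res <- df[keep, ]
res
```

Because nothing is split or re-bound, the row names stay as in df rather than becoming a.1, a.2 and so on.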
