
Removing Duplicates From a Dataframe in R

I am trying to clean up a data set of student results for processing, and I'm having trouble removing duplicates: I only want to look at "first attempts", but some students have taken the course multiple times. An example of the data, using one of the duplicated students, is:

        id     period                                           desc
632   1507       1101 90714 Research a contemporary biological issue
633   1507       1101         6317 Explain the process of speciation
634   1507       1101                  8931 Describe gene expression
14448 1507       1201                  8931 Describe gene expression
14449 1507       1201         6317 Explain the process of speciation
14450 1507       1201 90714 Research a contemporary biological issue
25884 1507       1301         6317 Explain the process of speciation
25885 1507       1301                  8931 Describe gene expression
25886 1507       1301 90714 Research a contemporary biological issue

The first two digits of period are the year they sat the paper. As can be seen, I want to keep the rows where id is 1507 and period is 1101. So far, an example of my code to get the values I want to trim out is:

unique.rows <- unique(df[c("id", "period")])
dups <- unique.rows[duplicated(unique.rows$id), ]

However, there are a couple of problems I then run into. This only works because the data is ordered by id and period, and that ordering isn't guaranteed in future. I also don't know how to take this list of duplicate entries and select the rows that are not in it: %in% doesn't seem to work on the two-column combination, and a loop with rbind runs out of memory.
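For reference, what I am trying to express is something along these lines (a rough sketch only; dups is the duplicate list built above, and it still depends on the ordering problem):

# sketch: paste id and period into a single composite key so %in% can match
# the pair, then keep the rows whose combination is not in the duplicate list
df_key   <- paste(df$id, df$period)
dups_key <- paste(dups$id, dups$period)
first_attempts <- df[!(df_key %in% dups_key), ]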

What's the best way to handle this?

I would probably use dplyr. Calling your data df:

library(dplyr)

result = df %>% group_by(id) %>%
    filter(period == min(period))
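As a quick check against the sample data from the question (a minimal sketch; the desc values are abbreviated here):

library(dplyr)

# minimal reconstruction of the sample rows (desc abbreviated)
df <- data.frame(
  id     = rep(1507, 9),
  period = rep(c(1101, 1201, 1301), each = 3),
  desc   = rep(c("90714 Research...", "6317 Explain...", "8931 Describe..."), times = 3)
)

df %>% group_by(id) %>% filter(period == min(period))
# keeps only the three rows where period is 1101, i.e. the first attempt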

If you prefer base R, I would pull the id/period combinations to keep into a separate data frame and then do an inner join with the original data:

# sort so the earliest period comes first within each id
id_pd = df[order(df$id, df$period), c("id", "period")]
# keep the first (earliest-period) combination for each id
id_pd = id_pd[!duplicated(id_pd$id), ]
# inner join back to the full data on id and period
result = merge(df, id_pd)
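An equivalent base R one-liner, for comparison (a sketch, not part of the original answer): ave() computes each id's minimum period, and the comparison keeps every row from that period.

# sketch: keep rows whose period equals the per-id minimum period
result = df[df$period == ave(df$period, df$id, FUN = min), ]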

Try this; it works for me with your data:

dd <- read.csv("a.csv", colClasses=c("numeric","numeric","character"), header=TRUE)
print(dd)
# sort by id and period, then drop repeated id/period combinations
dd <- dd[order(dd$id, dd$period), ]
dd <- dd[!duplicated(dd[, c("id","period")]), ]
print(dd)

Output:

    id period                                           desc
1 1507   1101 90714 Research a contemporary biological issue
4 1507   1201                  8931 Describe gene expression
7 1507   1301         6317 Explain the process of speciation
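Note that this keeps one row per id/period combination. If only the rows from each student's earliest period are wanted, as in the question, the same sort-then-duplicated() idea can be applied to the original data (a sketch, not part of the original answer):

# sketch: find each id's earliest period from the first row after sorting,
# then keep every original row belonging to that period
orig <- read.csv("a.csv", colClasses=c("numeric","numeric","character"), header=TRUE)
orig <- orig[order(orig$id, orig$period), ]
first <- orig[!duplicated(orig$id), c("id", "period")]
result <- merge(orig, first)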


 