Removing Duplicates From a Dataframe in R

I am trying to clean up a data set of student results for processing, and I'm having trouble completely removing duplicates: I only want to look at "first attempts", but some students have taken the course multiple times. An example of the data for one of the duplicated students is:

        id     period                                           desc
632   1507       1101 90714 Research a contemporary biological issue
633   1507       1101         6317 Explain the process of speciation
634   1507       1101                  8931 Describe gene expression
14448 1507       1201                  8931 Describe gene expression
14449 1507       1201         6317 Explain the process of speciation
14450 1507       1201 90714 Research a contemporary biological issue
25884 1507       1301         6317 Explain the process of speciation
25885 1507       1301                  8931 Describe gene expression
25886 1507       1301 90714 Research a contemporary biological issue

The first two digits of period are the year in which the student sat the paper. As can be seen, I want to keep the rows where id is 1507 and period is 1101. So far, my code to get the values I want to trim is:

unique.rows <- unique(df[c("id", "period")])
dups <- (unique.rows[duplicated(unique.rows$id),])

However, I am then running into a couple of problems. This only works because the data is ordered by id and period, and that isn't guaranteed in future. I also don't know how to take this list of duplicate entries and select the rows that are not in it: %in% doesn't seem to work with it, and a loop with rbind runs out of memory.

What's the best way to handle this?

I would probably use dplyr. Calling your data df:

library(dplyr)
# For each id, keep only the rows from that student's earliest period
result = df %>% group_by(id) %>%
    filter(period == min(period))

If you prefer base R, I would pull the id/period combinations to keep into a separate data frame and then do an inner join with the original data:

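# Sort the id/period pairs so the earliest period comes first within each id
id_pd = df[order(df$id, df$period), c("id", "period")]
# Keep only the first (earliest) pair for each id
id_pd = id_pd[!duplicated(id_pd$id), ]
# Inner join on the shared columns id and period
result = merge(df, id_pd)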
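
As a possible alternative to the sort-and-merge approach, here is a minimal base R sketch that uses ave() to compute each student's earliest period and then filters on it. The data frame is rebuilt from the question's example with desc shortened to its code number; the second student (id 2001) is made up purely for illustration, and the rows are shuffled on purpose to show that the result does not depend on row order.

```r
# Sample data: id 1507 comes from the question; id 2001 is a hypothetical
# second student added only to show the per-id behaviour.
df <- data.frame(
  id     = c(rep(1507, 9), 2001, 2001),
  period = c(rep(c(1101, 1201, 1301), each = 3), 1301, 1201),
  desc   = c("90714", "6317", "8931",
             "8931", "6317", "90714",
             "6317", "8931", "90714",
             "6317", "6317"),
  stringsAsFactors = FALSE
)

# Shuffle the rows so nothing below can rely on the original ordering.
df <- df[c(9, 2, 11, 7, 4, 1, 10, 6, 3, 8, 5), ]

# ave() returns, for every row, the minimum period within that row's id group.
first_period <- ave(df$period, df$id, FUN = min)

# Keep only the rows that belong to each student's earliest period.
result <- df[df$period == first_period, ]
print(result)
```

This avoids both the sort and the join, at the cost of assuming that period compares correctly as a number (or as a fixed-width string).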

Try this; with your data it keeps the first row of each id/period combination:

dd <- read.csv("a.csv", colClasses=c("numeric","numeric","character"), header=TRUE)
print (dd)
dd <- dd[order(dd$id, dd$period), ]
dd <- dd[!duplicated(dd[, c("id","period")]), ]
print (dd)  

Output:

    id period                                           desc
1 1507   1101 90714 Research a contemporary biological issue
4 1507   1201                  8931 Describe gene expression
7 1507   1301         6317 Explain the process of speciation
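
On the question's point about %in%: it works on vectors rather than on whole data-frame rows, which is probably why it did not behave as expected with the dups data frame. One way around this, sketched below under the assumption that id and period never contain the separator character, is to collapse each id/period pair into a single string key and match on that. The data here is a small hypothetical sample in the same shape as the question's (desc omitted); key_df, key_dups and first.attempts are names introduced only for this sketch.

```r
# Hypothetical sample in the shape of the question's data, in arbitrary order.
df <- data.frame(
  id     = c(1507, 1507, 1507, 1507, 1507, 1507),
  period = c(1301, 1101, 1201, 1101, 1301, 1201)
)

# Unique id/period pairs, sorted so the earliest period comes first per id.
# Sorting here removes the dependence on the input row order.
unique.rows <- unique(df[c("id", "period")])
unique.rows <- unique.rows[order(unique.rows$id, unique.rows$period), ]

# Every pair after the first one for an id is a later attempt.
dups <- unique.rows[duplicated(unique.rows$id), ]

# Collapse each pair into one string so that %in% can compare whole pairs.
key_df   <- paste(df$id, df$period, sep = "_")
key_dups <- paste(dups$id, dups$period, sep = "_")

# Keep the rows whose key is not among the later attempts.
first.attempts <- df[!(key_df %in% key_dups), ]
print(first.attempts)
```

Because the filter is a single vectorised comparison, it does not need a loop with rbind.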


 