
Conditionally removing duplicates in R (20K observations)

I am currently working with a large data set, looking at duplicate water rights. Each right holder is assigned a RightID, but some were recorded twice for clerical purposes. However, some RightIDs are listed more than once and do have relevance to my end goal. One example: there are double entries when a metal tag number was assigned to a specific water right. To avoid double counting the critical information, I need to delete one of the two observations.

I have this written at the moment:

#Updated Metal Tag Number
for(i in 1:nrow(duplicate.rights)) {
  if( [i, "RightID"]==[i-1, "RightID"] & [i,"MetalTagNu"]=![i-1, "MetalTagNu"] ){
    remove(i)
  }
  print[i]
}

The original data frame is set up similarly:

RightID    Source        Use           MetalTagNu
1-0000     Wolf Creek    Irrigation    N/A
1-0000     Wolf Creek    Irrigation    12345
1-0001     Bear River    Domestic      N/A
1-0002     Beaver Stream Domestic      00001
1-0002     Beaver Stream Irrigation    00001

E.g., right holder 1-0002 must be kept because he is using his water right for two different purposes. However, right holder 1-0000 is an unnecessary repeat.

I need to eliminate the repeated entry for right holder 1-0000, but both entries for right holder 1-0002 are valuable to my end goal. I should also note that there can be up to 10 entries for a single RightID, but out of those 10 only 1 is an unnecessary duplicate. Also, the duplicate and the original entry will not be next to each other in the data set.

I am quite the novice, so please forgive my poor attempt above. I know I can use the lapply function to make this go faster and more efficiently. Any guidance there would be much appreciated.

So I would suggest the following:

1) You say that you want to keep some duplicates (metal tag number was assigned to a specific water right). I don't know exactly what this means, but I assume it is something like this: if the metal tag number = 1, then even if there are duplicates, you want to keep them. So I propose that you take these rows out of your data (let's call it data):

data_to_keep <- data[data$metal_tag_number == 1, ]
data_to_dedupe <- data[data$metal_tag_number != 1, ]    

2) Now that you have the two data frames, you can dedupe data_to_dedupe with no problem (dedupe_key here stands for whichever column identifies a duplicate):

deduped_data = data_to_dedupe[!duplicated(data_to_dedupe$dedupe_key), ]

3) Now you can merge the two data frames back together:

final_data <- rbind(data_to_keep, deduped_data)
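
As a rough sketch of the three steps above on the sample rows from the question: the data frame name rights, the rule for which rows to protect (a RightID that appears with more than one Use), and the use of RightID as the dedupe key are all assumptions made for illustration, not part of the original post.

```r
# Sample rows copied from the question; "N/A" is assumed to be read as NA
rights <- data.frame(
  RightID    = c("1-0000", "1-0000", "1-0001", "1-0002", "1-0002"),
  Source     = c("Wolf Creek", "Wolf Creek", "Bear River",
                 "Beaver Stream", "Beaver Stream"),
  Use        = c("Irrigation", "Irrigation", "Domestic",
                 "Domestic", "Irrigation"),
  MetalTagNu = c(NA, "12345", NA, "00001", "00001"),
  stringsAsFactors = FALSE
)

# Assumption: a RightID is "protected" when it appears with more than one Use
uses_per_right <- tapply(rights$Use, rights$RightID,
                         function(u) length(unique(u)))
protect <- as.vector(uses_per_right[rights$RightID] > 1)

# Step 1: split into rows to keep untouched and rows to dedupe
data_to_keep   <- rights[protect, ]
data_to_dedupe <- rights[!protect, ]

# Step 2: dedupe; RightID plays the role of dedupe_key here (assumption)
deduped_data <- data_to_dedupe[!duplicated(data_to_dedupe$RightID), ]

# Step 3: recombine and restore a stable order
final_data <- rbind(data_to_keep, deduped_data)
final_data <- final_data[order(final_data$RightID), ]
print(final_data)
```

Note that duplicated() keeps the first occurrence it sees, so in this sketch the untagged row for 1-0000 survives. If the tagged row is the one to keep, the rows would need to be sorted before step 2.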

If this is what you wanted, please upvote and mark the answer as correct. Thanks!

Create a new column, key, which is a combination of RightID and Use.

Assuming your data frame is called df,

df$key <- paste(df$RightID, df$Use)

Then remove duplicates using this command:

df1 <- df[!duplicated(df$key), ]

df1 will have no duplicate RightID/Use combinations.
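
A minimal sketch of the key approach on the sample rows from the question. The data frame name rights, the assumption that "N/A" was read in as NA, and the choice to prefer the row that carries a metal tag number are illustrative assumptions, not something the original post confirms.

```r
# Sample rows copied from the question; "N/A" is assumed to be read as NA
rights <- data.frame(
  RightID    = c("1-0000", "1-0000", "1-0001", "1-0002", "1-0002"),
  Source     = c("Wolf Creek", "Wolf Creek", "Bear River",
                 "Beaver Stream", "Beaver Stream"),
  Use        = c("Irrigation", "Irrigation", "Domestic",
                 "Domestic", "Irrigation"),
  MetalTagNu = c(NA, "12345", NA, "00001", "00001"),
  stringsAsFactors = FALSE
)

# Composite key: one value per RightID/Use combination
rights$key <- paste(rights$RightID, rights$Use)

# Assumption: when a key repeats, the row with a tag number is the one to keep.
# is.na() is FALSE for tagged rows, so ordering on it puts them first.
ordered_rights <- rights[order(rights$key, is.na(rights$MetalTagNu)), ]

# duplicated() flags every repeat after the first occurrence of a key
deduped <- ordered_rights[!duplicated(ordered_rights$key), ]
deduped$key <- NULL
print(deduped)
```

Because duplicated() is vectorised, no explicit loop or lapply call is needed, and the rows do not have to be adjacent in the original data.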
