How to identify partial duplicates of rows in R

Question

I would like to identify "partial" matches of rows in a dataframe. Specifically, I want to create a new column with a value of 1 if a particular row has a duplicate row somewhere else in the dataframe, based on a match between a subset of columns. An added complexity is that one of the columns is numeric, and I want rows to match if the absolute values match. Here is example data followed by an example of my desired output.

name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California")
num<-c("-258", "123", "42", "258")
date<-c("day 2", "day 15", "day 3","day 45")
(df<-as.data.frame(cbind(name,state,num, date)))
           name      state  num   date
1 Richard Nixon California -258  day 2
2  Bill Clinton    Indiana  123 day 15
3   George Bush    Florida   42  day 3
4 Richard Nixon California  258 day 45

What I'm hoping to get is the following dataframe:

           name      state  num   date newcol
1 Richard Nixon California -258  day 2 1
2  Bill Clinton    Indiana  123 day 15 0
3   George Bush    Florida   42  day 3 0
4 Richard Nixon California  258 day 45 1

Notice that rows 1 and 4 match on the name and state columns and their absolute values match in the num column, resulting in a 1 in the added newcol column for both of those rows, while the remaining rows have no such match and thus get a 0.

I tried the following, but to no avail:

df$num<-as.numeric(df$num)
which(duplicated(df[c('name', 'state',abs('num'))]),)

Error in abs("num") : non-numeric argument to mathematical function

Of course, that does not work because abs() is being applied to the string 'num' rather than to the column itself.

Answer 1

You can use:

df$absnum = abs(as.numeric(as.character(df$num)))
df$newcol = duplicated(df[,c('name','state', 'absnum')]) | 
  duplicated(df[,c('name','state', 'absnum')], fromLast = TRUE)

#            name      state  num   date absnum newcol
# 1 Richard Nixon California -258  day 2    258   TRUE
# 2  Bill Clinton    Indiana  123 day 15    123  FALSE
# 3   George Bush    Florida   42  day 3     42  FALSE
# 4 Richard Nixon California  258 day 45    258   TRUE
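
The reason for calling duplicated() twice may be easier to see on a plain vector. By default duplicated() flags only the second and later occurrences of a value, so the first occurrence is missed; with fromLast = TRUE the scan runs from the end and the last occurrence is missed instead. Combining the two with | flags every member of a duplicated group. A minimal sketch, where the vector x is made up purely for illustration:

```r
# Illustrative vector (not from the question): "a" appears three times
x <- c("a", "b", "a", "c", "a")

# Forward scan: the first "a" is NOT flagged
fwd <- duplicated(x)                    # FALSE FALSE  TRUE FALSE  TRUE

# Backward scan: the last "a" is NOT flagged
bwd <- duplicated(x, fromLast = TRUE)   #  TRUE FALSE  TRUE FALSE FALSE

# Either scan: every "a" is flagged
any_dup <- fwd | bwd                    #  TRUE FALSE  TRUE FALSE  TRUE
```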

If you really need newcol to be 1 or 0, you can convert it to integer using as.integer. But in most cases it is best to keep boolean flags as logical types.
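
For completeness, here is a self-contained sketch that produces the exact 1/0 output asked for in the question. It rebuilds the example data with data.frame() instead of as.data.frame(cbind(...)), stores the key columns in a helper variable named key (my own naming, not part of the answer above), and drops the temporary absnum column at the end. Treat it as one possible way to put the pieces together rather than the only one:

```r
# Rebuild the example data from the question
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon"),
  state = c("California", "Indiana", "Florida", "California"),
  num   = c("-258", "123", "42", "258"),
  date  = c("day 2", "day 15", "day 3", "day 45"),
  stringsAsFactors = FALSE
)

# Helper column holding the absolute value of num
df$absnum <- abs(as.numeric(df$num))

# Columns that define a "partial duplicate" (key is a hypothetical helper name)
key <- df[, c("name", "state", "absnum")]

# Flag every row that has a match elsewhere, then convert TRUE/FALSE to 1/0
df$newcol <- as.integer(duplicated(key) | duplicated(key, fromLast = TRUE))

# Remove the helper column so only the requested newcol remains
df$absnum <- NULL

df
```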
