How to identify partial duplicates of rows in R
I would like to identify "partial" matches of rows in a dataframe. Specifically, I want to create a new column with a value of 1 if a particular row has a duplicate row somewhere else in the dataframe, based on a match between a subset of columns. An added complexity is that one of the columns is numeric, and I want rows to match if the absolute values match. Here is example data, followed by an example of my desired output.
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California")
num<-c("-258", "123", "42", "258")
date<-c("day 2", "day 15", "day 3","day 45")
(df<-as.data.frame(cbind(name,state,num, date)))
           name      state  num   date
1 Richard Nixon California -258  day 2
2  Bill Clinton    Indiana  123 day 15
3   George Bush    Florida   42  day 3
4 Richard Nixon California  258 day 45
What I'm hoping to acquire is the following dataframe:
           name      state  num   date newcol
1 Richard Nixon California -258  day 2      1
2  Bill Clinton    Indiana  123 day 15      0
3   George Bush    Florida   42  day 3      0
4 Richard Nixon California  258 day 45      1
Notice that rows 1 and 4 match on the name and state columns and their absolute values match in the num column, resulting in a 1 in the added newcol column for both of those rows, while the remaining rows have no such match and are therefore valued at 0.
I tried the following, but to no avail:
df$num<-as.numeric(df$num)
which(duplicated(df[c('name', 'state',abs('num'))]),)
Error in abs("num") : non-numeric argument to mathematical function
Of course that does not work because of the abs(): it is applied to the string 'num' rather than to the values in the num column.
You can use:
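# Absolute value of num (as.character() first, in case num is a factor)
df$absnum = abs(as.numeric(as.character(df$num)))
# Flag a row if it duplicates an earlier row OR a later row
df$newcol = duplicated(df[,c('name','state', 'absnum')]) |
  duplicated(df[,c('name','state', 'absnum')], fromLast = TRUE)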
#            name      state  num   date absnum newcol
# 1 Richard Nixon California -258  day 2    258   TRUE
# 2  Bill Clinton    Indiana  123 day 15    123  FALSE
# 3   George Bush    Florida   42  day 3     42  FALSE
# 4 Richard Nixon California  258 day 45    258   TRUE
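The reason for combining two duplicated() calls may not be obvious: on its own, duplicated() only marks the later copies of a repeated value, so the first occurrence would be left unflagged. A minimal sketch on a toy vector (x is just an illustrative name, not part of the original data):

```r
# A small vector with one repeated value.
x <- c("a", "b", "c", "a")

forward  <- duplicated(x)                   # flags only the later copy
backward <- duplicated(x, fromLast = TRUE)  # flags only the earlier copy
either   <- forward | backward              # flags every copy

print(forward)   # FALSE FALSE FALSE  TRUE
print(backward)  #  TRUE FALSE FALSE FALSE
print(either)    #  TRUE FALSE FALSE  TRUE
```

The same logic carries over to data frames, where duplicated() compares whole rows of the selected columns.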
If you really need newcol to be 1 or 0, then you can convert it to integer using as.integer. But in most cases it is best to keep boolean flags as logical types.
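For completeness, here is one way the integer version could look, as a self-contained sketch. It rebuilds the example data with data.frame() and keeps the matching columns in a separate helper data frame (key and is_dup are names chosen here for illustration), so no extra absnum column is added to df:

```r
# Rebuild the example data from the question (num stored as text).
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon"),
  state = c("California", "Indiana", "Florida", "California"),
  num   = c("-258", "123", "42", "258"),
  date  = c("day 2", "day 15", "day 3", "day 45"),
  stringsAsFactors = FALSE
)

# Columns used for matching; abs() is applied to the numeric values of num.
key <- data.frame(
  name   = df$name,
  state  = df$state,
  absnum = abs(as.numeric(df$num))
)

# TRUE if the row matches an earlier or a later row on the key columns.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# as.integer() maps TRUE/FALSE to 1/0.
df$newcol <- as.integer(is_dup)
print(df$newcol)  # 1 0 0 1
```

Keeping is_dup as a logical vector and converting only at the end means the flag can still be used directly for subsetting, for example df[is_dup, ].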