How to identify mirrored duplicates of rows in R
In a previous SO post, How to identify partial duplicates of rows in R, I asked how to identify partially duplicated rows. Here's what I asked:
I would like to identify "partial" matches of rows in a dataframe. Specifically, I want to create a new column with a value of 1 if a particular row in a dataframe has a duplicate row somewhere else in the dataframe based on a match between a subset of columns. An added complexity is that one of the columns in the dataframe is numeric and I want to match if the absolute values match.
The issue now is that a row should be identified as partially duplicated ONLY if the numeric column holds the mirror opposite value (same absolute value, opposite sign), and not merely a matching absolute value. To make things clearer, here's the sample data from the previous post:
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California")
num<-c("-258", "123", "42", "258")
date<-c("day 2", "day 15", "day 3","day 45")
(df<-as.data.frame(cbind(name,state,num, date)))
name state num date
1 Richard Nixon California -258 day 2
2 Bill Clinton Indiana 123 day 15
3 George Bush Florida 42 day 3
4 Richard Nixon California 258 day 45
Here was the solution to my previous post:
df$absnum = abs(as.numeric(as.character(df$num)))
df$newcol = duplicated(df[,c('name','state', 'absnum')]) |
duplicated(df[,c('name','state', 'absnum')], fromLast = T)
# name state num date absnum newcol
# 1 Richard Nixon California -258 day 2 258 TRUE
# 2 Bill Clinton Indiana 123 day 15 123 FALSE
# 3 George Bush Florida 42 day 3 42 FALSE
# 4 Richard Nixon California 258 day 45 258 TRUE
Note that row 1 and row 4 are labeled TRUE under newcol, which is fine. And here is new sample data with the added complexity issue:
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon", "Bill Clinton")
state<-c("California", "Indiana", "Florida", "California", "Indiana")
num<-c("-258", "123", "42", "258", "123")
date<-c("day 2", "day 15", "day 3","day 45", "day 100")
(df<-as.data.frame(cbind(name,state,num, date)))
name state num date
1 Richard Nixon California -258 day 2
2 Bill Clinton Indiana 123 day 15
3 George Bush Florida 42 day 3
4 Richard Nixon California 258 day 45
5 Bill Clinton Indiana 123 day 100
Note that observations 2 and 5 are partial duplicates but not in the same way as 1 and 4. I need to apply TRUE only to those observations whose absolute values match BUT NOT their original values. So I want the result to return the following:
name state num date newcol
1 Richard Nixon California -258 day 2 TRUE
2 Bill Clinton Indiana 123 day 15 FALSE
3 George Bush Florida 42 day 3 FALSE
4 Richard Nixon California 258 day 45 TRUE
5 Bill Clinton Indiana 123 day 100 FALSE
The solution from the previous SO post would also apply TRUE to rows 2 and 5, when I would only like it applied to rows 1 and 4.
In base R, you can use the same duplicated test as in your linked question on 'partial' duplicates, but then exclude rows whose original values are the same:
df$numnum = as.numeric(as.character(df$num))
df$absnum = abs(df$numnum)
df$newcol = (duplicated(df[,c('name','state', 'absnum')]) |
duplicated(df[,c('name','state', 'absnum')], fromLast = T)) &
!(duplicated(df$numnum) | duplicated(df$numnum, fromLast = T))
# name state num date numnum absnum newcol
# 1 Richard Nixon California -258 day 2 -258 258 TRUE
# 2 Bill Clinton Indiana 123 day 15 123 123 FALSE
# 3 George Bush Florida 42 day 3 42 42 FALSE
# 4 Richard Nixon California 258 day 45 258 258 TRUE
# 5 Bill Clinton Indiana 123 day 100 123 123 FALSE
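One thing to watch in the code above: the exclusion step checks duplicated(df$numnum) across the whole column, so an unrelated row that happens to carry the same number could switch off a genuine mirrored pair. The sketch below is one possible variant, not the answerer's code, that scopes the exact-value check to the same name and state columns. The sixth row ("Jimmy Carter") and the helper dup_both are hypothetical and were added only to show the difference.

```r
# Sample data from the question plus one HYPOTHETICAL extra row (row 6)
# whose num equals the num in row 4 but belongs to a different person.
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton", "Jimmy Carter"),
  state = c("California", "Indiana", "Florida",
            "California", "Indiana", "Georgia"),
  num   = c("-258", "123", "42", "258", "123", "258"),
  date  = c("day 2", "day 15", "day 3", "day 45", "day 100", "day 7"),
  stringsAsFactors = FALSE
)

df$numnum <- as.numeric(df$num)
df$absnum <- abs(df$numnum)

# Hypothetical helper: TRUE for every member of a duplicated set,
# not only for the second and later occurrences.
dup_both <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

# Exclusion step as in the answer: exact values compared over the whole column.
unscoped <- dup_both(df[, c("name", "state", "absnum")]) & !dup_both(df$numnum)

# Variant: exact values compared only within the same name/state.
scoped <- dup_both(df[, c("name", "state", "absnum")]) &
  !dup_both(df[, c("name", "state", "numnum")])

df$newcol <- scoped
print(df)
```

With the extra row present, the unscoped check no longer flags row 4, while the scoped check still flags rows 1 and 4. On the original five rows both versions give the same result.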
One option would be to convert 'num' to numeric type first, create another column of absolute values ('num1'), group by 'name', 'state', and 'num1', and then use mutate to create the logical column by checking that the number of rows equals 2 (n() == 2) and that the number of distinct sign values of 'num' is greater than 1:
library(tidyverse)
df %>%
mutate(num = as.numeric(num), num1 = abs(num)) %>%
group_by(name, state, num1) %>%
mutate(newcol = n() == 2 & n_distinct(sign(num)) > 1) %>%
ungroup %>%
select(-num1)
# A tibble: 5 x 5
# name state num date newcol
# <chr> <chr> <dbl> <chr> <lgl>
#1 Richard Nixon California -258 day 2 TRUE
#2 Bill Clinton Indiana 123 day 15 FALSE
#3 George Bush Florida 42 day 3 FALSE
#4 Richard Nixon California 258 day 45 TRUE
#5 Bill Clinton Indiana 123 day 100 FALSE
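For readers without the tidyverse, the same grouping idea can be sketched in base R with ave. This is an illustration under assumptions rather than a translation of the code above: it keeps only the "more than one distinct sign within the group" test and drops the n() == 2 condition, so a group of three or more rows with mixed signs would also be flagged. The variable names key and n_signs are made up for this example.

```r
# Sample data from the question, built directly so 'num' stays character
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton"),
  state = c("California", "Indiana", "Florida", "California", "Indiana"),
  num   = c("-258", "123", "42", "258", "123"),
  date  = c("day 2", "day 15", "day 3", "day 45", "day 100"),
  stringsAsFactors = FALSE
)

num <- as.numeric(df$num)

# One grouping factor per (name, state, absolute value) combination;
# drop = TRUE discards combinations that never occur in the data.
key <- interaction(df$name, df$state, abs(num), drop = TRUE)

# For every row, the number of distinct signs found in its group
n_signs <- ave(sign(num), key, FUN = function(s) length(unique(s)))

# A row is a mirrored duplicate when its group contains both signs
df$newcol <- n_signs > 1
print(df)
```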
NOTE: cbind creates a matrix, and a matrix can hold only a single type. Therefore, if there is any character column or element, the whole matrix becomes character class. Wrapping it with data.frame propagates that, and the columns are either converted to factor (stringsAsFactors = TRUE, the default before R 4.0.0) or kept as character (if we change it to FALSE). Constructing the data frame directly keeps the columns as character:
df <- data.frame(name, state, num, date, stringsAsFactors = FALSE)
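The reason the column type matters can be shown with a small sketch. Calling as.numeric on a factor returns the internal level codes rather than the numbers shown in the labels, which is why the base R answer goes through as.character first. The vector names num_chr and num_fac are made up for this illustration.

```r
num_chr <- c("-258", "123", "42", "258")

# What 'num' looks like when the data frame is built with stringsAsFactors = TRUE
num_fac <- factor(num_chr)

# Integer level codes, NOT the original numbers
codes <- as.numeric(num_fac)
print(codes)

# Converting through character first recovers the intended values
values <- as.numeric(as.character(num_fac))
print(values)

# A character column converts directly
print(as.numeric(num_chr))
```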