How to identify mirrored duplicates of rows in R

In an earlier SO post, How to identify partial duplicates of rows in R, I asked how to identify partially duplicated rows. Here's what I asked:

I would like to identify "partial" matches of rows in a dataframe. Specifically, I want to create a new column with a value of 1 if a particular row has a duplicate somewhere else in the dataframe, based on a match between a subset of columns. An added complexity is that one of the columns is numeric, and I want rows to match if the absolute values match.

The issue is that I need a row to be identified as partially duplicated ONLY if one of the matching columns holds the mirror-opposite value (same magnitude, opposite sign), not merely the same absolute value. To make things clearer, here's the sample data from the previous post:

name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California")
num<-c("-258", "123", "42", "258")
date<-c("day 2", "day 15", "day 3","day 45")
(df<-as.data.frame(cbind(name,state,num, date)))
           name      state  num   date
1 Richard Nixon California -258  day 2
2  Bill Clinton    Indiana  123 day 15
3   George Bush    Florida   42  day 3
4 Richard Nixon California  258 day 45 

Here was the solution to my previous post:

df$absnum = abs(as.numeric(as.character(df$num)))
df$newcol = duplicated(df[,c('name','state', 'absnum')]) | 
  duplicated(df[,c('name','state', 'absnum')], fromLast = TRUE)

#            name      state  num   date absnum newcol
# 1 Richard Nixon California -258  day 2    258   TRUE
# 2  Bill Clinton    Indiana  123 day 15    123  FALSE
# 3   George Bush    Florida   42  day 3     42  FALSE
# 4 Richard Nixon California  258 day 45    258   TRUE

Note that rows 1 and 4 are labeled TRUE under newcol, which is fine. And here is new sample data with the added complexity:

name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon",
        "Bill Clinton")
state<-c("California", "Indiana", "Florida", "California", "Indiana")
num<-c("-258", "123", "42", "258", "123")
date<-c("day 2", "day 15", "day 3","day 45", "day 100")
(df<-as.data.frame(cbind(name,state,num, date)))

  name           state      num   date
1 Richard Nixon  California -258  day 2
2 Bill Clinton   Indiana    123   day 15
3 George Bush    Florida    42    day 3
4 Richard Nixon  California 258   day 45
5 Bill Clinton   Indiana    123   day 100

Note that observations 2 and 5 are partial duplicates, but not in the same way as 1 and 4. I need TRUE only for those observations whose absolute values match BUT whose original values do not. So I want the result to be:

  name           state      num   date    newcol
1 Richard Nixon  California -258  day 2   TRUE
2 Bill Clinton   Indiana    123   day 15  FALSE
3 George Bush    Florida    42    day 3   FALSE
4 Richard Nixon  California 258   day 45  TRUE
5 Bill Clinton   Indiana    123   day 100 FALSE

The solution from the previous SO post would also apply TRUE to rows 2 and 5, when I only want it applied to rows 1 and 4.
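To make the problem concrete, here is a minimal sketch that applies the earlier absolute-value test to the new sample data. The variable names key and old_flag are made up for this illustration, and the data frame is built directly with data.frame rather than through cbind.

```r
# New sample data from the question (num kept as character, as in the post).
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton"),
  state = c("California", "Indiana", "Florida", "California", "Indiana"),
  num   = c("-258", "123", "42", "258", "123"),
  date  = c("day 2", "day 15", "day 3", "day 45", "day 100"),
  stringsAsFactors = FALSE
)

# Earlier approach: match on name, state and the absolute value only.
df$absnum <- abs(as.numeric(as.character(df$num)))
key <- df[, c("name", "state", "absnum")]   # hypothetical helper variable
old_flag <- duplicated(key) | duplicated(key, fromLast = TRUE)

print(old_flag)
# [1]  TRUE  TRUE FALSE  TRUE  TRUE
```

Rows 2 and 5 come out TRUE because 123 and 123 share an absolute value, even though neither is the mirror of the other.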

In base R, you can use the same duplicated test as in your linked question on 'partial' duplicates, but then exclude rows whose values are exactly the same:

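df$numnum = as.numeric(as.character(df$num))
df$absnum = abs(df$numnum)
# TRUE when name, state and |num| match another row ...
df$newcol = (duplicated(df[,c('name','state', 'absnum')]) | 
  duplicated(df[,c('name','state', 'absnum')], fromLast = TRUE)) &
  # ... but not when name, state and the signed value are repeated too
  # (comparing numnum alone would also react to other people's rows)
  !(duplicated(df[,c('name','state', 'numnum')]) |
    duplicated(df[,c('name','state', 'numnum')], fromLast = TRUE))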
#            name      state  num    date numnum absnum newcol
# 1 Richard Nixon California -258   day 2   -258    258   TRUE
# 2  Bill Clinton    Indiana  123  day 15    123    123  FALSE
# 3   George Bush    Florida   42   day 3     42     42  FALSE
# 4 Richard Nixon California  258  day 45    258    258   TRUE
# 5  Bill Clinton    Indiana  123 day 100    123    123  FALSE
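A duplicated-based test answers "is this row repeated?", which is not quite the same as "does this row have a mirror?". As an alternative sketch (not the answerer's method), the check can be written directly: a row is flagged when some row with the same key columns carries the exact negative of its value. The function name flag_mirrored and its arguments are hypothetical, and the string-key comparison assumes the numbers print exactly (true for integers such as these).

```r
# Hypothetical helper: TRUE where another row has the same key columns
# and the exact negative of this row's value.
flag_mirrored <- function(df, keys, value) {
  v <- as.numeric(as.character(df[[value]]))
  # Build one string key per row from the key columns plus a number.
  make_key <- function(x) {
    do.call(paste, c(unname(as.list(df[keys])), list(x), sep = "\r"))
  }
  # A row is mirrored if the key built with -v occurs among the real keys.
  # Zero is excluded because it is its own negative; NA is never flagged.
  !is.na(v) & v != 0 & make_key(-v) %in% make_key(v)
}

df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton"),
  state = c("California", "Indiana", "Florida", "California", "Indiana"),
  num   = c("-258", "123", "42", "258", "123"),
  date  = c("day 2", "day 15", "day 3", "day 45", "day 100"),
  stringsAsFactors = FALSE
)

df$newcol <- flag_mirrored(df, keys = c("name", "state"), value = "num")
print(df$newcol)
# [1]  TRUE FALSE FALSE  TRUE FALSE
```

Unlike the duplicated version, this keeps flagging a mirrored pair even if one of its members also appears a second time with the same sign.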

One option would be to convert 'num' to numeric type first, create another column of absolute values ('num1'), group by 'name', 'state' and 'num1', and then use mutate to create the logical column by checking that the group has exactly two rows ( n() == 2 ) and that the number of distinct signs of 'num' is greater than 1:

library(tidyverse)
df %>%
    # as.character() first, so a factor column is not turned into level codes
    mutate(num = as.numeric(as.character(num)), num1 = abs(num)) %>% 
    group_by(name, state, num1) %>% 
    mutate(newcol = n() == 2 & n_distinct(sign(num)) > 1) %>%
    ungroup() %>% 
    select(-num1)
# A tibble: 5 x 5
#  name          state        num date    newcol 
#  <chr>         <chr>      <dbl> <chr>   <lgl>
#1 Richard Nixon California  -258 day 2   TRUE 
#2 Bill Clinton  Indiana      123 day 15  FALSE
#3 George Bush   Florida       42 day 3   FALSE
#4 Richard Nixon California   258 day 45  TRUE 
#5 Bill Clinton  Indiana      123 day 100 FALSE
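The n() == 2 condition assumes each mirrored pair shows up exactly once. If that assumption does not hold for your data, a per-row variant is possible. The following is only a sketch, not the answerer's code: it groups by 'name' and 'state' alone and asks, for each row, whether the negated value occurs in the same group. The object name res is made up for the example.

```r
library(dplyr)

df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton"),
  state = c("California", "Indiana", "Florida", "California", "Indiana"),
  num   = c("-258", "123", "42", "258", "123"),
  date  = c("day 2", "day 15", "day 3", "day 45", "day 100"),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(num = as.numeric(as.character(num))) %>%
  group_by(name, state) %>%
  # TRUE when the group contains the exact negative of this row's value;
  # zero is excluded because it is its own negative
  mutate(newcol = num != 0 & (-num) %in% num) %>%
  ungroup()

print(res$newcol)
# [1]  TRUE FALSE FALSE  TRUE FALSE
```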

NOTE: cbind creates a matrix, and a matrix can hold only a single type. Therefore, if any column or element is character, the whole matrix becomes character. Wrapping it with data.frame propagates that: the columns become factor ( stringsAsFactors = TRUE, the default before R 4.0.0) or character (if it is set to FALSE ).

data

df <- data.frame(name, state, num, date, stringsAsFactors = FALSE)
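The coercion described in the note is easy to see on a small example. This sketch uses a deliberately numeric num (in the question it is already a character vector), and the names m, via_cbind and direct are made up for the illustration.

```r
name <- c("Richard Nixon", "Bill Clinton")
num  <- c(-258, 123)              # numeric on purpose

m <- cbind(name, num)             # a matrix: everything is coerced to character
print(typeof(m))                  # "character"

via_cbind <- as.data.frame(m, stringsAsFactors = FALSE)
direct    <- data.frame(name, num, stringsAsFactors = FALSE)

print(class(via_cbind$num))       # "character": the numbers were lost in cbind
print(class(direct$num))          # "numeric": data.frame keeps each column's type
```

Building the data frame directly avoids the round trip through as.numeric(as.character(...)) when the source vectors are already numeric.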
