
Find unique entries in otherwise identical rows

I am currently trying to find a way to identify unique column values in otherwise duplicate rows of a dataset.

My dataset has the following properties:

  • The dataset's columns comprise an identifier variable (ID) and a large number of response variables (x1 - xn).
  • Each row should represent one individual, meaning the values in the ID column should all be unique (not repeated).
  • Some rows are duplicated, with repeated entries in the ID column and seemingly identical response values (x1 - xn). However, the dataset is too large to get a full overview of all variables.

As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct() function. In my case, not all "duplicate" rows are removed by distinct(), which can only mean that not all entries are identical.

I want to find a way to identify which entries are unique in these otherwise duplicate rows.

Example:

library(dplyr)
library(janitor)

df <- data.frame(
    "ID" = rep(1:3, each = 2),
    "x1" = rep(4:6, each = 2),
    "x2" = c("a", "a", "b", "b", "c", "d"),
    "x3" = c(7, 10, 8, 8, 9, 11),
    "x4" = rep(letters[4:6], each = 2),
    "x5" = c("x", "p", "y", "y", "z", "q"),
    "x6" = rep(letters[7:9], each = 2)
)

# The dataframe with all entries
df

A data.frame: 6 × 7
ID  x1  x2  x3  x4  x5  x6
1   4   a   7   d   x   g
1   4   a   10  d   p   g
2   5   b   8   e   y   h
2   5   b   8   e   y   h
3   6   c   9   f   z   i
3   6   d   11  f   q   i


# The data frame
df %>% 
  # with exact duplicate rows removed
  distinct() %>%
  # and filtered to the rows whose ID value is duplicated
  janitor::get_dupes(ID)

ID  dupe_count  x1  x2  x3  x4  x5  x6
1   2           4   a   7   d   x   g
1   2           4   a   10  d   p   g
3   2           6   c   9   f   z   i
3   2           6   d   11  f   q   i

In the example above I demonstrate how dplyr::distinct() removes fully duplicated rows (ID = 2), but not rows that differ in some columns (rows where ID = 1 and 3, and columns x2, x3 and x5).

What I want is an overview of which columns are not duplicated for each ID:

df %>% 
  distinct() %>%
  janitor::get_dupes(ID) %>% 
  # Here I want a way to find columns with non-identical entries:
  find_nomatch()

ID x2 x3 x5
 1     7  x
 1    10  p
 3  c  9  z
 3  d 11  q
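
For reference, a rough sketch of what such a helper could look like, using tidyr::pivot_longer (find_nomatch() is just a hypothetical name; the sketch assumes the tidyverse and coerces all values to character so differently typed columns can share one long value column):

library(dplyr)
library(tidyr)

find_nomatch <- function(data) {
  data %>%
    # Reshape to long format: one row per (ID, variable, value)
    pivot_longer(-ID, names_to = "variable", values_to = "value",
                 values_transform = list(value = as.character)) %>%
    # Keep only the (ID, variable) groups whose values are not all identical
    group_by(ID, variable) %>%
    filter(n_distinct(value) > 1) %>%
    # Number the duplicates so they end up on separate rows when widened again
    mutate(row = row_number()) %>%
    ungroup() %>%
    pivot_wider(names_from = variable, values_from = value) %>%
    select(-row)
}

df %>%
  distinct() %>%
  janitor::get_dupes(ID) %>%
  find_nomatch()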

If you just want to keep the first instance of each identifier:

df <- data.frame(
    "ID" = rep(1:3, each = 2),
    "x1" = rep(4:6, each = 2),
    "x2" = rep(letters[1:3], each = 2),
    "x3" = c(7, 10, 8, 8, 9, 11),
    "x4" = rep(letters[4:6], each = 2)
)

df  %>% 
    distinct(ID, .keep_all = TRUE)

Output:

  ID x1 x2 x3 x4
1  1  4  a  7  d
2  2  5  b  8  e
3  3  6  c  9  f
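
If instead you wanted to keep the last occurrence of each ID rather than the first, a sketch of the same idea (assuming the existing row order is the one that matters):

df %>%
  group_by(ID) %>%
  # Keep only the last row within each ID group
  slice_tail(n = 1) %>%
  ungroup()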

I have been working on this issue for some time and I found a solution, though it took more steps than I would have thought necessary. I can only presume there is a more elegant solution out there. Anyway, this should work:

df <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)

# Make a vector of the unique ID values among the duplicated rows
l <- distinct(df, ID) %>% unlist()

# Apply the same steps to each duplicated ID
df <- lapply(
  l,
  function(x) {
    # Filter rows for the duplicated ID
    dplyr::filter(df, ID == x) %>%
      # Transpose the data frame (this converts it into a matrix)
      t() %>%
      # Convert back to a data frame
      as.data.frame() %>%
      # Keep only the rows (originally columns) whose entries are not all identical
      dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
      # Transpose back
      t() %>%
      # Convert back to a data frame
      as.data.frame()
  }
) %>%
  # Bind the data frames in the list together
  bind_rows() %>%
  # Finally, move the columns back into ascending order
  relocate(x2, .before = x3)

# Remove row names (not necessary)
row.names(df) <- NULL

df

A data.frame: 4 × 3
x2  x3  x5
NA  7   x
NA  10  p
c   9   z
d   11  q

Feel free to comment.

A data.table alternative where data is melted to long format, filtered and cast back to wide:

library(data.table)
setDT(df)

# Melt to long format: one row per (ID, variable, value) combination
d = melt(df, id.vars = "ID")

# Keep only the (ID, variable) groups in which all values are distinct,
# then cast back to wide format; rowid(ID, variable) keeps the duplicate
# rows of each ID on separate lines in the wide result
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
#    ID ID_1   x2 x3 x5
# 1:  1    1 <NA>  7  x
# 2:  1    2 <NA> 10  p
# 3:  3    1    c  9  z
# 4:  3    2    d 11  q

A bit simpler than yours, I think:

library(dplyr)
library(janitor)

df <- data.frame(
    "ID" = rep(1:3, each = 2),
    "x1" = rep(4:6, each = 2),
    "x2" = c("a", "a", "b", "b", "c", "d"),
    "x3" = c(7, 10, 8, 8, 9, 11),
    "x4" = rep(letters[4:6], each = 2),
    "x5" = c("x", "p", "y", "y", "z", "q"),
    "x6" = rep(letters[7:9], each = 2)
)

d <- df %>% 
  distinct() %>% 
  janitor::get_dupes(ID) 

d %>% 
  group_by(ID) %>% 
  # For each ID, check which row elements differ from those in the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y)) %>% 
  do.call(what = cbind) %>%  # Bind the results for all IDs
  apply(1, any) %>%          # TRUE for every column that differs within some ID
  c(TRUE, .) %>%             # Keep the ID column
  `[`(d, .)
#>   ID x2 x3 x5
#> 1  1  a  7  x
#> 2  1  a 10  p
#> 3  3  c  9  z
#> 4  3  d 11  q

Created on 2022-01-18 by the reprex package (v2.0.1)
