[英]Find unique entries in otherwise identical rows
I am currently trying to find a way to find unique column values in otherwise duplicate rows in a dataset.我目前正在尝试找到一种方法来查找数据集中其他重复行中的唯一列值。
My dataset has the following properties:我的数据集具有以下属性:
As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct()
function.如下代码所示,如果所有变量的行确实相同,则可以使用
dplyr::distinct()
function 删除重复的行。 In my case, not all "duplicate" rows are removed by distinct()
, which can only mean that not all entries are identical.就我而言,并非所有“重复”行都被
distinct()
删除,这只能意味着并非所有条目都是相同的。
I want to find a way to identify which entries are unique in these otherwise duplicate rows.我想找到一种方法来识别在这些否则重复的行中哪些条目是唯一的。
Example:例子:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
# The dataframe with all entries
df
A data.frame: 6 × 7
ID x1 x2 x3 x4 x5 x6
1 4 a 7 d x g
1 4 a 10 d p g
2 5 b 8 e y h
2 5 b 8 e y h
3 6 c 9 f z i
3 6 d 11 f q i
# The dataframe
df %>%
# with duplicates removed
distinct() %>%
# filtered for columns only containing duplicates in the ID column
janitor::get_dupes(ID)
ID dupe_count x1 x2 x3 x4 x5 x6
1 2 4 a 7 d x g
1 2 4 a 10 d p g
3 2 6 c 9 f z i
3 2 6 d 11 f q i
In the example above I demonstrate how dplyr::distinct()
will remove fully duplicate rows (ID = 2), but not rows that are different in some columns (rows where ID = 1 and 3, and columns x2, x3 and x5).在上面的示例中,我演示了
dplyr::distinct()
将如何删除完全重复的行(ID = 2),但不会删除某些列中不同的行(ID = 1 和 3 的行,以及 x2、x3 和 x5 列) .
What I want is an overview over which columns that are not duplicates for each value:我想要的是概述哪些列对于每个值不重复:
df %>%
distinct() %>%
janitor::get_dupes(ID) %>%
# Here I want a way to find columns with unidentical entries:
find_nomatch()
ID x2 x3 x5
1 7 x
1 10 p
3 c 9 z
3 d 11 q
If you just want to keep the first instance of each identifier:如果您只想保留每个标识符的第一个实例:
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = rep(letters[1:3], each = 2),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2)
)
df %>%
distinct(ID, .keep_all = TRUE)
Output: Output:
ID x1 x2 x3 x4
1 1 4 a 7 d
2 2 5 b 8 e
3 3 6 c 9 f
I have been working on this issue for some time and I found a solution, though it tooks more step than I would've though necessary.我已经在这个问题上工作了一段时间,我找到了一个解决方案,尽管它比我需要的步骤更多。 I can only presume there's a more elegant solution out there.
我只能假设那里有一个更优雅的解决方案。 Anyway, this should work:
无论如何,这应该工作:
df <- df %>%
distinct() %>%
janitor::get_dupes(ID)
# Make vector of unique values from the duplicated ID values
l <- distinct(df, ID) %>% unlist()
# Lapply on each ID
df <- lapply(
l,
function(x) {
# Filter rows for the duplicated ID
dplyr::filter(df, ID == x) %>%
# Transpose dataframe (converts it into a matrix)
t() %>%
# Convert back to data frame
as.data.frame() %>%
# Filter columns that are not identical
dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
# Transpose back
t() %>%
# Convert back to data frame
as.data.frame()
}
) %>%
# Bind the dataframes in the list together
bind_rows() %>%
# Finally the columns are moved back in ascending order
relocate(x2, .before = x3)
#Remove row names (not necessary)
row.names(df) <- NULL
df
A data.frame: 4 × 3
x2 x3 x5
NA 7 x
NA 10 p
c 9 z
d 11 q
Feel free to comment随意评论
A data.table
alternative where data is melted to long format, filtered and cast back to wide: data.table
替代方案,其中数据被融合为长格式,过滤并转换回宽格式:
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
A bit more simple than yours I think:我认为比你的简单一点:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
d <- df %>%
distinct() %>%
janitor::get_dupes(ID)
d %>%
group_by(ID) %>%
# Check for each id which row elements are different from the of the first
group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y))%>%
do.call(what = cbind) %>% # Bind results for all ids
apply(1, any) %>% # return true if there are differences anywhere
c(T, .) %>% # Keep id column
`[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-18 by the reprex package (v2.0.1)由代表 package (v2.0.1) 于 2022 年 1 月 18 日创建
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.