Remove duplicates in cells from specific columns in R?
I have a data frame whose cells contain repeated characters or numbers. I want to create a new df which only contains the unique values in each cell of these columns. Below is a visual of what I am trying to achieve. Any ideas would be highly appreciated.
Here's a regex solution (based on mock data, in the absence of reproducible data):
library(stringr)
df[,1:3] <- lapply(df[,1:3], function(x) str_extract_all(x, "(\\b\\w+\\b)(?!.*\\1)"))
The solution draws on negative lookahead ( (?!...) ) and a backreference ( \\1 ): the pattern (\\b\\w+\\b)(?!.*\\1) is passed to str_extract_all to extract alphanumeric strings unless they are repeated later in the string, which effectively captures all unique values. Because an occurrence is kept only when no copy follows it, the values come out in order of their last appearance.
Result:
df
               Title   Length    Prediction
1             George 555, 666           111
2 Alice, Peter, Kate 123, 444 333, 777, 222
Data:
df <- data.frame(
  Title = c("George,George,George", "Kate,Alice,Kate,Peter,Kate"),
  Length = c("555,555,666", "123,123,444,123,444"),
  Prediction = c("111,111,111", "222,333,222,777,222"),
  stringsAsFactors = FALSE)
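One caveat on the pattern above: the backreference \\1 inside the lookahead is not bounded by \\b, so a value that happens to be a prefix of a later value would be treated as a repeat and dropped. The sketch below illustrates this with a made-up string and a stricter variant of the pattern. The stricter pattern, the helper extract_all, and the input are assumptions added for illustration, and base R with perl = TRUE is used instead of stringr only so the snippet runs without extra packages.

```r
# Hypothetical input: "Kate" is a prefix of "Katelyn" but a different value.
x <- "Kate,Katelyn"

# Pattern from the answer: the lookahead finds "Kate" inside "Katelyn".
loose <- "(\\b\\w+\\b)(?!.*\\1)"
# Stricter variant (an assumption, not part of the original answer):
# a later copy only counts as a repeat if it is a whole word.
strict <- "(\\b\\w+\\b)(?!.*\\b\\1\\b)"

# Hypothetical helper: return all matches of `pattern` in one string.
extract_all <- function(s, pattern) {
  regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
}

loose_out  <- extract_all(x, loose)   # "Katelyn" only; "Kate" is lost
strict_out <- extract_all(x, strict)  # "Kate" "Katelyn"
```

On the mock data the two patterns agree, since no value there is a prefix of another.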
You can do it like this, using the df created by ChrisRuehlemann:
library(tidyverse)
df %>%
  mutate(
    across(everything(), ~ str_split(., ",")),  # split each cell on commas
    across(everything(), ~ map(., unique))      # keep the unique values per cell
  )
               Title   Length    Prediction
1             George 555, 666           111
2 Kate, Alice, Peter 123, 444 222, 333, 777
Or as a one-liner:
mutate(df, across(everything(), ~map(str_split(., ","), ~unique(.x))))
               Title   Length    Prediction
1             George 555, 666           111
2 Kate, Alice, Peter 123, 444 222, 333, 777
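Note that both answers leave list columns in the data frame. If plain character columns are wanted instead, a base R sketch along the following lines might do. The helper dedupe_cell and the selection vector cols are names introduced here for illustration, the data is the mock df from the first answer, and missing values are not handled.

```r
df <- data.frame(
  Title = c("George,George,George", "Kate,Alice,Kate,Peter,Kate"),
  Length = c("555,555,666", "123,123,444,123,444"),
  Prediction = c("111,111,111", "222,333,222,777,222"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one cell on commas, drop repeats, paste back.
dedupe_cell <- function(cell) {
  parts <- strsplit(cell, ",", fixed = TRUE)[[1]]
  paste(unique(parts), collapse = ", ")
}

cols <- c("Title", "Length", "Prediction")  # columns to clean (assumed)

new_df <- df
new_df[cols] <- lapply(df[cols], function(col) {
  # One deduplicated string per cell, so each column stays character.
  vapply(col, dedupe_cell, character(1), USE.NAMES = FALSE)
})
```

Since unique() keeps the first occurrence of each value, the order matches the tidyverse output above (Kate, Alice, Peter) rather than the regex output.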