Remove duplicates in cells from specific columns in R?


I have a data frame whose cells contain repeated strings or numbers. I want to create a new df in which each cell of these columns keeps only its unique values. Any ideas would be highly appreciated.

Here's a regex solution (based on mock data, since no reproducible data was provided):

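library(stringr)
# Extract, per cell, every word that does not occur again later in that cell;
# each of the three columns becomes a list column of character vectors.
df[,1:3] <- lapply(df[,1:3], function(x) str_extract_all(x, "(\\b\\w+\\b)(?!.*\\1)"))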

The solution draws on a negative lookahead ( (?!...) ) and a backreference ( \\1 ): the pattern (\\b\\w+\\b)(?!.*\\1) makes str_extract_all extract an alphanumeric string only if it is not repeated later in the same string. Each value is therefore kept once, at its last occurrence, which effectively captures all unique values:

Result:

df
               Title   Length    Prediction
1             George 555, 666           111
2 Alice, Peter, Kate 123, 444 333, 777, 222
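
To see what the pattern does in isolation, here is a minimal sketch on single strings. The second string ("Kate,Katelyn") and the stricter pattern are assumptions added for illustration, not part of the original answer: because the backreference is not wrapped in word boundaries, a value that is a prefix of a later value is also treated as "repeated".

```r
library(stringr)

# Pattern from the answer: a word that is not followed, anywhere later
# in the string, by the same text.
pattern <- "(\\b\\w+\\b)(?!.*\\1)"

x <- "Kate,Alice,Kate,Peter,Kate"
res <- str_extract_all(x, pattern)[[1]]
print(res)  # "Alice" "Peter" "Kate": the last occurrence of each name is kept

# Hypothetical edge case: "Kate" is a prefix of "Katelyn", so the
# unanchored backreference finds "Kate" inside "Katelyn" and drops it.
y <- "Kate,Katelyn"
loose <- str_extract_all(y, pattern)[[1]]
print(loose)  # "Katelyn"

# Assumed variant: wrap the backreference in word boundaries so that
# only whole-word repeats count.
strict_pattern <- "\\b(\\w+)\\b(?!.*\\b\\1\\b)"
strict <- str_extract_all(y, strict_pattern)[[1]]
print(strict)  # "Kate" "Katelyn"
```

If the values can contain characters outside \\w (spaces, hyphens), neither pattern is a safe choice, and splitting on the separator is probably more robust.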

Data:

df <- data.frame(
  Title = c("George,George,George", "Kate,Alice,Kate,Peter,Kate"),
  Length = c("555,555,666", "123,123,444,123,444"), 
  Prediction = c("111,111,111", "222,333,222,777,222"), stringsAsFactors = FALSE)

You can do it like this, using the df created by ChrisRuehlemann:

library(tidyverse)
# Split every cell on commas, then keep the unique pieces of each cell
df %>% mutate(across(everything(), ~str_split(., ",")),
              across(everything(), ~map(., ~unique(.x))))

               Title   Length    Prediction
1             George 555, 666           111
2 Kate, Alice, Peter 123, 444 222, 333, 777

Or as a one-liner:

mutate(df, across(everything(), ~map(str_split(., ","), ~unique(.x))))

               Title   Length    Prediction
1             George 555, 666           111
2 Kate, Alice, Peter 123, 444 222, 333, 777
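
Both approaches above leave list columns in the data frame. If plain character columns are wanted instead, a base R sketch along the following lines may help. It is a swapped-in technique rather than either answerer's method: it assumes the separator is a bare comma, that no cell is NA, and the helper name dedupe_cell is hypothetical.

```r
# Mock data from the answer above.
df <- data.frame(
  Title = c("George,George,George", "Kate,Alice,Kate,Peter,Kate"),
  Length = c("555,555,666", "123,123,444,123,444"),
  Prediction = c("111,111,111", "222,333,222,777,222"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each cell on the separator, drop repeats
# (keeping first occurrences), and paste the pieces back together.
dedupe_cell <- function(x, sep = ",") {
  pieces <- strsplit(x, sep, fixed = TRUE)
  vapply(pieces, function(p) paste(unique(p), collapse = sep), character(1))
}

# Apply the helper to every column while keeping the data frame shape.
df_unique <- df
df_unique[] <- lapply(df, dedupe_cell)
print(df_unique)
```

Like unique() in the tidyverse answer, this keeps the first occurrence of each value, so the second Title becomes "Kate,Alice,Peter" rather than the "Alice, Peter, Kate" order produced by the regex.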
