R: how to remove repeated elements in a character vector

s <- "height(female), weight, BMI, and BMI."

In the string above, the word BMI appears twice. I would like the string to be:

"height (female), weight, and BMI."

I have tried the following to break the string into unique parts:

> unique(strsplit(s, " ")[[1]])
[1] "height(female)," "weight,"         "BMI,"            "and"             "BMI."

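But since "BMI," and "BMI." are not the same string, unique does not get rid of either of them.

Edit: how can I remove repeated phrases (e.g. "body mass index" instead of "BMI")? With the approach below, the repeated phrase is still there: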

s <- "height (female), weight, weight, body mass index, body mass index." 
s <- stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2") 
> stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")
[1] "height (female), weight, body mass index, body mass index."

It may help to first replace the unwanted duplicate using a regex like this:

(?<=, |^)\b([()\w\s]+),\s(.*?)((?: and)?(?=\1))

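Explanation

  • (?<=, |^)\\b front boundary (\\b alone should work, but it does not anchor correctly)
  • ([()\\w\\s]+), the block element
  • \\s(.*?)((?: and)? everything in between
  • (?=\\1)) the repeated element

Code sample: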

#install.packages("stringr")
library(stringr)
s <- "height(female), weight, BMI, and BMI."
stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2")

Output:

[1] "height(female), weight, and BMI."

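To separate the parenthesized part from the preceding word, apply another replacement to the result of the previous step: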

stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")

Output:

[1] "height (female), weight, and BMI."

Testing and putting things together:

s <- c("height(female), weight, BMI, and BMI."
       ,"height(female), weight, whatever it is, and whatever it is."
       ,"height(female), weight, age, height(female), and BMI."
       ,"weight, weight.")
s <- stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2")
stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")

Output:

[1] "height (female), weight, and BMI."      "height (female), weight, and whatever it is."
[3] "weight, age, height (female), and BMI." "weight."    
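
One limitation worth noting: str_replace() replaces only the first match in each string, which is why the phrase example in the question's edit still ends up with a repeated "body mass index". A minimal alternative sketch follows, assuming the input is a single non-empty comma-separated list that ends in a period. The helper name dedupe_items is hypothetical; it is not part of stringr or of the answer above, and it keeps the first occurrence of each item rather than the last.

```r
# Hypothetical helper (not from the original answer): de-duplicate the
# comma-separated items of one string, keeping the first occurrence of each.
# Assumes a single, non-empty string that ends in a period.
dedupe_items <- function(x) {
  # Drop the trailing period, then split on commas plus optional whitespace.
  items <- strsplit(sub("\\.$", "", x), ",\\s*")[[1]]
  # Remember whether the last item was introduced by "and".
  had_and <- grepl("^and\\s+", items[length(items)])
  # Strip a leading "and " so that "and BMI" compares equal to "BMI".
  items <- unique(sub("^and\\s+", "", items))
  n <- length(items)
  # Restore "and" before the last remaining item if the input had one.
  if (had_and && n > 1) items[n] <- paste("and", items[n])
  paste0(paste(items, collapse = ", "), ".")
}

dedupe_items("height (female), weight, weight, body mass index, body mass index.")
#> [1] "height (female), weight, body mass index."
dedupe_items("height(female), weight, BMI, and BMI.")
#> [1] "height(female), weight, and BMI."
```

The function handles one string at a time; for a character vector it could be applied with vapply(s, dedupe_items, character(1)).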

You can try this regex:

(\b\w+\b)[^\w\r\n]+(?=.*\1)

and replace each match with an empty string.

Input

height(female), weight, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, and BMI.
height(female), weight, BMI, age, and BMI.

Output

height(female), weight, and BMI.
height(female), weight, age, and BMI.

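Explanation:

  • (\\b\\w+\\b) - matches one or more word characters surrounded by word boundaries and captures them in group 1
  • [^\\w\\r\\n]+ - matches one or more characters that are neither word characters nor line breaks, so it matches a comma, a period, or a space
  • (?=.*\\1) - a positive lookahead asserting that whatever group 1 matched appears again later in the string; the replacement happens only in that case

Note: this keeps the last occurrence of a repeated word.

Alternatively, if the repeated items also contain spaces, you can use (\\b[^,]+)[, ]+(?=.*\\1)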
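
The answer demonstrates these patterns outside R, so here is a minimal sketch of how they might be applied in R. It assumes base R's gsub() with perl = TRUE, which supports lookaheads and backreferences. The variable names are illustrative only. One caveat: the lookahead is not anchored to word boundaries, so a word that is a substring of a later word (for example "male" before "female") would also be removed.

```r
# Sketch: apply both patterns with base R; perl = TRUE enables PCRE features
# (lookahead and backreferences inside the pattern).
word_pattern   <- "(\\b\\w+\\b)[^\\w\\r\\n]+(?=.*\\1)"
phrase_pattern <- "(\\b[^,]+)[, ]+(?=.*\\1)"

s1 <- "height(female), weight, BMI, age, and BMI."
s2 <- "height (female), weight, weight, body mass index, body mass index."

# gsub() replaces every match, so all earlier duplicates are dropped.
res_words   <- gsub(word_pattern, "", s1, perl = TRUE)
res_phrases <- gsub(phrase_pattern, "", s2, perl = TRUE)

print(res_words)
#> [1] "height(female), weight, age, and BMI."
print(res_phrases)
#> [1] "height (female), weight, body mass index."
```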

library(stringr)

s <- "height(female), weight, BMI, and BMI, and more even more BMI."
pieces <- unlist(str_split(s, "\\b"))
non_word <- !grepl("\\w", pieces)

# if you want to keep just the last instance of a duplicated word
non_duped <- !duplicated(pieces, fromLast = TRUE)
paste0(pieces[non_word | non_duped], collapse = "")
#> [1] "height(female), weight, ,  , and  even more BMI."

# if you want to keep just the first instance of a duplicated word
non_duped <- !duplicated(pieces, fromLast = FALSE)
paste0(pieces[non_word | non_duped], collapse = "")
#> [1] "height(female), weight, BMI, and ,  more even  ."
