
How to find words in a string that have consecutive letters in R language

There is a problem that I do not know how to solve.

I need to write a function that returns all words from a string that contain consecutive repeated letters, along with the maximum number of consecutive repetitions of a letter within a word.

For example, "hello good home aboba" should become hello good after processing, and the maximum number of consecutive repetitions of a character in that string is 2.

The code I wrote tries to find the repeated characters and, based on this, extract the words from a separate array, but something doesn't work. Please help me solve the problem.

library(tidyverse)
library(stringr)   

text = 'tessst gfvdsvs bbbddsa daxz'
text = strsplit(text, ' ')
text

new = c()
new_2 = c()

for (i in text){
  
  new = str_extract_all(i, '([[:alpha:]])\\1+')
  if (new != character(0)){
    new_2 = c(new_2, i)
  }
}
new
new_2

Output:

Error in if (new != character(0)) { : argument is of length zero
> new
[[1]]
[1] "sss"

[[2]]
character(0)

[[3]]
[1] "bbb" "dd" 

[[4]]
character(0)

> new_2
NULL
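
The error seems to have two causes. strsplit() returns a list with a single element, so the for loop runs only once, with i holding all four words. str_extract_all() then returns a list with one entry per word, and comparing that list with character(0) yields a zero-length logical, which if() rejects. A minimal sketch of a vectorized fix, assuming the same stringr pattern as in the question (the names words and runs are introduced here for illustration):

```r
library(stringr)

text <- 'tessst gfvdsvs bbbddsa daxz'

# strsplit() returns a list; [[1]] takes the character vector of words
words <- strsplit(text, ' ')[[1]]

# One list element per word; character(0) where a word has no repeated run
runs <- str_extract_all(words, '([[:alpha:]])\\1+')

# lengths() counts the matches per word, so no if() on a list is needed
new_2 <- words[lengths(runs) > 0]
new_2
# [1] "tessst"  "bbbddsa"

# Longest run of a repeated letter across all words
max(nchar(unlist(runs)))
# [1] 3
```
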

text = "hello good home aboba"

paste0(
  grep("(.)\\1{1,}", 
       unlist(strsplit(text, " ")), 
       value = TRUE),
  collapse = " ")

[1] "hello good"
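
The snippet above returns only the words. The maximum number of repetitions could be obtained in base R along the same lines; this is a sketch, where the names words, repeated, runs and max_run are illustrative and perl = TRUE is an assumption made so that the backreference is handled by PCRE:

```r
text <- "hello good home aboba"
words <- unlist(strsplit(text, " "))

# Words containing at least one character repeated consecutively
repeated <- grep("(.)\\1+", words, value = TRUE, perl = TRUE)

# Every run of a repeated character found in those words
runs <- unlist(regmatches(repeated, gregexpr("(.)\\1+", repeated, perl = TRUE)))

# Length of the longest run, or 0 when no word qualifies
max_run <- if (length(runs) > 0) max(nchar(runs)) else 0L

paste(repeated, collapse = " ")
# [1] "hello good"
max_run
# [1] 2
```
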

You can use

new <- unlist(str_extract_all(text, "\\p{L}*(\\p{L})\\1+\\p{L}*"))
i <- max(nchar( unlist(str_extract_all(new, "(.)\\1+")) ))

With str_extract_all(text, "\\p{L}*(\\p{L})\\1+\\p{L}*") you extract all words containing at least two consecutive identical letters, and with max(nchar(unlist(str_extract_all(new, "(.)\\1+")))) you get the length of the longest run of a repeated letter.

See the R demo online:

library(stringr)
text <- 'tessst gfvdsvs bbbddsa daxz'
new <- unlist(str_extract_all(text, "\\p{L}*(\\p{L})\\1+\\p{L}*"))
# => [1] "tessst"  "bbbddsa"
i <- max(nchar( unlist(str_extract_all(new, "(.)\\1+")) ))
# => [1] 3
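
Since the question asks for a function, the two calls could be wrapped as follows. This is only a sketch: the name find_repeated_words and the returned list structure are hypothetical, and the inner pattern uses \p{L} instead of . so that only letters are counted.

```r
library(stringr)

# Hypothetical helper (not part of stringr): returns the words that contain
# consecutive repeated letters and the length of the longest such run
find_repeated_words <- function(text) {
  words <- unlist(str_extract_all(text, "\\p{L}*(\\p{L})\\1+\\p{L}*"))
  runs <- unlist(str_extract_all(words, "(\\p{L})\\1+"))
  list(
    words = words,
    # 0 when no word contains a repeated letter
    max_repeat = if (length(runs) > 0) max(nchar(runs)) else 0L
  )
}

res <- find_repeated_words("hello good home aboba")
res$words       # "hello" "good"
res$max_repeat  # 2
```

Returning 0 for the no-match case avoids the warning that max() gives on an empty vector.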

See this regex demo. Regex details:

  • \p{L}* - zero or more letters
  • (\p{L}) - a letter captured into Group 1
  • \1+ - one or more repetitions of the captured letter
  • \p{L}* - zero or more letters
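
To illustrate why the pattern captures \p{L} rather than any character, here is a small comparison. The sample strings and variable names are made up for this sketch:

```r
library(stringr)

samples <- c("hello", "a11b", "home", "wow!!")

# Only a repeated letter counts
letters_only <- str_detect(samples, "(\\p{L})\\1+")
letters_only
# [1]  TRUE FALSE FALSE FALSE

# Any repeated character counts, including digits and punctuation
any_char <- str_detect(samples, "(.)\\1+")
any_char
# [1]  TRUE  TRUE FALSE  TRUE
```
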
