
Regular expression to extract first word + first character of all following words

I am new to R and regular expressions, and I am trying to write a regex for manipulating strings in a data.frame column. My data look like this in R:

c1                       
Peter Parker            
Hawk & Dove             
J Jonah Jameson         
3JPX spo                
Bruce Wayne              

What I am trying to get is a second column, "c2", that consists of the following strings:

c2
PeterP
Hawk&D
JJJ
3JPXs
BruceW

Basically, I want the entire first word of the string (regardless of length) and the first alphanumeric element of every word after it. I have not been able to find any function or logic for this. Is it possible to do so with regex?

Thanks in advance

Here is a base R approach using gsub:

x <- c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne")
output <- gsub("\\s+(\\S)\\S*(?!\\S)", "\\1", x, perl=TRUE)
output

[1] "PeterP" "Hawk&D" "JJJ"    "3JPXs"  "BruceW"

The regex pattern \\s+(\\S)\\S*(?!\\S) matches one or more whitespace characters, then matches and captures the first character of the name component. It also consumes the remainder of the name component, so the whole match is replaced with only the captured first character.

In case the above is still unclear, here is how the regex pattern works, step by step:

\s+    match one or more space characters
(\S)   then match AND capture the first character of the name-word
\S*    match the remainder of the name-word
(?!\S) assert that what follows the end of the name-word is either a space
       or the end of the string
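
To see these steps in action, it can help to look at what the pattern actually matches before any replacement happens. The following is a small sketch using base R's gregexpr and regmatches on one of the sample strings; the variable names are only illustrative:

```r
x <- "J Jonah Jameson"

# Locate every match of the pattern (perl = TRUE is needed for the lookahead)
m <- gregexpr("\\s+(\\S)\\S*(?!\\S)", x, perl = TRUE)

# Extract the matched substrings: each one is the whitespace plus a whole word
matched <- regmatches(x, m)[[1]]
matched
# [1] " Jonah"   " Jameson"
```

Each match covers the leading whitespace and the full word, which is why replacing it with the captured first character removes both the space and the rest of the word.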

The replacement in the call to gsub is just \\1, which is the first and only capture group, corresponding to the first character of each name after the very first name.
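
Since the question asks for a new data.frame column, the result of the same call can be assigned directly to c2. A minimal sketch, assuming a data frame named df with a character column c1 as in the question:

```r
df <- data.frame(
  c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne"),
  stringsAsFactors = FALSE
)

# gsub is vectorised, so it can be applied to the whole column at once
df$c2 <- gsub("\\s+(\\S)\\S*(?!\\S)", "\\1", df$c1, perl = TRUE)
df$c2
# [1] "PeterP" "Hawk&D" "JJJ"    "3JPXs"  "BruceW"
```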

Though not strictly a regex solution, a different approach is to bring the data into long format by separating each word, keep the first word as it is, take only the first character of each remaining word, and paste them together.

library(dplyr)

df %>%
  group_by(row = row_number()) %>%
  tidyr::separate_rows(c1, sep = "\\s+") %>%
  summarise(c2 = paste0(first(c1) , paste0(substr(c1[-1], 1, 1), collapse = "")),
            c1 = paste(c1, collapse = " ")) %>%
  select(c1, c2, -row)

#   c1              c2    
#  <chr>           <chr> 
#1 Peter Parker    PeterP
#2 Hawk & Dove     Hawk&D
#3 J Jonah Jameson JJJ   
#4 3JPX spo        3JPXs 
#5 Bruce Wayne     BruceW
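
For comparison, the same split-and-paste idea can be sketched in base R without dplyr or tidyr. The helper name first_plus_initials is hypothetical, and the sketch assumes every string contains at least one word:

```r
x <- c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne")

# Hypothetical helper: keep the first word, then append the first
# character of every following word.
first_plus_initials <- function(s) {
  # Split on runs of whitespace after trimming the ends
  words <- strsplit(trimws(s), "\\s+")[[1]]
  paste0(words[1], paste0(substr(words[-1], 1, 1), collapse = ""))
}

c2 <- vapply(x, first_plus_initials, character(1), USE.NAMES = FALSE)
c2
# [1] "PeterP" "Hawk&D" "JJJ"    "3JPXs"  "BruceW"
```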

data

df <- structure(list(c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", 
"3JPX spo", "Bruce Wayne")), row.names = c(NA, -5L), class = "data.frame")

The development version of unglue features a multiple argument, which can be a function to apply to identically named matches (here we want to concatenate them with paste0()). In our case we want to match the full first word, then the first character of each space-separated sequence that follows, and we have either 1 or 2 such sequences after the first word:

# remotes::install_github("moodymudskipper/unglue")
library(unglue)
patterns <- c(
  "{c2} {c2=\\S}{=\\S*} {c2=\\S}{=\\S*}",
  "{c2} {c2=\\S}{=\\S*}")

unglue_data(df$c1, patterns, multiple = paste0)
#>       c2
#> 1 PeterP
#> 2 Hawk&D
#> 3    JJJ
#> 4  3JPXs
#> 5 BruceW  
