
Regular expression to extract first word + first character of all following words

I am new to R and regular expressions, and I am trying to write a regex to manipulate strings in a data.frame column. My data look like this in R:

c1                       
Peter Parker            
Hawk & Dove             
J Jonah Jameson         
3JPX spo                
Bruce Wayne              

What I am trying to get is a second column, "c2", that consists of the following strings:

c2
PeterP
Hawk&D
JJJ
3JPXs
BruceW

Basically I want the entire first word of the string (regardless of length) and the first alphanumeric element of every word after. I have not been able to find any function or logic for this. Is it possible to do so with regex?

Thanks in Advance

Here is a base R approach using gsub:

x <- c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne")
output <- gsub("\\s+(\\S)\\S*(?!\\S)", "\\1", x, perl=TRUE)
output

[1] "PeterP" "Hawk&D" "JJJ"    "3JPXs"  "BruceW"

The regex pattern \\s+(\\S)\\S*(?!\\S) matches one or more whitespace characters, then matches and captures the first character of the word that follows. It also consumes the remainder of that word, so the whole match is replaced by only the captured first character.

In case the above is still unclear, here is how the regex pattern works, step by step:

\s+    match one or more whitespace characters
(\S)   then match AND capture the first character of the name-word
\S*    match the remainder of the name-word
(?!\S) assert that what follows the end of the name-word is either whitespace
       or the end of the string
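To see the steps above in action, it can help to list exactly which substrings the pattern matches before any replacement happens. Here is a minimal sketch using base R's gregexpr() and regmatches(); the variable names pattern and matches are only illustrative:

```r
x <- c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne")
pattern <- "\\s+(\\S)\\S*(?!\\S)"

# Locate every match in each string; perl = TRUE is required for the lookahead
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

matches[[2]]  # " &"     " Dove"
matches[[3]]  # " Jonah" " Jameson"
```

Each match starts at the whitespace before a word and runs to the end of that word, which is why the first word of every string is left untouched.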

The replacement in the call to gsub is just \\1, which refers to the first and only capture group: the first character of each word after the first one.
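Since the question is about a data.frame column, the same call can be used to create c2 directly. A short sketch, assuming the data frame is called df and the column c1 as in the question:

```r
# Example data, matching the question
df <- data.frame(
  c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne"),
  stringsAsFactors = FALSE
)

# Keep the first word; reduce each later word to its first character
df$c2 <- gsub("\\s+(\\S)\\S*(?!\\S)", "\\1", df$c1, perl = TRUE)

df$c2
# [1] "PeterP" "Hawk&D" "JJJ"    "3JPXs"  "BruceW"
```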

Although this is not strictly a regex solution, a different approach is to bring the data into long format by separating each word, keep the first word as it is, take only the first character of each remaining word, and paste them together.

library(dplyr)

df %>%
  group_by(row = row_number()) %>%
  tidyr::separate_rows(c1, sep = "\\s+") %>%
  summarise(c2 = paste0(first(c1) , paste0(substr(c1[-1], 1, 1), collapse = "")),
            c1 = paste(c1, collapse = " ")) %>%
  select(c1, c2, -row)

#   c1              c2    
#  <chr>           <chr> 
#1 Peter Parker    PeterP
#2 Hawk & Dove     Hawk&D
#3 J Jonah Jameson JJJ   
#4 3JPX spo        3JPXs 
#5 Bruce Wayne     BruceW

data

df <- structure(list(c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", 
"3JPX spo", "Bruce Wayne")), row.names = c(NA, -5L), class = "data.frame")
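The same split-and-paste idea can also be sketched in base R without reshaping the data. This is only an illustration; the helper name abbreviate_words is made up for this example, and it assumes the input contains no NA or empty strings:

```r
# Hypothetical helper: first word in full, then the first character of each later word
abbreviate_words <- function(x) {
  words <- strsplit(trimws(x), "\\s+")   # split each string on runs of whitespace
  vapply(words, function(w) {
    paste0(w[1], paste0(substr(w[-1], 1, 1), collapse = ""))
  }, character(1))
}

df <- data.frame(
  c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne"),
  stringsAsFactors = FALSE
)
df$c2 <- abbreviate_words(df$c1)
df$c2
# [1] "PeterP" "Hawk&D" "JJJ"    "3JPXs"  "BruceW"
```

A single-word string comes back unchanged, because there are no later words to abbreviate.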

The development version of unglue features a multiple argument, which can be a function to apply to identically named matches (here we want to concatenate them with paste0()). In our case we want to match the full first word, then the first character of each subsequent space-separated sequence, and there are either 1 or 2 such sequences following the first word:

# remotes::install_github("moodymudskipper/unglue")
library(unglue)
patterns <- c(
  "{c2} {c2=\\S}{=\\S*} {c2=\\S}{=\\S*}",
  "{c2} {c2=\\S}{=\\S*}")

unglue_data(df$c1, patterns, multiple = paste0)
#>       c2
#> 1 PeterP
#> 2 Hawk&D
#> 3    JJJ
#> 4  3JPXs
#> 5 BruceW  
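Because the two patterns above only cover one or two words after the first, it may be worth checking how many words each string contains before relying on them. A quick base R sketch (this is not part of unglue, and n_words is just an illustrative name):

```r
df <- data.frame(
  c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne"),
  stringsAsFactors = FALSE
)

# Count whitespace-separated words in each string
n_words <- lengths(strsplit(trimws(df$c1), "\\s+"))
n_words
# [1] 2 3 3 2 2

# TRUE when every string has 2 or 3 words, i.e. the cases the two patterns describe
all(n_words %in% 2:3)
```

If this check were FALSE, the strings with other word counts would presumably need an extra pattern or one of the other approaches.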
