I am new to R and regular expressions, and I am trying to write a regex for manipulating strings in a data.frame column. My data look like this in R:
c1
Peter Parker
Hawk & Dove
J Jonah Jameson
3JPX spo
Bruce Wayne
What I am trying to get is a second column, "c2", that consists of the following strings:
c2
PeterP
Hawk&D
JJJ
3JPXs
BruceW
Basically, I want the entire first word of the string (regardless of length) followed by the first character of every word after it. I have not been able to find any function or logic for this. Is it possible to do so with regex?
Thanks in Advance
Here is a base R approach using gsub:
x <- c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne")
output <- gsub("\\s+(\\S)\\S*(?!\\S)", "\\1", x, perl=TRUE)
output
[1] "PeterP" "Hawk&D" "JJJ" "3JPXs" "BruceW"
The regex pattern \\s+(\\S)\\S*(?!\\S) matches one or more whitespace characters, then matches and captures the first character of the following word. It also consumes the remainder of that word, so the whole match is replaced by only the captured first character.
In case the above is still unclear, here is how the regex pattern works, step by step:
\s+      match one or more whitespace characters
(\S)     then match AND capture the first character of the name-word
\S*      match the remainder of the name-word
(?!\S)   assert that what follows the end of the name-word is either whitespace
         or the end of the string
The replacement in the call to gsub is just \\1, which is the first and only capture group, corresponding to the first character of each word after the very first one.
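Since the question asks for a new column rather than a standalone vector, the same call can be applied directly to the data.frame column. This is a minimal sketch, assuming the data.frame is named df and the source column is c1, as in the question:

```r
# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne"),
  stringsAsFactors = FALSE
)

# Keep the first word intact; collapse every later word to its first character.
df$c2 <- gsub("\\s+(\\S)\\S*(?!\\S)", "\\1", df$c1, perl = TRUE)

df
```

Note that perl = TRUE is needed here because the pattern uses a lookahead, which R's default regex engine does not support.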
Though not strictly a regex solution, a different approach is to bring the data into long format by separating each word into its own row, keep the first word as it is, take only the first character of each remaining word, and paste them together.
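library(dplyr)
df %>%
  # tag each original row so the words can be regrouped after splitting
  group_by(row = row_number()) %>%
  # one row per word
  tidyr::separate_rows(c1, sep = "\\s+") %>%
  # first word in full + first character of every remaining word
  summarise(c2 = paste0(first(c1), paste0(substr(c1[-1], 1, 1), collapse = "")),
            c1 = paste(c1, collapse = " ")) %>%
  select(c1, c2, -row)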
#   c1              c2
#   <chr>           <chr>
# 1 Peter Parker    PeterP
# 2 Hawk & Dove     Hawk&D
# 3 J Jonah Jameson JJJ
# 4 3JPX spo        3JPXs
# 5 Bruce Wayne     BruceW
data
df <- structure(list(c1 = c("Peter Parker", "Hawk & Dove", "J Jonah Jameson",
"3JPX spo", "Bruce Wayne")), row.names = c(NA, -5L), class = "data.frame")
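The same split-and-paste idea can also be sketched in base R without the long-format round trip. The helper name abbreviate_words is hypothetical (not from any package), and the sketch assumes the input has no NA values:

```r
# Hypothetical helper: first word in full, then the first character
# of each following word. Assumes x contains no NA values.
abbreviate_words <- function(x) {
  words <- strsplit(trimws(x), "\\s+")
  vapply(words, function(w) {
    if (length(w) == 0) return("")   # empty input string
    paste0(w[1], paste(substr(w[-1], 1, 1), collapse = ""))
  }, character(1))
}

x <- c("Peter Parker", "Hawk & Dove", "J Jonah Jameson", "3JPX spo", "Bruce Wayne")
abbreviate_words(x)
```

With a data.frame like the one above, this would be used as df$c2 <- abbreviate_words(df$c1).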
The development version of unglue features a multiple argument, which can be a function to apply to identically named matches (here we'd want to concatenate them with paste0()). In our case we want to match the full first word, then the first character of all sequences separated by space, and we have either 1 or 2 of such sequences following the first word:
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
patterns <- c(
"{c2} {c2=\\S}{=\\S*} {c2=\\S}{=\\S*}",
"{c2} {c2=\\S}{=\\S*}")
unglue_data(df$c1, patterns, multiple = paste0)
#>       c2
#> 1 PeterP
#> 2 Hawk&D
#> 3    JJJ
#> 4  3JPXs
#> 5 BruceW