
Extracting a string between two commas with gsub in R

I am trying to extract a string between two commas with gsub. If I have the following

xz<- "1620 Honeylocust Drive, 60210 IL, USA"

and I want to extract everything between the two commas (60210 IL), is it possible to use gsub?

I have tried

gsub(".*,","",xz)

The result is USA. How can I do it?

We can match either of two things and replace them with an empty string (""): from the start of the string (^), zero or more characters that are not a comma ([^,]*), followed by a comma and any trailing whitespace (\\s*); or (|) a comma followed by zero or more non-comma characters ([^,]*) running to the end of the string ($).

gsub("^[^,]*,\\s*|,[^,]*$", "", xz)
#[1] "60210 IL"
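
To see what each half of the alternation does, the same result can be reached in two separate sub calls. This is only an illustrative breakdown; the names step1 and step2 are made up for this sketch and are not part of the answer above.

```r
xz <- "1620 Honeylocust Drive, 60210 IL, USA"

# First alternative: drop everything up to the first comma, plus trailing spaces
step1 <- sub("^[^,]*,\\s*", "", xz)
step1
#[1] "60210 IL, USA"

# Second alternative: drop the last comma and everything after it
step2 <- sub(",[^,]*$", "", step1)
step2
#[1] "60210 IL"
```

gsub applies both alternatives in a single pass, which is why one call is enough.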

Or another option is using sub and capture as a group

sub("^[^,]+,\\s+([^,]+).*", "\\1", xz)
#[1] "60210 IL"

Or another option is regexpr/regmatches

regmatches(xz, regexpr("(?<=,\\s)[^,]*(?=,)", xz, perl = TRUE))
#[1] "60210 IL"

Or with str_extract from stringr

library(stringr)
str_extract(xz, "(?<=,\\s)[^,]*(?=,)")
#[1] "60210 IL"

Update

With the new string,

xz1 <- "1620, Honeylocust Drive, 60210 IL, USA"
sub(".*,\\s+([0-9]+[^,]+).*", "\\1", xz1)
#[1] "60210 IL"
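
The extra comma after the house number is why the earlier patterns need changing: they are tied to the position of the first and last commas, not to the content of the field. A quick check, using the hypothetical names first_try and second_try:

```r
xz1 <- "1620, Honeylocust Drive, 60210 IL, USA"

# The gsub pattern only trims the first and last fields
first_try <- gsub("^[^,]*,\\s*|,[^,]*$", "", xz1)
first_try
#[1] "Honeylocust Drive, 60210 IL"

# The capture-group pattern always returns the second field
second_try <- sub("^[^,]+,\\s+([^,]+).*", "\\1", xz1)
second_try
#[1] "Honeylocust Drive"
```

Requiring the captured field to start with digits, as in the update, ties the match to the content rather than to the comma count.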

You could also do this using strsplit and grep (here I did it in 2 lines for readability):

xz1 <- "1620, Honeylocust Drive, 60210 IL, USA"
a1 <- strsplit(xz1, "[ ]*,[ ]*")[[1]]
grep("^[0-9]+[ ]+[A-Z]+", a1, value=TRUE)
#[1] "60210 IL"

This does not use gsub, and for the present case it is no better, but it may be easier to adapt to other situations.
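
As one possible adaptation, if the position of the wanted field is known, the split can be wrapped in a small helper. nth_field is a hypothetical name for this sketch, and it assumes the fields never contain commas themselves.

```r
# Hypothetical helper: return the n-th comma-separated field, trimmed
nth_field <- function(x, n) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]  # split on literal commas
  trimws(parts[n])                              # strip surrounding whitespace
}

nth_field("1620 Honeylocust Drive, 60210 IL, USA", 2)
#[1] "60210 IL"
nth_field("1620, Honeylocust Drive, 60210 IL, USA", 3)
#[1] "60210 IL"
```

If n is larger than the number of fields, the helper returns NA rather than raising an error.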
