
Regex using rebus in R splitting issue?

I have the following string pattern:

Name_session_id:Owner:UUID, but sometimes it is just Name:Owner:UUID.

For example:

John_1:David:enfl43erl34r345

or

John:David:enfl43erl34r345

I want to use stringr and rebus to be able to build a dataframe that looks like this:

Name   Session   Owner   UUID
John   1         David   enfl43erl34r345
John   NA        David   enfl43erl34r345

Please advise how to do this. Here is the pattern I have so far:

capture(one_or_more(WRD)) %R% 
  optional("_") %R% 
  capture(optional(DGT)) %R% 
  ":" %R% 
  capture(one_or_more(WRD)) %R% 
  ":" %R% 
  capture(one_or_more(WRD))

The problem is the first one_or_more(WRD): WRD (\\w) also matches _, so the greedy \\w+ grabs the whole chunk of letters, digits and underscores up to the first colon. The following _ and \\d? patterns are optional, so they match nothing and the session id ends up inside Group 1.
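To see the effect, here is a minimal sketch using stringr::str_match(). The regex string is hand-written to approximate what the rebus expression from the question generates (an assumption, the exact text rebus emits may differ slightly), and x is just the sample string from the question.

```r
library(stringr)

x <- "John_1:David:enfl43erl34r345"

# Hand-written approximation of the original rebus pattern
m <- str_match(x, "(\\w+)_?(\\d?):(\\w+):(\\w+)")

m[, 2]  # Group 1 (Name): "John_1", the greedy \\w+ swallowed "_1"
m[, 3]  # Group 2 (Session): "", nothing was left for \\d? to match
```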

Replace the first one_or_more(WRD) with one_or_more(ALNUM) to only capture 1+ letters or digits into Group 1:

capture(one_or_more(ALNUM)) %R% 
  optional("_") %R% 
  capture(optional(DGT)) %R% 
  ":" %R% 
  capture(one_or_more(WRD)) %R% 
  ":" %R% 
  capture(one_or_more(WRD))

Or, make it lazy with lazy(one_or_more(WRD)):

capture(lazy(one_or_more(WRD))) %R% 
  optional("_") %R% 
  capture(optional(DGT)) %R% 
  ":" %R% 
  capture(one_or_more(WRD)) %R% 
  ":" %R% 
  capture(one_or_more(WRD))
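Both quick fixes still have rough edges, which the following sketch tries to illustrate for the lazy variant. The regex string is a hand-written approximation of the lazy rebus pattern, and the multi-digit input "John_12:David:abc" is a made-up example, not from the question.

```r
library(stringr)

# Hand-written approximation of the lazy pattern above
lazy_pattern <- "(\\w+?)_?(\\d?):(\\w+):(\\w+)"

# No session id: Group 2 still takes part in the match, so it is "" rather than NA
no_session <- str_match("John:David:enfl43erl34r345", lazy_pattern)
no_session[, 3]

# Hypothetical two-digit session id: \\d? can hold only one digit,
# so the lazy \\w+? is forced to expand over "_1"
two_digits <- str_match("John_12:David:abc", lazy_pattern)
two_digits[, 2]  # "John_1"
two_digits[, 3]  # "2"
```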

However, I believe you should use

capture(one_or_more(ALNUM)) %R% 
  optional(group("_" %R% capture(one_or_more(DGT)))) %R% 
  ":" %R% 
  capture(one_or_more(WRD)) %R% 
  ":" %R% 
  capture(one_or_more(WRD))

It will create a regex like ([[:alnum:]]+)(?:_([\\d]+))?:([\\w]+):([\\w]+). That is, instead of using _ as an optional char followed by an optional DGT, you wrap these two consecutive patterns in a single optional group and make both of them obligatory inside it. The session group then matches one or more digits, and only when the underscore is actually present.
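To get from that pattern to the data frame asked for in the question, one option is stringr::str_match(), which returns one column per capture group. This is only a sketch: the regex is pasted as a plain string rather than built with rebus, and the names x, pattern and result are illustrative.

```r
library(stringr)

x <- c("John_1:David:enfl43erl34r345", "John:David:enfl43erl34r345")

# The regex quoted above, written out as a plain string
pattern <- "([[:alnum:]]+)(?:_([\\d]+))?:([\\w]+):([\\w]+)"

# Column 1 is the full match; columns 2 to 5 are the capture groups
m <- str_match(x, pattern)

result <- data.frame(
  Name    = m[, 2],
  Session = m[, 3],  # NA when the optional "_digits" group did not take part
  Owner   = m[, 4],
  UUID    = m[, 5],
  stringsAsFactors = FALSE
)
result
```

Because the whole "_digits" part is one optional group, a missing session id comes back as NA instead of an empty string.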

With a few lookarounds, you can rely solely on stringr::str_extract():

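library(stringr)

# Sample input from the question
data <- c("John_1:David:enfl43erl34r345", "John:David:enfl43erl34r345")

data.frame(
  Name = str_extract(data, "^[^:_]+"),             # everything before the first "_" or ":"
  Session = str_extract(data, "(?<=_).*?(?=:)"),   # text between "_" and the next ":", NA if no "_"
  Owner = str_extract(data, "(?<=:).*(?=:)"),      # text between the two colons
  UUID = str_extract(data, "[^:]*$"),              # everything after the last ":"
  stringsAsFactors = FALSE
)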

  Name Session Owner            UUID
1 John       1 David enfl43erl34r345
2 John    <NA> David enfl43erl34r345

This does not use rebus, but here is a straightforward approach in base R:

data:

df1 <- data.frame(
    strings = c("John_1:David:enfl43erl34r345", "John:David:enfl43erl34r345"),
    stringsAsFactors = FALSE
)

code:

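fun1 <- function(x) {
    # Split at "_" only when it follows the name and precedes a digit;
    # \\K keeps the name itself out of the matched separator
    ans <- strsplit(x, "^[^:]+\\K_(?=\\d)", perl = TRUE)
    # Split the remaining pieces on ":"
    ans <- lapply(ans, strsplit, ":")
    ans <- unlist(ans)
    # No session id present: insert NA after the name
    if (length(ans) == 3) { ans <- append(ans, NA, 1) }
    return(ans)
}

result <- as.data.frame(t(apply(df1, 1, fun1)), stringsAsFactors = FALSE)
names(result) <- c("Name", "Session", "Owner", "UUID")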

result:

#  Name Session Owner            UUID
#1 John       1 David enfl43erl34r345
#2 John    <NA> David enfl43erl34r345
