
regex in R “eats” part of the string

I want to split a character string into two groups. The string's structure is pretty simple, yet I haven't been able to make it work.

txt <- "text12-01-2016"

The string is always some letters followed by a date, and the date always starts with a digit. I tried the following regex at https://regex101.com/ and it separates the string correctly:

([a-zA-Z]*)([0-9].*)
1. "text"
2. "12-01-2016"

But when I try in R it fails:

strsplit(a[1],split = "([a-zA-Z]*)([0-9]*)")
[[1]]
 [1] ""  " " ""  "." " " ""  " " ""  "-" ""  "-" "" 

And if I introduce double square brackets, it "eats" the last character of the first group and the first character of the second:

strsplit(txt,split = "([[a-zA-Z]]*)([[0-9]]*)")
[[1]]
[1] "tex"      "2-01-2016"

It doesn't matter whether I use perl=TRUE. The result is the same with stringi::stri_split, so the problem is in my regex.

What is the correct regex to use in this case?

The "problem" here is that your regex is written for matching, not for splitting: strsplit removes whatever the pattern matches and returns the pieces in between.
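To make the difference concrete, here is a minimal sketch in base R. It assumes only the txt value from the question; the variable names by_hyphen and by_content are made up for illustration.

```r
txt <- "text12-01-2016"

# strsplit() drops every substring the pattern matches and returns
# what lies between the matches. With a plain delimiter, only the
# hyphens are removed.
by_hyphen <- strsplit(txt, split = "-")[[1]]
print(by_hyphen)

# A pattern that describes the content itself makes strsplit()
# consume that content, so "text" and the digits do not survive.
by_content <- strsplit(txt, split = "([a-zA-Z]*)([0-9]*)")[[1]]
print(by_content)
```

The first call keeps the letters and digits because only "-" is treated as a delimiter. The second call treats the letters and digits themselves as delimiters, which is why they vanish from the result.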

You can use the following PCRE regex with strsplit:

strsplit(txt, split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)
[[1]]
[1] "text"       "12-01-2016"

The regex matches the zero-width position between a letter and a digit, so strsplit splits there without consuming any characters. The result is a list; call unlist on it if you need a plain character vector.
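As a rough sketch of the unlist step, and of how the same split behaves on more than one string: the vector x below is hypothetical (only its first element comes from the question), and the column names label and date are illustrative assumptions.

```r
# Hypothetical input; only the first element appears in the question.
x <- c("text12-01-2016", "abc03-02-2017")

# Split each element at the zero-width boundary between a letter and a digit.
parts <- strsplit(x, split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)

# For a single string, unlist() flattens the one-element list
# into a plain character vector.
single <- unlist(parts[1])

# For several strings, bind the pieces into a two-column character matrix.
m <- do.call(rbind, parts)
colnames(m) <- c("label", "date")
print(m)
```

Binding with rbind only works cleanly here because every input splits into exactly two pieces; strings that do not follow the letters-then-date structure would need to be handled separately.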

If you want to use your regex, use str_match from stringr:

> library(stringr)
> str_match(txt, "([a-zA-Z]*)([0-9].*)")
     [,1]             [,2]   [,3]        
[1,] "text12-01-2016" "text" "12-01-2016"
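If you would rather not add a dependency, base R can extract the same capture groups. This is a sketch using regexec and regmatches with the question's original pattern; the names label and date are illustrative.

```r
txt <- "text12-01-2016"

# regexec() records the position of the full match and of each capture
# group; regmatches() turns those positions into a character vector
# (one list element per input string).
m <- regmatches(txt, regexec("([a-zA-Z]*)([0-9].*)", txt))[[1]]

label <- m[2]  # first capture group: the letters
date  <- m[3]  # second capture group: the date part
print(m)
```

As with str_match, the first element is the full match and the capture groups follow in order.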

