
regex in R to extract value between two strings

I have lines that look like this

 01:04:43.064 [12439] <2> xyz
 01:04:43.067 [12439] <2> a lmn
 01:04:43.068 [12439] <4> j klm
 x_times_wait to <3000>
 01:04:43.068 [12439] <4> j klm
 enter_object <5000> main k

I want a regex that extracts only the values after the angle brackets, and only for lines that start with a timestamp.

This is what I have tried - assuming that these lines are in a data frame called nn

 library(stringr)  # provides str_split_fixed()
 split<-str_split_fixed(nn[,1], ">", 2)
 split2<-data.frame(split[,2])

The problem is that split2 gives

   xyz
   a lmn
   j klm

   j klm
   main k

How can I make sure that the empty line and main k are not returned?

\d+(?::\d+){2}\.\d+\s+\[[^\]]+\]\s+<\d+>(.+)$

Instead of splitting, match each line against this pattern and take capture group 1. See the demo:

https://regex101.com/r/vN3sH3/16

or

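Split on the lookbehind (?<=<\\d>) and keep the second piece, as in split2. Note that in base R functions a lookbehind requires perl = TRUE.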
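
The match-and-capture suggestion can be sketched in base R roughly as follows. This is only an illustration under a few assumptions: the sample vector x is copied from the question, the names pattern, m, matched and result are made up for the example, a leading ^ and a \\s* before the capture group have been added to the pattern so that the match starts at the beginning of the line and the captured value has no leading space, and regexec() is called with perl = TRUE, which needs R 3.3.0 or later.

```r
# Sample lines from the question.
x <- c("01:04:43.064 [12439] <2> xyz", "01:04:43.067 [12439] <2> a lmn",
       "01:04:43.068 [12439] <4> j klm", "x_times_wait to <3000>",
       "01:04:43.068 [12439] <4> j klm", "enter_object <5000> main k")

# Timestamp, [pid], <level>, then the value in capture group 1.
# The ^ anchor and the \\s* are additions to the pattern from the answer.
pattern <- "^\\d+(?::\\d+){2}\\.\\d+\\s+\\[[^\\]]+\\]\\s+<\\d+>\\s*(.+)$"

# regexec() reports the full match plus each capture group;
# lines that do not match yield character(0).
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the non-matching lines, then take element 2 (capture group 1).
matched <- Filter(length, m)
result <- vapply(matched, `[`, character(1), 2)
result
# [1] "xyz"   "a lmn" "j klm" "j klm"
```

Because lines without a timestamp never match, neither the empty value nor main k shows up in result.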

If a timestamp is defined as one or more digits, a :, one or more digits, another :, and then one or more digits, this approach may work for you.

x <- c("01:04:43.064 [12439] <2> xyz", "01:04:43.067 [12439] <2> a lmn",   
       "01:04:43.068 [12439] <4> j klm", "x_times_wait to <3000>",  
       "01:04:43.068 [12439] <4> j klm", "enter_object <5000> main k")

sub(".*> ", "", x[grepl("\\d+:\\d+:\\d+", x)])
# [1] "xyz"   "a lmn" "j klm" "j klm"

This first drops every element that contains no timestamp, then strips everything up to and including "> " from the remaining elements.

Here's an approach in base R:

The regex:

^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+

You can use it with gregexpr (perl = TRUE is required for \\K):

unlist(regmatches(vec, gregexpr("^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+", 
                                vec, perl = TRUE)))
# [1] "xyz"   "a lmn" "j klm" "j klm"

where vec is the vector containing your strings.

Using rex may make this type of task a little simpler.

string <- "01:04:43.064 [12439] <2> xyz
01:04:43.067 [12439] <2> a lmn
01:04:43.068 [12439] <4> j klm
x_times_wait to <3000>
01:04:43.068 [12439] <4> j klm
enter_object <5000> main k"

library(rex)

timestamp <- rex(n(digit, 2), ":", n(digit, 2), ":", n(digit, 2), ".", n(digit, 3))

re <- rex(timestamp, space,
          "[", digits, "]", space,
          "<", digits, ">", space,
          capture(anything))

re_matches(string, re, global = TRUE)

#> [[1]]
#>       1
#> 1   xyz
#> 2 a lmn
#> 3 j klm
#> 4 j klm
