regex in R to extract value between two strings

Question

I have lines that look like this:

 01:04:43.064 [12439] <2> xyz
 01:04:43.067 [12439] <2> a lmn
 01:04:43.068 [12439] <4> j klm
 x_times_wait to <3000>
 01:04:43.068 [12439] <4> j klm
 enter_object <5000> main k

I want a regex that extracts only the values after the angle brackets, and only for lines that start with a timestamp.

This is what I have tried, assuming that these lines are in a data frame called nn:

 library(stringr)  # provides str_split_fixed
 split <- str_split_fixed(nn[, 1], ">", 2)
 split2 <- data.frame(split[, 2])

The problem is that split2 gives:

   xyz
   a lmn
   j klm

   j klm
   main k

How can I make sure that the empty line and main k are not returned?

Answer 1

\d+(?::\d+){2}\.\d+\s+\[[^\]]+\]\s+<\d+>(.+)$

Instead of splitting, try matching and grab capture group 1. See the demo:

https://regex101.com/r/vN3sH3/16

or

Split on the lookbehind (?<=<\\d>) instead of on ">" and take the second piece as split2.
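
The answer gives the pattern in PCRE form only. A minimal sketch of the match-and-capture idea in base R might look like this, assuming the lines are in a character vector x (sample data copied from the question) and that R is 3.3.0 or later, where regexec accepts perl = TRUE. The backslashes are doubled for R string literals, and trimws drops the space that the group captures after the closing bracket.

```r
x <- c("01:04:43.064 [12439] <2> xyz",
       "01:04:43.067 [12439] <2> a lmn",
       "01:04:43.068 [12439] <4> j klm",
       "x_times_wait to <3000>",
       "01:04:43.068 [12439] <4> j klm",
       "enter_object <5000> main k")

# Same pattern as above, with backslashes doubled for an R string
pattern <- "\\d+(?::\\d+){2}\\.\\d+\\s+\\[[^\\]]+\\]\\s+<\\d+>(.+)$"

# regexec returns the full match plus each capture group per element;
# elements that do not match come back as character(0)
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Keep only matching lines; element 2 of each result is capture group 1
values <- vapply(m[lengths(m) > 0], function(g) trimws(g[2]), character(1))
values
# [1] "xyz"   "a lmn" "j klm" "j klm"
```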

Answer 2

If a timestamp is defined as one or more digits, a :, one or more digits, another :, and then one or more digits, then perhaps this method would work for you.

x <- c("01:04:43.064 [12439] <2> xyz", "01:04:43.067 [12439] <2> a lmn",   
       "01:04:43.068 [12439] <4> j klm", "x_times_wait to <3000>",  
       "01:04:43.068 [12439] <4> j klm", "enter_object <5000> main k")

sub(".*> ", "", x[grepl("\\d+:\\d+:\\d+", x)])
# [1] "xyz"   "a lmn" "j klm" "j klm"

This drops all the non-timestamp elements first, then strips everything up to and including the last "> " from the remaining elements.
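
One possible tightening, offered as a sketch rather than a required change: anchor the timestamp test at the start of the line, and stop the removal at the first > so that a value which itself contains "> " is kept whole. The last two input lines below are hypothetical and do not appear in the question; they are only there to show the difference.

```r
x <- c("01:04:43.064 [12439] <2> xyz",
       "x_times_wait to <3000>",
       "01:04:43.068 [12439] <4> a > b",   # hypothetical: value contains "> "
       "retry at 01:04:43 <7> ignored")    # hypothetical: timestamp not at line start

# TRUE only for lines that begin with hh:mm:ss.mmm
is_ts <- grepl("^\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", x)

# [^>]* cannot run past the first ">", so only the prefix is removed
vals <- sub("^[^>]*>\\s*", "", x[is_ts])
vals
# [1] "xyz"   "a > b"
```

With the greedy ".*> " from the answer, the third line would come back as "b", and the fourth line would pass the unanchored grepl test.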

Answer 3

Here's an approach in base R:

The regex:

^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+

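You can use it with gregexpr and regmatches. \\K discards everything matched so far, so only the value is returned; it is a PCRE feature, hence perl = TRUE: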

unlist(regmatches(vec, gregexpr("^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+", 
                                vec, perl = TRUE)))
# [1] "xyz"   "a lmn" "j klm" "j klm"

where vec is the vector containing your strings.
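
If the result has to stay aligned with the original rows, for example to add it as a column of the data frame nn from the question, a variant with regexpr can fill NA for the lines that do not match. This is a sketch under that assumption; vec holds the sample lines from the question and out is a name introduced here for illustration.

```r
vec <- c("01:04:43.064 [12439] <2> xyz",
         "01:04:43.067 [12439] <2> a lmn",
         "01:04:43.068 [12439] <4> j klm",
         "x_times_wait to <3000>",
         "01:04:43.068 [12439] <4> j klm",
         "enter_object <5000> main k")

pattern <- "^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+"

# regexpr gives one result per element, with -1 where nothing matched
m <- regexpr(pattern, vec, perl = TRUE)

# Start with NA everywhere, then fill in the matched values
out <- rep(NA_character_, length(vec))
out[m != -1] <- regmatches(vec, m)
out
# [1] "xyz"   "a lmn" "j klm" NA      "j klm" NA
```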

Answer 4

Using rex may make this type of task a little simpler.

string <- "01:04:43.064 [12439] <2> xyz
01:04:43.067 [12439] <2> a lmn
01:04:43.068 [12439] <4> j klm
x_times_wait to <3000>
01:04:43.068 [12439] <4> j klm
enter_object <5000> main k"

library(rex)

timestamp <- rex(n(digit, 2), ":", n(digit, 2), ":", n(digit, 2), ".", n(digit, 3))

re <- rex(timestamp, space,
          "[", digits, "]", space,
          "<", digits, ">", space,
          capture(anything))

re_matches(string, re, global = TRUE)

#> [[1]]
#>       1
#> 1   xyz
#> 2 a lmn
#> 3 j klm
#> 4 j klm

4 answers

solution1
3 2014-12-18 18:21:23

solution2
2 2014-12-18 18:31:43

solution3
0 2014-12-18 19:07:33

solution4
0 2014-12-19 14:49:40
