R regex: split a string by combination of \\n [A-z] & [:punct:]

Question

I have a dataframe with character strings that look like this:

bla bla.\n14:39:51 info: pyku bla .\n14:39:51 info: \n14:39:51 info: \n14:39:57 Sam: <span>pyk pyk</span>\n14:43:15 on and on \n14:43:59 you get an idea

I want to split each string into separate rows wherever a \\n is followed by a (number):(number):(number) sequence. I tried

stringr::separate_rows(df3$Transcript[1], Transcript , sep = "\\n")

and different combinations of it with [A-z] and [:punct:], to no avail. What would be the most straightforward way of doing this?

Thanks

Answer 1

You want to split the strings at a line break that is followed by a timestamp. You may use the base R strsplit function with a PCRE regex (perl=TRUE) based on a positive lookahead:

strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)

Pattern details

\\R+ - 1 or more line break sequences (such as \\n, \\r or \\r\\n)
(?=\\d{2}:\\d{2}:\\d{2}) - followed by 2 digits, :, 2 digits, : and again 2 digits. Since (?=...) is a positive lookahead (a zero-width assertion that does not add the matched characters to the match value), the text it matches is not removed from the results.
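To make the effect of the lookahead concrete, here is a minimal sketch comparing a pattern that consumes the timestamp with the lookahead version. The sample string and the variable names consumed and kept are made up for illustration and are not from the question.

```r
# Illustrative input (an assumption, not the asker's data).
s <- "first line\n14:39:51 second\n14:40:02 third"

# Consuming pattern: the timestamp is part of the match, so strsplit drops it.
consumed <- strsplit(s, "\\R+\\d{2}:\\d{2}:\\d{2}", perl = TRUE)[[1]]

# Lookahead pattern: only the line break is matched, so the timestamp is kept.
kept <- strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl = TRUE)[[1]]

print(consumed)
print(kept)
```

With the consuming pattern the pieces after the first lose their timestamps, while the lookahead version keeps each timestamp at the start of its piece.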

R demo:

s <- "bla bla.\n14:39:51 info: pyku bla .\n14:39:51 info: \n14:39:51 info: \n14:39:57 Sam: <span>pyk pyk</span>\n14:43:15 on and on \n14:43:59 you get an idea"
strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)

Output:

[[1]]
[1] "bla bla."                           "14:39:51 info: pyku bla ."         
[3] "14:39:51 info: "                    "14:39:51 info: "                   
[5] "14:39:57 Sam: <span>pyk pyk</span>" "14:43:15 on and on "               
[7] "14:43:59 you get an idea"
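Since the question asks for rows in a data frame, the split result could be reshaped along these lines. This is only a sketch in base R: the data frame df3 and its Transcript column follow the question, but the id column, the sample values and the names parts and long are assumptions made for illustration.

```r
# Hypothetical data frame; only the Transcript column name comes from the question.
df3 <- data.frame(
  id = 1:2,
  Transcript = c(
    "bla bla.\n14:39:51 info: pyku bla .\n14:39:57 Sam: hi",
    "14:43:15 on and on \n14:43:59 you get an idea"
  ),
  stringsAsFactors = FALSE
)

# Split each transcript at line breaks that precede an HH:MM:SS timestamp.
parts <- strsplit(df3$Transcript, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl = TRUE)

# Repeat the other columns once per piece and flatten the pieces into rows.
long <- data.frame(
  id = rep(df3$id, lengths(parts)),
  Transcript = unlist(parts),
  stringsAsFactors = FALSE
)

print(long)
```

Each original row is expanded into one row per timestamped line, with the id column repeated so the pieces can still be traced back to their source string.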
