I have a dataframe with character strings that look like this:
bla bla.\n14:39:51 info: pyku bla .\n14:39:51 info: \n14:39:51 info: \n14:39:57 Sam: <span>pyk pyk</span>\n14:43:15 on and on \n14:43:59 you get an idea
I want to split each string into separate rows wherever \n is followed by a (number):(number):(number) sequence. I tried
stringr::separate_rows(df3$Transcript[1], Transcript , sep = "\\n")
and various combinations of it with [Az] and [:punct:], to no avail. What would be the most straightforward way of doing it?
Thanks
You want to split the strings at each line break that is followed by a timestamp. You can use the base R strsplit function with a PCRE regex based on a positive lookahead:
strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)
Pattern details
\\R+ - 1 or more line break sequences (either \\n or \\r or \\r\\n)
(?=\\d{2}:\\d{2}:\\d{2}) - followed by 2 digits, :, 2 digits, : and again 2 digits. Since (?=...) is a positive lookahead (a zero-width assertion that does not put the matched chars into the match value), the text matched with it is not removed from the results.
R demo:
s <- "bla bla.\n14:39:51 info: pyku bla .\n14:39:51 info: \n14:39:51 info: \n14:39:57 Sam: <span>pyk pyk</span>\n14:43:15 on and on \n14:43:59 you get an idea"
strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)
Output:
[[1]]
[1] "bla bla." "14:39:51 info: pyku bla ."
[3] "14:39:51 info: " "14:39:51 info: "
[5] "14:39:57 Sam: <span>pyk pyk</span>" "14:43:15 on and on "
[7] "14:43:59 you get an idea"
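Since the question asks for separate rows rather than a list, here is a minimal base R sketch of one way to expand the split pieces back into a data frame. The df3 below is a hypothetical stand-in for the asker's data (the id column and the shortened strings are assumptions for illustration); only the Transcript column name comes from the question.

```r
# Hypothetical stand-in for the asker's df3; only the Transcript
# column name is taken from the question, the rest is made up.
df3 <- data.frame(
  id = 1:2,
  Transcript = c(
    "bla bla.\n14:39:51 info: pyku bla .\n14:39:57 Sam: hi",
    "14:43:15 on and on \n14:43:59 you get an idea"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as above: line break(s) followed by an hh:mm:ss timestamp
pattern <- "\\R+(?=\\d{2}:\\d{2}:\\d{2})"

# strsplit is vectorised: one character vector of pieces per input row
parts <- strsplit(df3$Transcript, pattern, perl = TRUE)

# Repeat each row's id once per piece and flatten the pieces into one column
long <- data.frame(
  id = rep(df3$id, lengths(parts)),
  Transcript = unlist(parts),
  stringsAsFactors = FALSE
)

print(long)
```

Each original row contributes as many output rows as it has pieces, so any other columns can be carried along the same way with rep(..., lengths(parts)).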