
Regex pattern question in R

I need to extract the author and the time from a string in R.

test = "Postedby   BeauHDon Friday November 24, 2017 @10:30PM from the cost-effective dept."

I am currently using gsub() to find the desired output.

Expected output would be:

#author
"BeauHDon"
#Month
"November"
#Date
24
#Time
22:30

I got as far as gsub("Postedby (.*).*", "\\1", test), but the output is

"BeauHDon Friday November 24, 2017 @10:30PM from the cost-effective dept."

Also, I understand that the time requires more coding after extracting 10:30.

Is it possible to add 12 if the next two characters are PM?

Thank you.

We can extract the pieces with capture groups (assuming the pattern is as shown in the example). From the start ( ^ ) of the string, the pattern matches one or more non-whitespace characters ( \\S+ ) followed by one or more spaces ( \\s+ ), then captures the next word ( (\\w+) ) as the author. It skips the following word (the weekday) and the spaces around it, captures the next word as the month, captures the digits ( (\\d+) ) as the date, and finally captures the run of non-whitespace characters after the @ as the time.

v1 <- scan(text=sub("^\\S+\\s+(\\w+)\\s+\\w+\\s+(\\w+)\\s+(\\d+)[^@]+@(\\S+).*",
           "\\1,\\2,\\3,\\4", test), what = "", sep=",", quiet = TRUE)

As the last entry is the time, we can convert it to a date-time with strptime, change its format with format, and assign the result back to the last element.

v1[4] <- format(strptime(v1[4],  "%I:%M %p"), "%H:%M")
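The question also asks whether 12 can simply be added when the time ends in PM. That works too, and it avoids relying on %p, whose AM/PM strings depend on the locale. Below is a minimal sketch of that idea; to_24h is a hypothetical helper name, not part of the original answer, and it assumes the input looks like "10:30PM" (hour, colon, minutes, then AM or PM).

```r
# Hypothetical helper: convert a 12-hour time such as "10:30PM" to "22:30".
to_24h <- function(x) {
  # Capture the hour, the minutes, and the AM/PM marker
  parts <- regmatches(x, regexec("^(\\d{1,2}):(\\d{2})\\s*([AP]M)$", x))[[1]]
  stopifnot(length(parts) == 4)  # full match plus three capture groups

  hour <- as.integer(parts[2])
  minute <- parts[3]
  # Add 12 only for PM; the modulo maps 12AM to 00 and keeps 12PM at 12
  offset <- if (parts[4] == "PM") 12L else 0L
  hour <- hour %% 12L + offset

  sprintf("%02d:%s", hour, minute)
}

to_24h("10:30PM")
```

The modulo step matters for the edge cases: adding 12 blindly would turn 12:00PM into 24:00 and leave 12:05AM as 12:05.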

If needed, set the names of the elements (author, Month, etc.).

names(v1) <- c("#author", "#Month", "#Date", "#Time")
v1
#  #author     #Month      #Date      #Time 
#"BeauHDon" "November"       "24"    "22:30" 
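As an alternative sketch, the same capture groups can be read directly with regexec and regmatches, which skips the intermediate comma-separated string and the call to scan. The pattern is the one from the answer without the trailing .*; the variable names pattern, m, and fields are illustrative only.

```r
test <- "Postedby   BeauHDon Friday November 24, 2017 @10:30PM from the cost-effective dept."

# Same capture groups as in the sub() call: author, month, date, time
pattern <- "^\\S+\\s+(\\w+)\\s+\\w+\\s+(\\w+)\\s+(\\d+)[^@]+@(\\S+)"

# regmatches() returns the full match first, then one entry per group
m <- regmatches(test, regexec(pattern, test))[[1]]

# Drop the full match and label the four captured fields
fields <- setNames(m[-1], c("author", "Month", "Date", "Time"))
fields
```

Because nothing is joined and re-split on commas, a captured field that happened to contain a comma would stay in one piece. The time field is still in 12-hour form here and can be converted as described earlier.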


 