Extract name from email using regex in R

I have a string containing a chain of emails, and I need to extract the name of each sender (the text after "From :"). A sample is below:

str1 <- 'From : Wendy YEOW (SLA) To : xxxx@lt.org Subject : RE: OneService@S
From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S 
From: Siti Zaharah RAMAN (ARKS) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S 
From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S 
From: Chin Hwang LAU (TA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S'

I have the code below to extract the names:

library(stringr)  # str_extract_all() comes from the stringr package
str_extract_all(string=str1,pattern="\\b(From\\s*[:]+\\s*(\\w*))\\b")[[1]]
[1] "From : Wendy" "From: SLA"    "From: Siti"   "From: SLA"    "From: Chin"

But my desired output is:

[1] "Wendy YEOW (SLA)"    "SLA Enquiry (SLA)"    "Siti Zaharah RAMAN (ARKS)"   "SLA Enquiry (SLA)"    "Chin Hwang LAU (TA)"

Try this regular expression together with strsplit() :

gsub("From *: (.*?) (To|Sent).*", "\\1", strsplit(str1, "\n")[[1]])

[1] "Wendy YEOW (SLA)"         
[2] "SLA Enquiry (SLA)"        
[3] "Siti Zaharah RAMAN (ARKS)"
[4] "SLA Enquiry (SLA)"        
[5] "Chin Hwang LAU (TA)" 

This works because the back reference ( \\1 ) in the replacement stands for whatever the first set of parentheses, (.*?) , matched. The rest of each line is matched as well and therefore dropped.
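To make the back reference more concrete, here is a minimal sketch on a single line of the sample. The variable names ( line , pattern , groups , sender , same ) are chosen for illustration only, and perl = TRUE is an assumption added here so that the lazy quantifier follows PCRE semantics in both calls; the answer above uses the default engine.

```r
# One line from the sample, used to show what the capture group holds.
line <- "From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S"

pattern <- "From *: (.*?) (To|Sent).*"

# regexec() reports the full match followed by each parenthesised group.
groups <- regmatches(line, regexec(pattern, line, perl = TRUE))[[1]]
sender <- groups[2]  # first capture group, i.e. what "\\1" refers to

# sub() with "\\1" replaces the whole match by that first group.
same <- sub(pattern, "\\1", line, perl = TRUE)

print(sender)
print(identical(sender, same))
```

Note that if a line does not match the pattern at all, sub() and gsub() return it unchanged, so it may be worth checking the input first.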

You can use strsplit . There's no need for gsub here.

strsplit(str1, "From ?: | (To|Sent) ?:.*?(\\nFrom ?: |$)")[[1]][-1]
# [1] "Wendy YEOW (SLA)"          "SLA Enquiry (SLA)"         "Siti Zaharah RAMAN (ARKS)"
# [4] "SLA Enquiry (SLA)"         "Chin Hwang LAU (TA)"  

The regex basically consists of two parts:

  1. "From ?: " : This matches the "From :" marker at the beginning of the string. Splitting there returns an empty string and the rest of the original string.
  2. " (To|Sent) ?:.*?(\\nFrom ?: |$)" : This regex represents the text after the name. It matches the substring starting with "To" or "Sent" and ending with a line break ( "\\n" ) followed by the next "From", or ending at the end of the string ( "$" ).

Finally, the [-1] is necessary to remove the empty string (preceding the first "From" ).
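The two parts and the role of [-1] can be seen on a smaller input. The sketch below uses a shortened two-message version of the sample (an assumption made for brevity); msgs , pattern , pieces and senders are illustrative names, not part of the original code.

```r
# Shortened two-message version of the sample string (assumed for brevity).
msgs <- paste(
  "From : Wendy YEOW (SLA) To : xxxx@lt.org Subject : RE: OneService@S",
  "From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S",
  sep = "\n"
)

pattern <- "From ?: | (To|Sent) ?:.*?(\\nFrom ?: |$)"

pieces <- strsplit(msgs, pattern)[[1]]
# pieces[1] is the empty string to the left of the first "From : ";
# the remaining elements are the text between separators, i.e. the names.
senders <- pieces[-1]

print(pieces)
print(senders)
```

Because the first separator sits at the very start of the string, strsplit() always produces that leading empty element here, which is why dropping it by position is safe for input of this shape.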

Not very elegant, but you can try:

gsub(" *(From|To|Sent) *:? *","",regmatches(str1,gregexpr("From *:[^:]+",str1))[[1]])
#[1] "Wendy YEOW (SLA)"          "SLA Enquiry (SLA)"        
#[3] "Siti Zaharah RAMAN (ARKS)" "SLA Enquiry (SLA)"        
#[5] "Chin Hwang LAU (TA)"
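The one-liner above does two things, which the following sketch separates: first it grabs everything from "From :" up to the next colon, then it strips the keywords. It again uses a shortened two-message sample, and the names msgs , raw , senders and tricky are illustrative assumptions. The last lines show a possible caveat with a made-up name: because the cleanup pattern does not require surrounding spaces, a name that happens to contain "To", "From" or "Sent" would be clipped.

```r
# Shortened two-message version of the sample string (assumed for brevity).
msgs <- paste(
  "From : Wendy YEOW (SLA) To : xxxx@lt.org Subject : RE: OneService@S",
  "From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S",
  sep = "\n"
)

# Step 1: take "From :" plus everything up to (not including) the next colon.
raw <- regmatches(msgs, gregexpr("From *:[^:]+", msgs))[[1]]

# Step 2: strip the leading "From :" and the trailing "To" / "Sent" keyword.
senders <- gsub(" *(From|To|Sent) *:? *", "", raw)

# Caveat with a hypothetical name: the "To" inside "Tom" is removed as well.
tricky <- gsub(" *(From|To|Sent) *:? *", "", "From : Tom LEE (X) To ")

print(raw)
print(senders)
print(tricky)
```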
