
Extract text between parentheses with suffix

Here is the example:

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

I want the output to be the information inside '()*'. In this example, that is:

Shanghai Chart Center, Donghai Navigation Safety Administration of MOT

To match only the contents of (…)*, the tricky part is to avoid matching across two unrelated parenthetical groups (i.e. something like (…) … (…)*). The easiest way to accomplish this is to disallow closing parentheses inside the match:

stringr::str_match_all(t, r'{\(([^)]*)\)\*}')
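As a minimal sketch of how the result of that call might be used: str_match_all() returns a list with one matrix per input string, where the first column is the whole match and the second column is the capture group. The variable names m, full_match and inside are only illustrative, and the pattern is written as an ordinary escaped string here on the assumption that raw strings (R 4.0 or later) may not be available.

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# Same pattern as above, written with escaped backslashes instead of a raw string.
m <- str_match_all(t, "\\(([^)]*)\\)\\*")[[1]]

full_match <- unname(m[, 1])  # column 1: whole match, including "(" and ")*"
inside     <- unname(m[, 2])  # column 2: capture group, the text between the parentheses

print(inside)
```

Taking column 2 avoids a second step to strip the parentheses and the trailing star.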

Do note that this will fail for nested parentheses ( ( … ( … ) …)* ). Regular expressions are fundamentally unsuited to parse nested content so if you require handling such a case, regular expressions are not the appropriate tool; you'll need to use a context-free parser (which is a lot more complicated).
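If nesting does need to be handled, one possible approach (not a full parser, just a sketch) is to walk the string character by character and track the parenthesis depth. The function name extract_starred is hypothetical, and the sketch assumes that only outermost groups directly followed by * are wanted and that unbalanced closing parentheses can be ignored.

```r
# Hypothetical helper: returns the contents of every outermost "( ... )"
# group that is immediately followed by "*". Nested groups stay inside
# the returned text.
extract_starred <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  n <- length(chars)
  depth <- 0L
  start <- NA_integer_
  out <- character(0)
  for (i in seq_len(n)) {
    ch <- chars[i]
    if (ch == "(") {
      if (depth == 0L) start <- i  # remember where the outermost group opens
      depth <- depth + 1L
    } else if (ch == ")" && depth > 0L) {
      depth <- depth - 1L
      # The outermost group just closed: keep it only if a star follows directly.
      if (depth == 0L && i < n && chars[i + 1L] == "*") {
        out <- c(out, substr(x, start + 1L, i - 1L))
      }
    }
  }
  out
}

nested <- extract_starred("a (b (c) d)* e (f)")
print(nested)
```

On the example string from the question, this should return the same text as the regular expression, since that input has no nesting.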

The key here is to use the non-greedy wildcard .*?; otherwise, everything between the first ( and the last ) would be caught:

library(stringr)
t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'
str_extract_all(t, "(\\(.*?\\)\\*?)")[[1]] %>% str_subset("\\*$")
#> [1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"

Created on 2021-03-03 by the reprex package (v1.0.0)

You can use the rev() function if you want to reverse the order and get it right to left.

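This is far less elegant than I would like, but "(\\(.*?\\)\\*)" does not behave as one might expect. The quantifier is lazy, yet the regex engine starts matching at the leftmost (, so the match runs from the first ( to the first )*. That is why I made the star optional and then filtered for it at the end of each match. You can add %>% str_remove_all("\\*$") if you want to discard the trailing star from the result.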
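The over-match can be reproduced with a short base R sketch. It assumes PCRE (perl = TRUE) behaves like stringr's ICU engine for this pattern, which should hold because both are backtracking engines that report the leftmost match. The names lazy and negated are only illustrative.

```r
t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# Lazy quantifier with a mandatory star: the match still starts at the
# first "(" in the string and grows until the first ")*" is reached.
lazy <- regmatches(t, regexpr("\\(.*?\\)\\*", t, perl = TRUE))

# A negated character class cannot cross a parenthesis, so the match is
# forced to start at the "(" that belongs to the starred group.
negated <- regmatches(t, regexpr("\\([^)(]*\\)\\*", t, perl = TRUE))

print(lazy)
print(negated)
```

The lazy version spans two parenthetical groups, while the negated class keeps the match within a single pair of parentheses.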

Define a pattern that starts with (, is followed by any characters except ( or ) (expressed as the negated character class [^)(]+), and is closed by )*:

library(stringr)
str_extract_all(t, "\\([^)(]+\\)\\*")
[[1]]
[1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"

You can get rid of the list structure with unlist().
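To get from there to the bare text the question asks for, one option is to drop the first character and the last two characters of each match. This is only a sketch: starred and inner are illustrative names, and str_trim() is added on the assumption that stray spaces just inside the parentheses should be removed.

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# Flatten the list returned by str_extract_all() into a character vector.
starred <- unlist(str_extract_all(t, "\\([^)(]+\\)\\*"))

# Drop the leading "(" and the trailing ")*", then trim surrounding whitespace.
inner <- str_trim(str_sub(starred, 2, -3))

print(inner)
```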
