Extract text between parentheses with suffix

Here is the example:

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

I want the output to be the information inside '()*'. In this example, that is:

Shanghai Chart Center, Donghai Navigation Safety Administration of MOT

To match only the contents of (…)*, the tricky part is to avoid matching two unrelated parenthetical groups (i.e. something like (…) … (…)*). The easiest way to accomplish this is to disallow closing parentheses inside the match:

stringr::str_match_all(t, r'{\(([^)]*)\)\*}')

Do note that this will fail for nested parentheses ( ( … ( … ) …)* ). Regular expressions are fundamentally unsuited to parsing nested content, so if you need to handle such a case, regular expressions are not the appropriate tool; you'll need to use a context-free parser (which is a lot more complicated).
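For the simple case of balanced parentheses, a hand-written scanner with a depth counter is often enough and stands in for a full parser. The following is a minimal base-R sketch under that assumption; the function name extract_starred is hypothetical and not part of stringr or any other package, and unbalanced input is not handled specially.

```r
# Hypothetical helper: return the contents of every outermost "( ... )" group
# that is immediately followed by "*", tracking nesting depth by hand.
extract_starred <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  depth <- 0L            # current nesting level
  start <- NA_integer_   # position of the "(" that opened the outermost group
  out <- character(0)
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == "(") {
      if (depth == 0L) start <- i
      depth <- depth + 1L
    } else if (ch == ")" && depth > 0L) {
      depth <- depth - 1L
      # The outermost group just closed: keep it only if "*" follows directly.
      if (depth == 0L && i < length(chars) && chars[i + 1L] == "*") {
        out <- c(out, substr(x, start + 1L, i - 1L))
      }
    }
  }
  out
}

extract_starred("A (x); B (y (z) w)*; C (q)")
#> [1] "y (z) w"
```

On the string t from the question, this should return the same affiliation as the regular expression above, because that input contains no nesting.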

The key here is to use the non-greedy wildcard .*?, otherwise everything between the first ( and the last ) would be caught:

library(stringr)
t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'
str_extract_all(t, "(\\(.*?\\)\\*?)")[[1]] %>% str_subset("\\*$")
#> [1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"

Created on 2021-03-03 by the reprex package (v1.0.0)

You can use the rev() function if you want to reverse the order and get the matches from right to left.

This is far less elegant than I would like, but "(\\(.*?\\)\\*)" does not behave as I expected. The pattern is still lazy, yet the match starts at the first ( in the string, and .*? keeps expanding until it reaches a ) that is followed by *, so it swallows the preceding group as well. That is why I match every group with an optional star and then keep only the ones that end in *. You can add %>% str_remove_all("\\*$") if you want to discard the trailing star.
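The difference can be checked directly. The sketch below uses base R (regexpr with perl = TRUE) rather than stringr so that it runs without extra packages; the variable names lazy and strict are only illustrative.

```r
t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# Lazy wildcard: the match begins at the first "(" and grows until ")*" is found,
# so it also covers the first, unstarred group.
lazy <- regmatches(t, regexpr("\\(.*?\\)\\*", t, perl = TRUE))

# Negated character class: the match may not cross any parenthesis,
# so only the starred group is returned.
strict <- regmatches(t, regexpr("\\([^()]*\\)\\*", t, perl = TRUE))

lazy
#> [1] "(Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"
strict
#> [1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"
```

Laziness only affects how far a quantifier extends once a starting position is fixed; it does not make the engine prefer a later starting position, which is why excluding parentheses from the match is the more direct fix.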

Define a pattern that starts with (, is followed by any characters except ( or ) (expressed as a negated character class [^)(]+), and is closed by )*:

library(stringr)
str_extract_all(t, "\\([^)(]+\\)\\*")
[[1]]
[1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"

You can get rid of the list structure with unlist().
