Extract text between parentheses with suffix
Here is the example:
t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'
I want the output to be the information inside '()*'. In this example, that is: Shanghai Chart Center, Donghai Navigation Safety Administration of MOT
To match only the contents of (…)*, the tricky part is to avoid matching across two unrelated parenthetical groups (i.e. something like (…) … (…)*). The easiest way to accomplish this is to disallow closing parentheses inside the match:
stringr::str_match_all(t, r'{\(([^)]*)\)\*}')
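str_match_all() returns a list with one character matrix per input string: the first column holds the full match and each further column holds a capture group. As a minimal sketch of how the inner text could be pulled out (the variable names m and inner are illustrative and not part of the original answer; the raw-string syntax r'{…}' assumes R 4.0.0 or later):

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# One matrix per input string; column 1 is the full match "(...)*",
# column 2 is the capture group, i.e. the text between the parentheses.
m <- str_match_all(t, r'{\(([^)]*)\)\*}')[[1]]
inner <- m[, 2]
inner
```

For the example string, inner holds the single value Shanghai Chart Center, Donghai Navigation Safety Administration of MOT.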
Do note that this will fail for nested parentheses (( … ( … ) …)*). Regular expressions are fundamentally unsuited to parsing nested content, so if you need to handle such a case, regular expressions are not the appropriate tool; you'll need to use a context-free parser (which is a lot more complicated).
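If nesting does matter, one alternative to a regular expression is a small hand-written scanner that tracks parenthesis depth. The following is only a rough sketch under stated assumptions: the function name extract_starred() is hypothetical, the parentheses are assumed to be balanced, and only top-level groups immediately followed by * are returned. It is not a general context-free parser.

```r
# Hypothetical helper: return the contents of every top-level "( ... )" group
# that is immediately followed by "*". Assumes balanced parentheses.
extract_starred <- function(x) {
  chars <- strsplit(x, "")[[1]]   # split into single characters
  depth <- 0L                     # current nesting depth
  start <- NA_integer_            # position of the "(" that opened the top-level group
  out <- character(0)
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == "(") {
      if (depth == 0L) start <- i
      depth <- depth + 1L
    } else if (ch == ")" && depth > 0L) {
      depth <- depth - 1L
      # A top-level group just closed; keep it only if the next character is "*".
      if (depth == 0L && i < length(chars) && chars[i + 1L] == "*") {
        out <- c(out, substr(x, start + 1L, i - 1L))
      }
    }
  }
  out
}

nested <- extract_starred("A (x (y) z)*; B (w)")
nested
```

With this made-up input, the nested group is kept intact and the result is "x (y) z"; the unstarred group (w) is ignored.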
The key here is to use the non-greedy wildcard .*?, otherwise everything between the first ( and the last ) would be caught:
library(stringr)
t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'
str_extract_all(t, "(\\(.*?\\)\\*?)")[[1]] %>% str_subset("\\*$")
#> [1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"
Created on 2021-03-03 by the reprex package (v1.0.0)
You can use the rev() function if you want to reverse the order and get the results from right to left.
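A small illustration of rev(), using a made-up input string (t2 and starred are illustrative names, not from the original question) that contains two starred groups:

```r
library(stringr)

# Hypothetical input with two starred groups and one unstarred group.
t2 <- "A (first)*; B (second); C (third)*"

# Extract every parenthetical group, then keep only those ending in "*".
starred <- str_subset(str_extract_all(t2, "\\(.*?\\)\\*?")[[1]], "\\*$")

starred        # left-to-right order
rev(starred)   # right-to-left order
```

Here starred is "(first)*" followed by "(third)*", and rev() swaps the two.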
This is far less elegant than I would like, but unexpectedly "(\\(.*?\\)\\*)" does not behave non-greedily, so I had to detect the star at the end of each extracted string instead. You can add %>% str_remove_all("\\*$") if you want to discard the star at the end of the string.
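The behaviour described above follows from how regex engines search: they report the leftmost possible match, and a lazy quantifier only keeps the match as short as possible from that starting point. Because . can also match ), a match that starts at the first ( keeps expanding until it reaches )*. A quick check on a shortened, made-up string (s, lazy and negated are illustrative names):

```r
library(stringr)

# Hypothetical shortened input: one plain group, then one starred group.
s <- "A (one); B (two)*"

# Lazy wildcard: the match still starts at the first "(" and runs on to ")*".
lazy <- str_extract(s, "\\(.*?\\)\\*")

# Excluding parentheses from the wildcard keeps the match inside one group.
negated <- str_extract(s, "\\([^()]*\\)\\*")

lazy
negated
```

The lazy pattern returns "(one); B (two)*", while the negated character class returns only "(two)*".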
Define a pattern that starts with (, is followed by any characters except ( or ) (expressed as a negated character class [^)(]+) and is closed by )*:
library(stringr)
str_extract_all(t, "\\([^)(]+\\)\\*")
[[1]]
[1] "(Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*"
You can get rid of the list structure with unlist().
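As a minimal sketch of that last step, the list can be flattened and the delimiters trimmed to leave only the inner text. The names res and inner_text are illustrative, and trimming with str_sub() assumes each match has exactly the form "(" … ")*":

```r
library(stringr)

t <- 'Hui Wan (Shanghai Maritime University); Mingqiang Xu (Shanghai Chart Center, Donghai Navigation Safety Administration of MOT)*; Yingjie Xiao ( Shanghai Maritime University)'

# Flatten the one-element list into a plain character vector.
res <- unlist(str_extract_all(t, "\\([^)(]+\\)\\*"))

# Drop the leading "(" and the trailing ")*" to keep only the inner text.
inner_text <- str_sub(res, 2, -3)
inner_text
```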