Comma separated words regular expression
I'm having some issues while trying to parse expressions such as the following:
word1, word2[a,b,c], word3, ..., wordN
I'd like to get the following groups:
g1: word1
g2: word2[a,b,c]
g3: word3
Please note that the [.+] part is optional; the regular expression must be able to match expressions like the following:
word1,word2,word3
word1[a,b,c],word2,word3
word1[a,b,c],word2[e,f,g],word3
word1[a,b,c],word2[e,f,g],word3[i,j,l]
I made some attempts, but I can't find a way to correctly separate the groups.
I tried this regex on https://regex101.com and pasted your expressions into the "test strings" box.
/^([a-zA-Z0-9]+(?:\[.*\])?),([a-zA-Z0-9]+(?:\[.*\])?),([a-zA-Z0-9]+(?:\[.*\])?)$/gm
Each word is separated by a comma and has the form:
([a-zA-Z0-9]+(?:\[.*\])?)
Explanation:
(
[a-zA-Z0-9]+ # one or more alphanumeric characters (could use \w)
(?:\[.*\])? # an optional sequence surrounded by []s. (?: ) means a non-capturing group
)
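To try the pattern outside regex101, here is a minimal Python sketch. It assumes Python's re module with the re.VERBOSE flag, which lets the commented explanation above be used almost as-is. The names WORD, THREE_WORDS and three_groups are made up for illustration and are not part of the original answer.

```python
import re

# One item: alphanumerics plus an optional [...] suffix (same pattern as above).
WORD = r"""
    (                    # capture one item
      [a-zA-Z0-9]+       # one or more alphanumeric characters
      (?:\[.*\])?        # optional [...] suffix, non-capturing
    )
"""

# Exactly three comma-separated items, as in the regex above.
THREE_WORDS = re.compile(r"^" + WORD + "," + WORD + "," + WORD + r"$", re.VERBOSE)

def three_groups(line):
    """Return the three captured items, or None if the line does not match."""
    m = THREE_WORDS.match(line)
    return m.groups() if m else None

print(three_groups("word1[a,b,c],word2[e,f,g],word3"))
print(three_groups("word1, word2, word3"))  # spaces after commas are not allowed by this pattern
```

Note that this pattern is fixed at exactly three items and does not accept whitespace after the commas, so the first example in the question (with ", " separators and N words) would need a different approach.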
For the time being this seems to be working:
import re
rgx = re.compile(r"(\w+(\[.*?\])*).*?,?")
[key for key, val in rgx.findall("word1, word2[a,b,[c,,,]], word,3")]
# The regex starts by looking for alphanumeric characters with \w+.
# It then looks for an optional bracketed part, (\[.*?\])*: if a `[` is present, everything up to the next `]` is consumed.
# For each match, findall returns a tuple such as ('word2[a,b,c]', '[a,b,c]').
# We iterate over the tuples and keep only the first value of each.
output:
['word1', 'word2[a,b,[c,,,]', 'word', '3']
Another example:
[key for key, val in rgx.findall("word1[bbbb,cccc],word2[bbbb,cccc] ")]
output:
['word1[bbbb,cccc]', 'word2[bbbb,cccc]']
PS: still regexing to improve it.
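One possible tightening of the findall approach, as a sketch only: drop the capturing groups so findall returns plain strings, and forbid brackets inside the [...] suffix. The names ITEM and split_items are hypothetical, and this variant assumes the brackets are never nested.

```python
import re

# \w+ followed by an optional [...] suffix that contains no brackets itself.
# Assumption: brackets are not nested; nested input is not handled correctly.
ITEM = re.compile(r"\w+(?:\[[^\[\]]*\])?")

def split_items(text):
    """Return every word, with its optional [...] suffix, found in text."""
    return ITEM.findall(text)

print(split_items("word1, word2[a,b,c], word3"))
# ['word1', 'word2[a,b,c]', 'word3']
print(split_items("word1[bbbb,cccc],word2[bbbb,cccc] "))
# ['word1[bbbb,cccc]', 'word2[bbbb,cccc]']
```

Because there are no capturing groups, there is no tuple to unpack. Keep in mind that \w also matches the underscore.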
You can use re.split to split only on commas that are outside of brackets. Such commas can be recognized by the fact that, looking ahead from them, a closing bracket is never reached before an opening one (this is checked with a negative lookahead). This trick only works with non-nested brackets.
import re
print(re.split(r',(?![^[]*\])', 'word1[a,b,c],word2[e,f,g],word3'))
outputs ['word1[a,b,c]', 'word2[e,f,g]', 'word3']
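If the brackets can be nested, the lookahead trick no longer applies. A different technique (not a regex, and not from the answer above) is to walk the string and track bracket depth. The following is a rough sketch under that assumption; split_top_level is a hypothetical helper name, and surrounding whitespace is stripped from each item.

```python
def split_top_level(text):
    """Split text on commas that are not inside [...]; nesting is allowed."""
    items, current, depth = [], [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            # Assumption: a stray closing bracket is ignored rather than treated as an error.
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            # Top-level comma: finish the current item.
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    return items

print(split_top_level("word1, word2[a,b,[c,,,]], word3"))
# ['word1', 'word2[a,b,[c,,,]]', 'word3']
```

For non-nested input it gives the same items as the re.split version, apart from the whitespace stripping.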