Extracting words inside brackets using regex in Python
What would the regular expression be for the pattern shown in the image below? (Note: there are many more tags, in no specific order, and there is a lot of information between the tags that doesn't follow this pattern. I just need to extract the information within the square brackets.)
I need to separate the data inside each square bracket, e.g. severity and 2. So far, I have only been able to collect the bracketed data as a whole, using r'\[([^]]*)\]'. How do I separate the parts? Please explain as well. I am familiar with regex symbols but cannot wrap my head around these more complicated patterns.
You may use
import re
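# Raw string so the backslashes reach the regex engine unchanged.
# key: anything except brackets and whitespace; value: optional, non-empty, in double quotes.
rx = re.compile(r'\[(?P<key>[^\]\[\s]+)(?:\s+"(?P<value>[^"]+)")?\]')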
text = """lorem ipsum [severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"] lorem ipsum"""
result = {m.group('key'): m.group('value') for m in rx.finditer(text)}
print(result)
Which yields
{'severity': '2', 'maturity': '0', 'accuracy': '0', 'tag': 'application-multi'}
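Note that [ver ""] is missing from this result: [^"]+ requires at least one character between the quotes, so a tag with an empty value does not match. If empty values should be kept, one possible adjustment is to use * instead of +. This is only a sketch; the name rx_keep_empty and the shortened input are made up for illustration.

```python
import re

# Shortened, illustrative input that includes a tag with an empty value.
text = 'lorem ipsum [severity "2"] [ver ""] [maturity "0"] lorem ipsum'

# Same pattern as above, but [^"]* (zero or more) instead of [^"]+ (one or more),
# so an empty quoted value such as "" still matches.
rx_keep_empty = re.compile(r'\[(?P<key>[^\]\[\s]+)(?:\s+"(?P<value>[^"]*)")?\]')

result = {m.group('key'): m.group('value') for m in rx_keep_empty.finditer(text)}
print(result)  # {'severity': '2', 'ver': '', 'maturity': '0'}
```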
import re
value = '[severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"]'
print(re.findall(r'\[(\w+)\s+"([^"]+)"\]', value))
This will give you the keys and values: [('severity', '2'), ('maturity', '0'), ('accuracy', '0'), ('tag', 'application-multi')]
If you want a dictionary, that's easy: print(dict(re.findall(r'\[(\w+)\s+"([^"]+)"\]', value)))
Now the explanation of the regular expression. First, look for an opening bracket: \[ (escaped). Then capture the word characters: (\w+). Then match one or more whitespace characters followed by a double quote: \s+". Then capture everything that is not a double quote: ([^"]+). Finally, match the double quote followed by the closing bracket: "\].
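The same walk-through can be written into the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is a sketch of the pattern from this answer; the names tag_rx and pairs are only illustrative.

```python
import re

value = '[severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"]'

# re.VERBOSE lets the pattern carry layout whitespace and comments.
tag_rx = re.compile(r'''
    \[          # literal opening bracket (escaped)
    (\w+)       # group 1: the key, one or more word characters
    \s+"        # one or more whitespace characters, then the opening quote
    ([^"]+)     # group 2: the value, one or more non-quote characters
    "\]         # closing quote followed by a literal closing bracket
''', re.VERBOSE)

pairs = tag_rx.findall(value)
print(pairs)
```

Because group 2 uses +, the tag with the empty value ([ver ""]) is skipped here as well.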
I suggest using re.finditer to loop over the matches and build a dictionary from them:
import re
text = '[severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"]'
tags = {m.group(1): m.group(2)
        for m in re.finditer(r'\[(.*?)\s*"(.*?)"\]', text)}
print(tags)
{'severity': '2', 'ver': '', 'maturity': '0', 'accuracy': '0', 'tag': 'application-multi'}
This makes it convenient to look up individual items, but it does of course assume that the keys are unique. If they are not, you could instead use, for example, a list of 2-tuples:
[(m.group(1), m.group(2))
 for m in re.finditer(r'\[(.*?)\s*"(.*?)"\]', text)]
[('severity', '2'), ('ver', ''), ('maturity', '0'), ('accuracy', '0'), ('tag', 'application-multi')]
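If keys can repeat and you still want lookup by key, another option is to collect the values for each key in a list. A minimal sketch, assuming a made-up input in which tag appears twice (the second tag value is invented for illustration):

```python
import re
from collections import defaultdict

# Hypothetical input: the key "tag" occurs twice.
text = '[severity "2"] [tag "application-multi"] [tag "language-php"]'

# Each key maps to the list of all values seen for it, in order of appearance.
tags = defaultdict(list)
for m in re.finditer(r'\[(.*?)\s*"(.*?)"\]', text):
    tags[m.group(1)].append(m.group(2))

print(dict(tags))
```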
If you want both the first and second word of each pair:
>>> import re
>>> inp = '[severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"]'
>>> list_of_tuples = re.findall(r'\[(\w+) \"(.*?)\"\]', inp)
>>> list_of_tuples
[('severity', '2'), ('ver', ''), ('maturity', '0'), ('accuracy', '0'), ('tag', 'application-multi')]
Use
\[([^][]+?)(?:\s+"([^"]*)")?]
Explanation
--------------------------------------------------------------------------------
  \[                       '['
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^][]+?                  any character except: ']', '[' (1 or
                             more times (matching the least amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    "                        '"'
--------------------------------------------------------------------------------
    (                        group and capture to \2:
--------------------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \2
--------------------------------------------------------------------------------
    "                        '"'
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
  ]                        ']'
Python code:
import re
expression = r'\[([^][]+?)(?:\s+"([^"]*)")?]'
test = 'lorem ipsum [severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"] lorem ipsum'
print( {x.group(1):x.group(2) for x in re.finditer(expression, test)} )
Result:
{'severity': '2', 'ver': '', 'maturity': '0', 'accuracy': '0', 'tag': 'application-multi'}
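Because the quoted part is optional in this pattern, a tag that has only a key also matches, and its second group is None rather than an empty string. A small sketch of that behaviour, using a made-up valueless tag [nolog] that is not part of the original data:

```python
import re

expression = r'\[([^][]+?)(?:\s+"([^"]*)")?]'
# Hypothetical input: "nolog" is an invented tag with no quoted value.
test = '[severity "2"] [nolog] [ver ""]'

# Group 2 is None when the optional quoted part is absent, '' when it is empty.
tags = {m.group(1): m.group(2) for m in re.finditer(expression, test)}
print(tags)

# If None is inconvenient downstream, it can be normalised to an empty string.
normalised = {key: (val if val is not None else '') for key, val in tags.items()}
print(normalised)
```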