Extracting words inside brackets using regex in Python

What would be the regular expression for the pattern shown in the image below? (Note: there are many more tags, in no specific order, and there is a lot of information between the tags that doesn't follow this pattern. I just need to extract the information inside the square brackets.)

(image: data pattern)

I need to separate the items inside each pair of brackets, e.g. severity and 2. So far, I have only been able to collect the bracketed data as a whole using r'\[([^]]*)\]'. How do I separate them? Please explain: I am familiar with regex symbols but cannot wrap my head around these complicated patterns.

You may use

import re

# Raw string so that \[ and \s reach the regex engine without invalid-escape warnings
rx = re.compile(r"""\[(?P<key>[^\]\[\s]+)(?:\s+"(?P<value>[^"]+)")?\]""")
text = """lorem ipsum [severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"] lorem ipsum"""

result = {m.group('key'): m.group('value') for m in rx.finditer(text)}
print(result)

Which yields

{'severity': '2', 'maturity': '0', 'accuracy': '0', 'tag': 'application-multi'}

See a demo on regex101.com.
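Note that [ver ""] is absent from the output above: [^"]+ requires at least one character between the quotes, so a tag with an empty value does not match at all. If empty values should be kept, one option is to relax the quantifier to [^"]*. This is a minimal sketch on a shortened version of the sample text, not a change the answer itself proposes:

```python
import re

# Same pattern as above, but with [^"]* so an empty quoted value still matches.
rx = re.compile(r'\[(?P<key>[^\]\[\s]+)(?:\s+"(?P<value>[^"]*)")?\]')
text = 'lorem ipsum [severity "2"] [ver ""] [maturity "0"] lorem ipsum'

result = {m.group('key'): m.group('value') for m in rx.finditer(text)}
print(result)
# {'severity': '2', 'ver': '', 'maturity': '0'}
```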

import re
value = '[severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"]'
print(re.findall(r'\[(\w+)\s+"([^"]+)"\]', value))

This will give you the keys and values: [('severity', '2'), ('maturity', '0'), ('accuracy', '0'), ('tag', 'application-multi')]

If you want a dictionary, that's easy: print(dict(re.findall(r'\[(\w+)\s+"([^"]+)"\]', value)))

Now the explanation of the regular expression. First, look for an opening bracket: \[ (escaped). Then capture the word characters: (\w+). Then match one or more whitespace characters followed by a double quote: \s+". Then capture everything that is not a double quote: ([^"]+). Finally, match the closing double quote followed by the closing bracket: "\].
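The same breakdown can be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows a comment on each piece. The name TAG is only an illustrative choice; the pattern itself is the one explained above:

```python
import re

# The pattern from the explanation, one piece per line.
TAG = re.compile(r'''
    \[          # opening bracket (escaped)
    (\w+)       # group 1: the key, one or more word characters
    \s+"        # one or more whitespace characters, then a double quote
    ([^"]+)     # group 2: the value, anything that is not a double quote
    "\]         # closing double quote and closing bracket
''', re.VERBOSE)

value = '[severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"]'
pairs = TAG.findall(value)
print(pairs)
# [('severity', '2'), ('maturity', '0'), ('accuracy', '0'), ('tag', 'application-multi')]
```

As before, [ver ""] is skipped because ([^"]+) needs at least one character.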

I suggest using re.finditer to loop over matches, and use these to create a dictionary:

import re

text = '[severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"]'

tags = {m.group(1): m.group(2)
        for m in re.finditer(r'\[(.*?)\s*"(.*?)"\]', text)}

print(tags)
{'severity': '2', 'ver': '', 'maturity': '0', 'accuracy': '0', 'tag': 'application-multi'}

This makes it convenient to extract data items, but it does of course assume that keys are unique. If they are not, then you could instead use, for example, a list of 2-tuples:

[(m.group(1), m.group(2))
 for m in re.finditer(r'\[(.*?)\s*"(.*?)"\]', text)]
[('severity', '2'), ('ver', ''), ('maturity', '0'), ('accuracy', '0'), ('tag', 'application-multi')]
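If keys can repeat and lookup by key is still wanted, another option is to collect the values of each key in a list. The sketch below uses collections.defaultdict and a made-up input in which "tag" appears twice; the input and the name grouped are assumptions for illustration only:

```python
import re
from collections import defaultdict

# Hypothetical input with a repeated key ("tag") to show the grouping.
text = '[severity "2"] [tag "application-multi"] [tag "language-multi"]'

grouped = defaultdict(list)
for m in re.finditer(r'\[(.*?)\s*"(.*?)"\]', text):
    # Append rather than overwrite, so earlier values are not lost.
    grouped[m.group(1)].append(m.group(2))

print(dict(grouped))
# {'severity': ['2'], 'tag': ['application-multi', 'language-multi']}
```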

If you want both the first and second word of each pair:

>>> import re
>>> inp = '[severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"]'
>>> list_of_tuples = re.findall(r'\[(\w+) \"(.*?)\"\]', inp)
>>> list_of_tuples
[('severity', '2'), ('ver', ''), ('maturity', '0'), ('accuracy', '0'), ('tag', 'application-multi')]

Use

\[([^][]+?)(?:\s+"([^"]*)")?]

See proof

Explanation

--------------------------------------------------------------------------------
  \[                       '['
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^][]+?                  any character except: ']', '[' (1 or
                             more times (matching the least amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    "                        '"'
--------------------------------------------------------------------------------
    (                        group and capture to \2:
--------------------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \2
--------------------------------------------------------------------------------
    "                        '"'
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
  ]                        ']'

Python code:

import re
expression = r'\[([^][]+?)(?:\s+"([^"]*)")?]'
test = 'lorem ipsum [severity "2"] [ver ""] [maturity "0"] [accuracy "0"] [tag "application-multi"] lorem ipsum'
print( {x.group(1):x.group(2) for x in re.finditer(expression, test)} )

Result:

{'severity': '2', 'ver': '', 'maturity': '0', 'accuracy': '0', 'tag': 'application-multi'}
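Because the quoted part is wrapped in an optional group, a bracket with no quoted value still matches, and its second group is None. The sketch below shows this on a made-up input; the [nolog] tag and the name cleaned are hypothetical and only illustrate the behaviour:

```python
import re

expression = r'\[([^][]+?)(?:\s+"([^"]*)")?]'
# Hypothetical input: "[nolog]" has no quoted value at all.
test = '[severity "2"] [nolog] [ver ""]'

tags = {m.group(1): m.group(2) for m in re.finditer(expression, test)}
print(tags)
# {'severity': '2', 'nolog': None, 'ver': ''}

# Replace None with an empty string if later code expects only strings.
cleaned = {k: (v if v is not None else '') for k, v in tags.items()}
print(cleaned)
# {'severity': '2', 'nolog': '', 'ver': ''}
```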
