Multiple matches in non-greedy XML (Python regex)

I know this topic is asked a lot, but I couldn't find an answer to my question:

In the attached image there are many different buffers, and I wish to match only the buffers that have "Lut" in their names (notice there are 2 matches in the string in the image). The problem I have is that the matches also contain the buffers that come before the one I want.

http://i.stack.imgur.com/p0fPS.png

I'm pretty new to regex and still trying to learn as much as I can, so any explanation will be appreciated.

Thank you! :)

The string is attached for your convenience (if needed):

<?xml version="1.0" encoding="utf-8"?>
<pimp xmlns:dt="urn:schemas-microsoft-com:datatypes">
    <dllPath>C:\ReplayCode\Apps\Pimp</dllPath>
    <buffers>   
    <buffer name="InputMask">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="MaskErode">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="BlablaLutBla">
            <width>256</width>
            <height>256</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="MaskClose">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="InputVis">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>3</channels>
            <type>IMG</type>
    </buffer>   
        <buffer name="AddMaskEdge">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="EdgeVis">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>3</channels>
            <type>IMG</type>
    </buffer>       
        <buffer name="GrayEdge">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="EdgeMaskMulThreshold">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>
    <buffer name="MaskMulEdge">
            <width>5120</width>
            <height>3072</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>   
    </buffers>  

The regex I tried is this:

<buffer name=".*?Lut.*?">.*?<\/buffer>

And I expected 2 matches:

<buffer name="BlablaLutBla">
            <width>256</width>
            <height>256</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>

and

<buffer name="2ndLutBlabla">
            <width>256</width>
            <height>256</height>
            <data>UCHAR</data>
            <channels>1</channels>
            <type>IMG</type>
    </buffer>

You can use BeautifulSoup to parse your tag.

import re
from bs4 import BeautifulSoup

input_xml = ''' some xml '''
soup = BeautifulSoup(input_xml, "lxml-xml")
print(soup.find_all('buffer', attrs={"name": re.compile('Lut')}))

If you do not have this installed already:

pip install beautifulsoup4
pip install lxml

Since you need to manipulate the data inside an XML document, use an XML parser. An answer above already shows how to instantiate the XML tree, but does not dwell on modifying the structure.

BTW, if you instantiate the XML from a string, use ET.fromstring:

import xml.etree.ElementTree as ET
...
xml = "<<YOUR XML STRING>>" 
root = ET.fromstring(xml)

Otherwise, when reading from a file:

tree = ET.parse('file.xml')
root = tree.getroot()

Then, you can make the following replacements (here you can actually use a regex if necessary, because at this point you are dealing with plain text data without markup):

for buffer in root.findall("buffers/buffer"): 
    if "Lut" in buffer.get("name"):
        buffer.find('width').text = "100"    # Set inner text of buffer child named 'width'
        buffer[1].text = "125"               # Set the 2nd child inner text
        buffer.set('type', 'MY_TYPE')        # Add an attribute to buffer

You can print the updated XML using .dump():

ET.dump(root)                                # Print updated XML (dump() writes to stdout itself)

Or write the updated DOM to a file (if you are working with a file):

tree.write('output.xml')

See IDEONE demo showing modifications on an XML string.
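
As a minimal, self-contained sketch of the steps above: the sample XML here is a shortened, hypothetical stand-in for the file in the question, and buf and updated are illustrative names only. ET.tostring is used instead of dump() so the result can be kept in a variable.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the XML in the question.
SAMPLE_XML = """<pimp>
    <buffers>
        <buffer name="InputMask">
            <width>5120</width>
            <height>3072</height>
        </buffer>
        <buffer name="BlablaLutBla">
            <width>256</width>
            <height>256</height>
        </buffer>
    </buffers>
</pimp>"""

root = ET.fromstring(SAMPLE_XML)

for buf in root.findall("buffers/buffer"):
    # get() returns None for a missing attribute, so fall back to "".
    if "Lut" in buf.get("name", ""):
        buf.find("width").text = "100"   # set text of the child named 'width'
        buf[1].text = "125"              # set text of the 2nd child (here 'height')
        buf.set("type", "MY_TYPE")       # add an attribute to the buffer element

# Serialize the modified tree back to a string.
updated = ET.tostring(root, encoding="unicode")
print(updated)
```

Only the buffer whose name contains "Lut" is changed; the other buffer is left as it was.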

You might want to use XML parsing in Python instead; it is quite easy:

import xml.etree.ElementTree as ET
tree = ET.parse(xml)  # xml: path to (or file object for) the XML file
for buffer in tree.findall("buffers/buffer"): 
    if "Lut" in buffer.get("name"):
        # do your stuff
        pass

<buffer name="[^"]*Lut[^"]*">.*?<\/buffer>

See Demo

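In your regex, the part <buffer name=".*?Lut matches from the first <buffer to the first Lut, because . can also match the closing quote (and, with DOTALL, newlines), so .*? runs past the end of the name attribute into the following buffers. The non-greedy quantifier works as intended; if it were greedy, it would run to the last Lut instead. [^"]* cannot cross the closing quote, so the match stays inside a single name attribute.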
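
The difference can be checked with a short sketch. It assumes both patterns are compiled with re.DOTALL so that . also matches newlines; the sample XML is a shortened, hypothetical stand-in for the question's file, and the \/ escape is dropped because / needs no escaping in Python.

```python
import re

# Hypothetical, shortened stand-in for the XML in the question.
SAMPLE_XML = """<buffers>
    <buffer name="InputMask">
        <width>5120</width>
    </buffer>
    <buffer name="BlablaLutBla">
        <width>256</width>
    </buffer>
    <buffer name="MaskClose">
        <width>5120</width>
    </buffer>
    <buffer name="2ndLutBlabla">
        <width>256</width>
    </buffer>
</buffers>"""

# The pattern from the question: '.' may cross the closing quote.
original = re.compile(r'<buffer name=".*?Lut.*?">.*?</buffer>', re.DOTALL)
# The corrected pattern: [^"]* stays inside the name attribute.
fixed = re.compile(r'<buffer name="[^"]*Lut[^"]*">.*?</buffer>', re.DOTALL)

original_matches = original.findall(SAMPLE_XML)
fixed_matches = fixed.findall(SAMPLE_XML)

for label, matches in (("original", original_matches), ("fixed", fixed_matches)):
    for m in matches:
        # Show which buffer each match starts at.
        print(label, "match starts with:", m.splitlines()[0])
```

Both patterns find two matches, but each match of the original pattern begins at an earlier, unrelated buffer, while each match of the corrected pattern covers exactly one "Lut" buffer.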
