
Python Regex look behind

I have the following text:

<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8 
L 112.176 307.8 
L 174.672 270 
L 241.632 171.72 
L 304.128 58.32 
L 380.016 171.72 
L 442.512 217.08 
L 491.616 141.48 
L 491.616 307.8 
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>

I need to grab this portion out:

d="M 112.176 307.8 
L 112.176 307.8 
L 174.672 270 
L 241.632 171.72 
L 304.128 58.32 
L 380.016 171.72 
L 442.512 217.08 
L 491.616 141.48 
L 491.616 307.8 
z
"

I need to replace this section with something else. I was able to grab the entirety of <clipPath ...><path d="[code i want]"/> but this doesn't help me because I can't override the id in the <clipPath> element.

Note that there are other <clipPath> elements that I do not want to touch. I only want to change <path> elements within <clipPath> elements.

I'm thinking that the answer has to do with selecting everything from the start of a clipPath element up to the path section. Any help would be greatly appreciated.

I've been using http://pythex.org/ for help and have also seen odd behavior (having to do with multiline matching and spaces) that differs between that site and Python 3.x code.

Here are some of the things I've tried:

reg = r'(<clipPath.* id=".*".*>)'
reg = re.compile(r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")')
reg = re.compile(r'((?<!<clipPath).* id=".*".*>\s*<path.*d="(.*\n)+")')

g = reg.search(text)
g

Regex is not the proper way of parsing XML.

Here's a simple standalone example that does it using lxml:

from lxml import etree

text="""<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 112.176 307.8
L 174.672 270
L 241.632 171.72
L 304.128 58.32
L 380.016 171.72
L 442.512 217.08
L 491.616 141.48
L 491.616 307.8
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>"""

# Wrap the fragment in an arbitrary root node <X> so it parses as one document
root = etree.XML("<X>"+text+"</X>")
p = root.find(".//path")
print(p.get("d"))

result:

M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z 

(The line breaks come back as spaces because the XML parser normalizes whitespace inside attribute values.)

  • First, I create the main node. Since the snippet has several top-level nodes, I wrap them in an arbitrary main node.
  • Then I look for the first "path" element anywhere in the tree.
  • Once found, I get its d attribute.

Now I change the text of d and dump the tree:

p.set("d","[new text]")
print(etree.tostring(root))

Now the output looks like:

...
<path d="[new text]"/>\n
...

This is still quick and dirty, and since find() only returns the first match it is not robust to several path nodes, but it works with the snippet you provided (and I'm no XML expert, just fumbling).
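To handle several path nodes and to leave any path outside a clipPath alone, something like the following might work. It is only a sketch: it uses the standard library's xml.etree.ElementTree rather than lxml, the sample data is shortened, and the extra top-level path element is a hypothetical addition to show that it stays untouched. It also assumes the tags carry no namespace; in a full SVG file they usually do, and the tag names in the path expression would then need the namespace prefix.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the last <path> is a hypothetical element outside any clipPath.
text = """<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 174.672 270
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>
  <path d="M 0 0 z"/>"""

# Wrap the fragment in an arbitrary root node so it parses as one document.
root = ET.fromstring("<X>" + text + "</X>")

# Select only <path> elements that are direct children of a <clipPath>.
for p in root.findall(".//clipPath/path"):
    p.set("d", "[new text]")

result = ET.tostring(root, encoding="unicode")
print(result)
```

The clipPath ids are never touched because only the d attribute of the selected path elements is set.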

BTW, another hacky, non-regex way of doing it is a multi-character split:

text.split(' d="')[1].split('"/>')[0]

This takes the part after the first ' d="' delimiter, then the part before the following '"/>' delimiter. It preserves the multi-line formatting.
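For the replacement itself, the split result can be fed back into str.replace. This is a rough sketch on shortened sample data, and it assumes the first ' d="' in the text belongs to the path you want to change; the "[new text]" placeholder is arbitrary.

```python
# Shortened sample data based on the snippet in the question.
text = """<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 174.672 270
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>"""

# Extract the current value of d, line breaks included.
old_d = text.split(' d="')[1].split('"/>')[0]

# Replace only the first occurrence of that exact attribute.
new_text = text.replace(' d="' + old_d + '"', ' d="[new text]"', 1)
print(new_text)
```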

TL;DR: r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'

Let's break that down...

You started with r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")', which encloses your entire pattern inside a group, so the whole element is captured in the match object. Let's take out those outer parentheses: r'<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+"'

Next, you seem to use .* quite often, which can be dangerous because it is blind and greedy. For the clipPath id, if you know the id is always alphanumeric, a better solution might be r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d="(.*\n)+"'

Finally, let's look at what you actually want to capture. Your example shows you want to capture the quotation marks, so let's get those inside our capture group: ...*d=("(.*\n)+") . This leaves us with a weird nested group situation though, so let's make the inner group non-capturing: ...*d=("(?:.*\n)+") .

Now we're capturing what you want, but we still have a problem: what if there are multiple elements that satisfy these criteria? The greedy matching of the + in ...*d=("(?:.*\n)+") will capture every line in between. What we can do here is make the + non-greedy by following it with a ? : ...*d=("(?:.*\n)+?") .

Put all these things together:

r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'
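Here is a rough sketch of how that pattern could be used to swap out the captured part while leaving the clipPath id alone. The sample data is shortened and the "[new text]" placeholder is arbitrary. Note that the pattern expects the value of d to end with a line break before the closing quote, as in your snippet; a single-line d would not match.

```python
import re

# Shortened sample data based on the snippet in the question.
text = """<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 174.672 270
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>"""

pattern = re.compile(r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")')

m = pattern.search(text)
captured = m.group(1)  # the quoted value of d, quotation marks included

# Splice the replacement over the span of group 1 only.
new_text = text[:m.start(1)] + '"[new text]"' + text[m.end(1):]
print(new_text)
```

Using m.start(1) and m.end(1) means everything outside the capture group, including the clipPath tag, is copied through unchanged.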

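An XML-based solution that edits the path:

import xml.dom.minidom

# Parse the XML string with minidom (my_xml is defined under "Data used" below).
# The <X> wrapper gives the fragment a single root element.
DOMTree = xml.dom.minidom.parseString('<X>' + my_xml + '</X>')
collection = DOMTree.documentElement
# Only touch <path> elements nested inside a <clipPath>
for clip_path in collection.getElementsByTagName("clipPath"):
    paths = clip_path.getElementsByTagName('path')
    for path in paths:
        path.setAttribute('d', '[code i want]')

# print() as a function, since the question targets Python 3.x
print(DOMTree.toxml())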

Data used:

my_xml = """
    <clipPath id="p54dfe3d8fa">
       <path d="M 112.176 307.8
    L 112.176 307.8
    L 174.672 270
    L 241.632 171.72
    L 304.128 58.32
    L 380.016 171.72
    L 442.512 217.08
    L 491.616 141.48
    L 491.616 307.8
    z
    "/>
      </clipPath>
      <clipPath id="p27c84a8b3c">
       <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
      </clipPath>
"""
