Python Regex look behind

Question

I have the following text:

<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8 
L 112.176 307.8 
L 174.672 270 
L 241.632 171.72 
L 304.128 58.32 
L 380.016 171.72 
L 442.512 217.08 
L 491.616 141.48 
L 491.616 307.8 
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>

I need to grab this portion out:

d="M 112.176 307.8 
L 112.176 307.8 
L 174.672 270 
L 241.632 171.72 
L 304.128 58.32 
L 380.016 171.72 
L 442.512 217.08 
L 491.616 141.48 
L 491.616 307.8 
z
"

I need to replace this section with something else. I was able to grab the entirety of <clipPath ...><path d="[code i want]"/> but this doesn't help me because I can't override the id in the <clipPath> element.

Note that there are other <clipPath> elements that I do not want to touch. I only want to change <path> elements within <clipPath> elements.

I'm thinking that the answer has to do with selecting everything before a clipPath element and ending at the path section. Any help would be greatly appreciated.

I've been using http://pythex.org/ for help and have also seen odd behavior (having to do with multiline and spaces) that doesn't act the same between that site and Python 3.x code.

Here are some of the things I've tried:

reg = r'(<clipPath.* id=".*".*>)'
reg = re.compile(r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")')
reg = re.compile(r'((?<!<clipPath).* id=".*".*>\s*<path.*d="(.*\n)+")')

g = reg.search(text)
g

Answer 1

regex is never the proper way of parsing xml.

Here's a simple standalone example which does it using lxml :

from lxml import etree

text="""<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 112.176 307.8
L 174.672 270
L 241.632 171.72
L 304.128 58.32
L 380.016 171.72
L 442.512 217.08
L 491.616 141.48
L 491.616 307.8
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>"""

# Wrap the fragments in an arbitrary root element <X> so they parse as a single document
root = etree.XML("<X>"+text+"</X>")
p = root.find(".//path")
print(p.get("d"))

result:

M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z

first, I create the main node. Since there are several nodes, I wrap it in an arbitrary main node
then I look for "path" anywhere
once found, I get the d attribute

Now I'm changing the text for d and dump it:

p.set("d","[new text]")
print(etree.tostring(root))

now the output is like:

...
<path d="[new text]"/>\n
...

still, quick and dirty, maybe not robust to several path nodes, but works with the snippet you provided (and I'm no xml expert, just fumbling)
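To address the "several path nodes" caveat and the requirement to touch only path elements inside clipPath elements, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree in place of lxml so it runs without extra installs, and the free-standing path element with d="M 0 0" is a hypothetical addition to the sample, there only to show that it is left alone. It assumes the fragment has no XML namespace; a full SVG file usually declares one, and the tag names would then need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample, plus a hypothetical <path> outside any
# <clipPath> that must not be modified.
text = """<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 174.672 270
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>
  <path d="M 0 0"/>"""

# Wrap the fragments in an arbitrary root element so they parse as one document.
root = ET.fromstring("<X>" + text + "</X>")

# ".//clipPath/path" selects only <path> elements that are direct children
# of a <clipPath>, wherever that <clipPath> sits in the tree.
for p in root.findall(".//clipPath/path"):
    p.set("d", "[new text]")

# Serialize the children of the wrapper, leaving the wrapper itself out.
result = "".join(ET.tostring(child, encoding="unicode") for child in root)
print(result)
```

The loop handles any number of matching path elements, and the id attributes on the clipPath elements are never touched.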

BTW, another hacky/non-regex way of doing it: using multi-character split :

text.split(' d="')[1].split('"/>')[0]

This takes the part after the d=" delimiter, then the part before the "/> delimiter. It preserves the multi-line formatting.
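Since the goal is to replace the value rather than just read it, the same idea can be sketched with str.partition, which keeps the text on both sides of each delimiter. This is only an illustration on a shortened sample: it acts on the first d=" it finds and does not check that the path sits inside a clipPath.

```python
# Shortened version of the sample from the question.
text = '''<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 174.672 270
z
"/>
  </clipPath>'''

# Split once at the opening delimiter, then once at the closing delimiter.
head, sep, rest = text.partition(' d="')
old_d, end, tail = rest.partition('"/>')

# Reassemble with the new attribute value between the two delimiters.
new_text = head + sep + "[new text]" + end + tail
print(new_text)
```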

Answer 2

TL;DR: r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'

let's break that down...

you started with: r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")' which enclosed your entire pattern inside a group, so the whole element would be captured in the match object. Let's take out those parentheses: r'<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+"'

next you seem to use .* quite often, which can be dangerous because it is blind and greedy. for the clipPath id, if you know the id is always alphanumeric, a better solution might be r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d="(.*\n)+"'

finally, let's look at what you actually want to capture. your example shows you want to capture the quotation marks, so let's get those inside our capture group: ...*d=("(.*\n)+") . This leaves us with a weird nested group situation though, so let's make the inner group non-capturing: ...*d=("(?:.*\n)+") .

now we're capturing what you want, but we still have a problem... what if there are multiple elements that satisfy these criteria? the greedy matching of the + in ...*d=("(?:.*\n)+") will capture every line in between. What we can do here is to make the + non-greedy by following it with a ? : ...*d=("(?:.*\n)+?") .

put all these things together:

r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'
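As a rough sketch of how this pattern could be used for the replacement the question asks about: the version below adds a second capture group around the prefix (that extra group is an addition for illustration, not part of the pattern above), so the clipPath tag and its id can be put back unchanged while only the quoted d value is swapped. The sample text is shortened.

```python
import re

# Shortened version of the sample from the question.
text = '''<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 174.672 270
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>'''

# Group 1: everything up to and including d=
# Group 2: the quoted attribute value, spanning as few lines as possible.
pattern = re.compile(
    r'(<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=)("(?:.*\n)+?")'
)

match = pattern.search(text)
captured = match.group(2)

# Rebuild each match from the untouched prefix plus the new quoted value.
# A function is used as the replacement so the new text is inserted literally
# instead of being processed as a replacement template.
new_text = pattern.sub(lambda m: m.group(1) + '"[new text]"', text)
print(new_text)
```

No flags are needed here: . stops at newlines by default, and the line breaks are consumed explicitly by \s* and \n.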

Answer 3

An xml based solution that edits the path.

import xml.dom.minidom

# Parse the XML string with minidom, wrapped in an arbitrary root element <X>
# (my_xml is defined under "Data used" below)
DOMTree = xml.dom.minidom.parseString('<X>' + my_xml + '</X>')
collection = DOMTree.documentElement
# Only <path> elements nested inside a <clipPath> are modified
for clip_path in collection.getElementsByTagName("clipPath"):
    paths = clip_path.getElementsByTagName('path')
    for path in paths:
        path.setAttribute('d', '[code i want]')

print(DOMTree.toxml())

Data used:

my_xml = """
    <clipPath id="p54dfe3d8fa">
       <path d="M 112.176 307.8
    L 112.176 307.8
    L 174.672 270
    L 241.632 171.72
    L 304.128 58.32
    L 380.016 171.72
    L 442.512 217.08
    L 491.616 141.48
    L 491.616 307.8
    z
    "/>
      </clipPath>
      <clipPath id="p27c84a8b3c">
       <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
      </clipPath>
"""

3 answers

solution1
3 2017-01-27 20:13:12

solution2
2 ACCEPTED 2017-01-27 20:38:25

solution3
1 2017-01-27 20:36:21
