I'm trying to parse an XML file in Python with lxml and XPath. The structure looks like this:
<body>
<tu changedate="20190822T080742Z" creationdate="20190822T085527Z" creationid="blank" changeid="blank">
<prop type="client"> </prop>
<prop type="project"> </prop>
<prop type="domain"> </prop>
<prop type="subject"> </prop>
<prop type="corrected">no</prop>
<prop type="aligned">no</prop>
<prop type="x-document">Test_EN.docx</prop>
<prop type="x-Project Id">0001</prop>
<prop type="x-Product group">A</prop>
<prop type="x-Product">A</prop>
<prop type="x-Product">B</prop>
<prop type="x-TestList">TestValue1</prop>
<prop type="x-TestList">TestValue2</prop>
<prop type="x-Sample">SampleText</prop>
<prop type="x-Test">TestText</prop>
<prop type="x-Name">TestName</prop>
To find nodes dynamically inside a function, I store the name and value of the node I'm looking for in variables:
node_name = "x-Sample"
node_value = "SampleText"
xpath_expression = f'//body/tu/prop[@type="{node_name}"][text()="{node_value}"]'
elements = tree.xpath(xpath_expression)
The problem is that node_value can contain double quotes, which produces an invalid XPath expression. Since I am stuck with lxml, which only supports XPath 1.0, I can't escape the quotes inside the string literal.
Looking through Stack Overflow, I found that in XPath 1.0 this can apparently only be done with concat(). I also found the following function:
def xpath_string_escape(input_str):
    """ creates a concatenation of alternately-quoted strings that is always a valid XPath expression """
    parts = input_str.split('"')
    return "concat('" + "', \"'\" , '".join(parts) + "', '')"
Which then gives me this:
xpath_expression = '//body/tu/tuv/prop[@type="x-Sample"][text()="concat('SampleText', '')"]'
However this doesn't return the nodes I'm looking for.
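Two things stand out in the attempt above. The concat(...) call is wrapped in double quotes, so XPath treats it as the literal string "concat('SampleText', '')" rather than evaluating it. The function as posted also splits on double quotes but rejoins the pieces with a single-quote literal, so any double quote in the value would come back as a single quote. The following is a minimal sketch of the concat() idea with both points addressed. The helper names xpath_string_literal and find_props and the sample document are hypothetical, not taken from the post, and the path follows the structure shown at the top (//body/tu/prop).

```python
from lxml import etree


def xpath_string_literal(value):
    """Return an XPath 1.0 expression that evaluates to `value` (hypothetical helper)."""
    if '"' not in value:
        return f'"{value}"'   # no double quotes: a double-quoted literal is enough
    if "'" not in value:
        return f"'{value}'"   # only double quotes: a single-quoted literal is enough
    # Both quote types present: split on double quotes, wrap each piece in
    # double quotes, and rejoin the pieces with a single-quoted '"' literal.
    parts = value.split('"')
    return "concat(" + ", '\"', ".join(f'"{part}"' for part in parts) + ")"


def find_props(tree, node_name, node_value):
    """Find prop elements by type attribute and exact text (hypothetical helper)."""
    # The generated literal is spliced in WITHOUT surrounding quotes.
    expr = (
        f"//body/tu/prop[@type={xpath_string_literal(node_name)}]"
        f"[text()={xpath_string_literal(node_value)}]"
    )
    return tree.xpath(expr)


# Small sample document, assumed for illustration only.
root = etree.fromstring(
    b"<body><tu>"
    b"<prop type='x-Sample'>SampleText</prop>"
    b"<prop type='x-Sample'>say \"hi\" to 'me'</prop>"
    b"</tu></body>"
)

matches = find_props(root, "x-Sample", "say \"hi\" to 'me'")
print([element.text for element in matches])
```

The important detail is that the predicate reads [text()=concat(...)], with the concat() call as a bare expression rather than a quoted string.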
Alternatively, you can strip the double quotes from the node value with XPath's translate() function:
node_value = translate(//prop[@type="x-Sample"]/text(),'"',"")
Then build your XPath expression with contains() instead of an exact text() comparison:
xpath_expression = f'//body/tu/prop[@type="{node_name}"][contains(.,"{node_value}")]'
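A minimal sketch of how this approach might look in Python, under a few assumptions: the function names and the sample document are hypothetical, node_name is assumed to contain no double quotes, and the quotes are stripped on both sides (from the search value in Python, from the node text with translate()). Note that this match is looser than the original one, since contains() also accepts substrings. The second function is not part of the answer above: it swaps in lxml's XPath variable binding, where values are passed to xpath() as keyword arguments and referenced as $name and $value, so no quoting is needed at all.

```python
from lxml import etree

# Small sample document, assumed for illustration only.
root = etree.fromstring(
    b"<body><tu>"
    b"<prop type='x-Sample'>SampleText</prop>"
    b"<prop type='x-Sample'>say \"hi\" to 'me'</prop>"
    b"</tu></body>"
)


def find_props_ignoring_quotes(tree, node_name, node_value):
    """translate()/contains() approach: compare with double quotes removed."""
    stripped = node_value.replace('"', "")  # now safe inside a double-quoted literal
    expr = (
        f'//body/tu/prop[@type="{node_name}"]'
        f"[contains(translate(., '\"', ''), \"{stripped}\")]"
    )
    return tree.xpath(expr)


def find_props_with_variables(tree, node_name, node_value):
    """Swapped-in technique: bind the values as XPath variables instead."""
    return tree.xpath(
        "//body/tu/prop[@type=$name][text()=$value]",
        name=node_name,
        value=node_value,
    )


value = "say \"hi\" to 'me'"
print(len(find_props_ignoring_quotes(root, "x-Sample", value)))
print(len(find_props_with_variables(root, "x-Sample", value)))
```

The variable-based version keeps the exact-match semantics of the original expression, whereas the translate() version deliberately ignores double quotes.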