xpath regex doesn't search tail in lxml.etree

I'm working with lxml.etree and I'm trying to let users search a DocBook document for text. When a user provides the search text, I use the EXSLT match function to find the text within the document. The match works fine if the text is in element.text, but not if it is in element.tail.

Here's an example:

>>> import lxml.etree
>>>
>>> # XML as lxml.etree element
>>> root = lxml.etree.fromstring('''
...   <root>
...     <foo>Sample text
...       <bar>and more sample text</bar> and important text.
...     </foo>
...   </root>
... ''')
>>>
>>> # User provides search text    
>>> search_term = 'important'
>>>
>>> # Find nodes with matching text
>>> matches = root.xpath('//*[re:match(text(), $search, "i")]', search=search_term, namespaces={'re':'http://exslt.org/regular-expressions'})
>>> print(matches)
[]
>>>
>>> # But I know it's there...
>>> bar = root.xpath('//bar')[0]
>>> print(bar.tail)
 and important text.

I'm confused because the text() function by itself returns all the text, including the tail:

>>> # text() results
>>> text = root.xpath('//foo/text()')
>>> print(text)
['Sample text\n      ', ' and important text.\n    ']

How come the tail isn't being included when I use the match function?

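That's because in XPath 1.0, when a function that expects a string is given a node-set, it only takes the first node (in document order) into account. This applies to re:match() as well as to the built-in string functions such as contains(), starts-with(), etc. In your query, text() evaluated on foo selects two text nodes, 'Sample text' and ' and important text.', but only the first one is tested against the regex, so the tail is never examined.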
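The first-node behaviour can be checked directly with the built-in string() and contains() functions. This is a minimal sketch, assuming the sample XML from the question with the whitespace between elements stripped out for readability; the variable names are only illustrative.

```python
from lxml import etree

# Same structure as the question's sample, without the indentation whitespace.
root = etree.fromstring(
    "<root><foo>Sample text<bar>and more sample text</bar>"
    " and important text.</foo></root>"
)

# text() on <foo> selects two text nodes: foo's own text and bar's tail.
nodes = root.xpath("//foo/text()")

# Converting that node-set to a string keeps only the first node.
first_only = root.xpath("string(//foo/text())")

# contains() sees the same truncated string, so the tail is never tested.
found = root.xpath("contains(//foo/text(), 'important')")

print(nodes)
print(repr(first_only))
print(found)
```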

Instead, you can select every text node with //text(), apply the regex filter to each text node individually, and then return the matching text node's parent element, like so:

xpath = '//text()[re:match(., $search, "i")]/parent::*'
matches = root.xpath(xpath, search=search_term, namespaces={'re':'http://exslt.org/regular-expressions'})
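For reference, here is a self-contained version of the same query that can be run as-is. It is only a sketch: the helper name find_elements and the EXSLT_RE constant are hypothetical names introduced for illustration, and the sample XML is the one from the question with the indentation removed.

```python
from lxml import etree

# Namespace mapping for the EXSLT regular-expression functions.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

root = etree.fromstring(
    "<root><foo>Sample text<bar>and more sample text</bar>"
    " and important text.</foo></root>"
)


def find_elements(tree, search_term):
    """Return the parent elements of all text nodes matching search_term.

    Hypothetical helper: each text node (element text and tails alike) is
    tested on its own, so nothing is lost to first-node conversion.
    """
    xpath = '//text()[re:match(., $search, "i")]/parent::*'
    return tree.xpath(xpath, search=search_term, namespaces=EXSLT_RE)


# The original query: only the first text node of each element is tested.
old_matches = root.xpath(
    '//*[re:match(text(), $search, "i")]',
    search="important",
    namespaces=EXSLT_RE,
)

# The per-text-node query also finds text stored in a tail.
new_matches = find_elements(root, "important")

print([el.tag for el in old_matches])
print([el.tag for el in new_matches])
```

Note that in the XPath data model the tail of bar is a child text node of foo, so parent::* returns foo here, not bar.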
