
How to use regular expression in lxml xpath?

I'm using a construction like this:

doc = parse(url).getroot()
links = doc.xpath("//a[text()='some text']")

But I need to select all links whose text begins with "some text", so I'm wondering whether there is any way to use a regexp here. I didn't find anything in the lxml documentation.

You can do this (although you don't need regular expressions for the example). lxml supports regular expressions through the EXSLT extension functions (see the lxml docs for the XPath class; it also works for the xpath() method).

# The ^ anchors the pattern to the start of the text; without it the pattern can match anywhere.
doc.xpath("//a[re:match(text(), '^some text')]",
          namespaces={"re": "http://exslt.org/regular-expressions"})

Note that you need to give the namespace mapping, so that it knows what the "re" prefix in the xpath expression stands for.
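
As a self-contained sketch of the namespace mapping in action (the sample document and the variable names are made up for illustration), the same mapping can be passed either to the xpath() method or to a precompiled etree.XPath object. This version uses re:test with a ^ anchor, on the assumption that the EXSLT functions search anywhere in the string unless the pattern is anchored:

```python
from lxml import etree

# Hypothetical sample document, for illustration only.
doc = etree.fromstring(
    "<body>"
    "<a href='#1'>some text</a>"
    "<a href='#2'>some text2</a>"
    "<a href='#3'>ends with some text</a>"
    "</body>"
)

# Maps the "re" prefix used in the expression to the EXSLT namespace URI.
NS = {"re": "http://exslt.org/regular-expressions"}

# One-off query through the xpath() method.
links = doc.xpath("//a[re:test(text(), '^some text')]", namespaces=NS)

# Precompiled query through the XPath class, reusable across documents.
find_links = etree.XPath("//a[re:test(text(), '^some text')]", namespaces=NS)
same_links = find_links(doc)

hrefs = [a.get("href") for a in links]
print(hrefs)
```

Precompiling is mostly worthwhile when the same expression runs against many documents; for a single query the xpath() method is simpler.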

You can use the starts-with() function:

doc.xpath("//a[starts-with(text(),'some text')]")

Because I can't stand lxml's approach to namespaces, I wrote a little method that you can bind to the HtmlElement class.

Just import HtmlElement (it lives in lxml.html, not lxml.etree):

from lxml.html import HtmlElement

Then put this in your file:

# Patch the HtmlElement class to add a function that can handle regular
# expressions within XPath queries.
def re_xpath(self, path):
    return self.xpath(path, namespaces={
        're': 'http://exslt.org/regular-expressions'})
HtmlElement.re_xpath = re_xpath

And then when you want to make a regular expression query, just do:

my_node.re_xpath("//a[re:match(text(), 'some text')]")

And you're off to the races. With a little more work, you could probably modify this to replace the xpath method itself, but I haven't bothered since this is working well enough.
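
If patching a library class feels too invasive, a plain helper function gives the same convenience and works for any element type. This is only a sketch of that alternative, not the approach described above; the names EXSLT_RE_NS and re_xpath and the sample markup are hypothetical:

```python
from lxml import etree

# Prefix-to-URI mapping for the EXSLT regular expression functions.
EXSLT_RE_NS = {"re": "http://exslt.org/regular-expressions"}


def re_xpath(node, path, **variables):
    """Run an XPath query on node with the "re" prefix already bound."""
    return node.xpath(path, namespaces=EXSLT_RE_NS, **variables)


# Hypothetical sample markup, for illustration only.
root = etree.fromstring("<div><a>some text</a><a>other text</a></div>")
matches = re_xpath(root, "//a[re:test(text(), '^some')]")
texts = [a.text for a in matches]
print(texts)
```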

Why not just use the XPath function starts-with here? You can use it to select elements whose text starts with your word, something like:

doc.xpath("//a[starts-with(text(),'some text')]")

Note that if you want to select this element as well:

<a href="www.example.com">ends with some text2</a>

its text does not start with "some text", but it can be included by using the contains function instead, something like:

doc.xpath("//a[contains(text(),'some text')]")
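
One caveat worth sketching: in XPath 1.0, passing text() to a string function uses only the first text node that is a direct child of the element, so links with nested markup or leading whitespace can be missed. The comparison below uses a made-up document, and the hrefs helper is hypothetical; normalize-space(.) works on the full string value of the element instead:

```python
from lxml import etree

# Hypothetical sample document, for illustration only.
doc = etree.fromstring(
    "<body>"
    "<a href='#1'>some text</a>"
    "<a href='#2'><b>some</b> text in bold</a>"
    "<a href='#3'>  some text with leading spaces</a>"
    "<a href='#4'>ends with some text</a>"
    "</body>"
)


def hrefs(expr):
    """Return the href of every element selected by expr."""
    return [a.get("href") for a in doc.xpath(expr)]


# Only the first direct text node of each link is compared.
by_text = hrefs("//a[starts-with(text(), 'some text')]")

# The whole string value is compared, with surrounding whitespace trimmed.
by_string = hrefs("//a[starts-with(normalize-space(.), 'some text')]")

# contains() matches the phrase anywhere in the string value.
by_contains = hrefs("//a[contains(., 'some text')]")

print(by_text, by_string, by_contains)
```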

The answer is:

doc.xpath("//a[starts-with(text(), 'some')]")

This is the simplest approach, and the simplest approach is usually also the fastest and the best.

Suppose we have the following XML and parse it into doc:

from lxml import etree
s="""
<html>
<head><title>Page Title</title></head>
<body>
    <a href="www.example.com">some text</a>
    <a href="www.example.com">some text2</a>
    <a href="www.example.com">ends with some text2</a>
    <a href="www.example.com">other text1</a>
    <a href="www.example.com">other text2</a>
</body>
</html>
"""
doc=etree.fromstring(s)

We then test the speed of the three approaches mentioned in the previous answers.

time      statement
39.8 µs   doc.xpath("//a[re:match(text(), '^some')]", namespaces={'re': 'http://exslt.org/regular-expressions'})
29.3 µs   doc.xpath("//a[re:test(text(), '^some')]", namespaces={'re': 'http://exslt.org/regular-expressions'})
16.7 µs   doc.xpath("//a[starts-with(text(), 'some')]")
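
The measurements above can be reproduced roughly with the standard timeit module. The harness below is only a sketch: the loop counts are arbitrary, the queries dictionary is a hypothetical convenience, and the absolute numbers will differ from machine to machine.

```python
import timeit

from lxml import etree

doc = etree.fromstring("""
<html>
<head><title>Page Title</title></head>
<body>
    <a href="www.example.com">some text</a>
    <a href="www.example.com">some text2</a>
    <a href="www.example.com">ends with some text2</a>
    <a href="www.example.com">other text1</a>
    <a href="www.example.com">other text2</a>
</body>
</html>
""")

NS = {"re": "http://exslt.org/regular-expressions"}

# The three expressions being compared.
queries = {
    "re:match": "//a[re:match(text(), '^some')]",
    "re:test": "//a[re:test(text(), '^some')]",
    "starts-with": "//a[starts-with(text(), 'some')]",
}

# Sanity check first: every expression should select the same links.
counts = {label: len(doc.xpath(expr, namespaces=NS))
          for label, expr in queries.items()}

for label, expr in queries.items():
    def run(expr=expr):
        return doc.xpath(expr, namespaces=NS)

    # Best of 3 runs of 1000 calls each, reported per call.
    seconds = min(timeit.repeat(run, number=1000, repeat=3)) / 1000
    print(f"{label:12s} {seconds * 1e6:6.1f} µs")
```

Passing the namespace mapping to the starts-with query is harmless; it is done here only so that all three calls go through the same code path.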

According to the official website, re:match returns an object while re:test returns only a boolean. My guess is that re:match is more complicated than re:test: when the return value is an object instead of a boolean, more memory has to be allocated, which takes more time. That would explain why re:test is faster than re:match. So if you just want to check whether a string matches a pattern, re:test is enough. Another regular expression function is re:replace. If, like me, you use XPath heavily at work, it is worth reading the documentation thoroughly. This answers the title of the question: how to use regular expressions in lxml XPath.
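
To see the difference in return values directly, the three EXSLT functions can be evaluated as top-level expressions on string literals. This is a sketch under the assumption that lxml hands back the match result as a list of elements; the exact shape may vary between versions, so the code prints it rather than relying on it:

```python
from lxml import etree

NS = {"re": "http://exslt.org/regular-expressions"}

# Any element works as the context node; the expressions use literals only.
doc = etree.fromstring("<root/>")

# re:test returns a plain boolean.
is_match = doc.xpath("re:test('some text2', '^some')", namespaces=NS)

# re:match returns the match and its groups as a sequence of result objects.
groups = doc.xpath(r"re:match('some text2', '^(\w+) (\w+)')", namespaces=NS)

# re:replace takes the flags as third argument and the replacement as fourth.
replaced = doc.xpath("re:replace('some text2', 'some', 'g', 'other')",
                     namespaces=NS)

print(type(is_match), is_match)
print([getattr(g, "text", g) for g in groups])
print(replaced)
```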

But regular expressions should only be used when simple string functions cannot solve the problem. In your specific case, all you need is the starts-with function. Its time complexity is only O(n), where n is the length of the second string. The regular expression algorithm is more complicated, so it takes more time.

More about this topic:

From XPath 2.0 onward, regular expressions are available without EXSLT, but lxml only supports XPath 1.0.

Here is the W3C page:

https://www.w3.org/TR/xpath-functions/#string.match
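
A quick way to check this limitation is to try the XPath 2.0 matches() function and fall back to EXSLT. The snippet below is a sketch with a made-up document; it assumes lxml reports the unknown function as an XPathEvalError:

```python
from lxml import etree

# Hypothetical sample document, for illustration only.
doc = etree.fromstring("<body><a>some text</a><a>other text</a></body>")

try:
    # matches() is an XPath 2.0 function and is not part of XPath 1.0.
    doc.xpath("//a[matches(text(), '^some')]")
    xpath2_supported = True
except etree.XPathEvalError as exc:
    xpath2_supported = False
    print("matches() is not available:", exc)

# The EXSLT equivalent that lxml does provide.
NS = {"re": "http://exslt.org/regular-expressions"}
links = doc.xpath("//a[re:test(text(), '^some')]", namespaces=NS)
print([a.text for a in links])
```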
