python, lxml or etree to get a parent of a node containing some text

How can I get the parent node of a node containing a piece of text?

Moreover, can I use some regexp mechanism as the matching criterion for searching/filtering, for example searching with re.compile("th[ei]s? .ne") in the document below?

Say this one:

html = '''<html>
<head><title></title></head>
<body>
<table>
<tr><td>1a</td><td>2a</td><td>3a</td><td>4a</td><td>5a</td><td>6a</td></tr>
<tr><td>1b</td><td>2b</td><td>3b</td><td>4b</td><td>5b</td><td>6b</td></tr>
<tr><td>1c</td><td>2c</td><td>3c</td><td>4c</td><td>5c</td><td>6c this one</td></tr>
</table>
<div><div>
<table>
<tr><td>1A</td><td>2A</td><td>3A</td><td>4A</td><td>5A</td><td>6A</td></tr>
<tr><td>1B</td><td>2B</td><td>3B</td><td>4B</td><td>5B</td><td>6B</td></tr>
<tr><td>1C</td><td>2C</td><td>3C</td><td>4C</td><td>5C</td><td>6C</td></tr>
</table>this one
</div></div>
</body>
</html>'''

I would like to have an iterator that returns:

<td>6c this one</td>

and then:

<div>
<table>
<tr><td>1A</td><td>2A</td><td>3A</td><td>4A</td><td>5A</td><td>6A</td></tr>
<tr><td>1B</td><td>2B</td><td>3B</td><td>4B</td><td>5B</td><td>6B</td></tr>
<tr><td>1C</td><td>2C</td><td>3C</td><td>4C</td><td>5C</td><td>6C</td></tr>
</table>this one
</div>

I tried:

import lxml.html
root = lxml.html.document_fromstring(html)
root.xpath("//text()[contains(., 'one')]")

and

import xml.etree.ElementTree as ET
for e in ET.fromstring(html).iter():  # getiterator() was removed in Python 3.9
    if e.text and 'one' in e.text:
        print("Found string %r, element = %r" % (e.text, e))

But the best I can get is the node containing "this one" itself, while I am looking for the parent containing this text. Note that div and table are only examples: I really need to walk back up to the parent after finding "this one", rather than filter for XML elements containing "this one", because I will not know whether it is a div, a table, or anything else before finding what it contains.

(Note also that this is HTML and not well-formed XML, as I suppose that the second "this one" should have been wrapped in an XML tag.)

EDIT:

>>> root.xpath("//*[contains(child::*/text(), 'one')]") # why empty parent?
[]
>>> root.xpath("//*[contains(text(), 'one')]") # i expected to have a list with two elements td and div
[<Element td at 0x280b600>]
>>> root.xpath("//*[child::*[contains(text(), 'one')]]") # if parent: expected tr and div, if not parent expected table or div, still missing one
[<Element tr at 0x2821f30>]

BTW, using the last one works:

import lxml.html
#[... here add html = """...]
root = lxml.html.document_fromstring(html)
for i, x in enumerate(root.xpath("//text()[contains(., 'one')]/parent::*")):
    # serialize with lxml itself; encoding="unicode" returns str rather than bytes
    markup = lxml.html.tostring(x, encoding="unicode")
    print("%s => \n\t%s" % (i, markup.replace("\n", "\n\t")))

produces:

0 => 
    <td>6c this one</td>
1 => 
    <div>
    <table>
    <tr><td>1A</td><td>2A</td><td>3A</td><td>4A</td><td>5A</td><td>6A</td></tr>
    <tr><td>1B</td><td>2B</td><td>3B</td><td>4B</td><td>5B</td><td>6B</td></tr>
    <tr><td>1C</td><td>2C</td><td>3C</td><td>4C</td><td>5C</td><td>6C</td></tr>
    </table>this one
    </div>

Based on your example output, it seems like you want to get the element which contains the specified text "one". Your description, however, says you want the parent of this node.

Based on this assumption you can get the desired nodes using the following XPath:

//*[contains(text(), 'one')]

If you really want the parent of this node, you can do

//*[child::*[contains(text(), 'one')]]

By the way, as you can see, I used a predicate to get the node, so I filtered the XML nodes. In my opinion this is the more logical and readable approach, as it basically says "give me all the nodes that fulfill the given condition" rather than "give me the output of my condition and, from that point on, search for the actually desired output". But you could also do something like the following, which would better match your proposed solution:

//text()[contains(., 'one')]/parent::*

>>> root.xpath("//*[contains(child::*/text(), 'one')]") # why empty parent?
[]

This XPath expression selects every element whose first grandchild text node (the first text node, in document order, among the text children of its child elements) contains 'one'. The first argument to contains() is expected to be a string, so XPath 1.0 takes the first node in the result of child::*/text() and uses its string value. Since no element's first such text node contains "one", the result is an empty node-set.
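To see that conversion in action, here is a minimal sketch. It assumes a trimmed-down copy of the HTML from the question (fewer cells per row), and the variable names are only illustrative:

```python
import lxml.html

# Assumption: a shortened version of the question's document.
html = """<html><body>
<table>
<tr><td>1c</td><td>6c this one</td></tr>
</table>
<div><div>
<table>
<tr><td>1C</td><td>6C</td></tr>
</table>this one
</div></div>
</body></html>"""

root = lxml.html.document_fromstring(html)
tr = root.xpath("//tr")[0]

# Every text node belonging to the row's child elements, in document order.
texts = tr.xpath("child::*/text()")

# Converting that node-set to a string keeps only the first node's value.
first_as_string = tr.xpath("string(child::*/text())")

# So contains() only ever looks at the first cell's text.
found = tr.xpath("contains(child::*/text(), 'one')")

print(texts, repr(first_as_string), found)
```

The second cell does contain 'one', but contains() never sees it because only the first text node is converted to a string.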

>>> root.xpath("//*[contains(text(), 'one')]")
# i expected to have a list with two elements td and div
[<Element td at 0x280b600>]

For the same reason, this XPath expression selects all elements whose first text node child contains 'one'. That's why the <td> is selected, but the <div> isn't: the div's child text node containing 'one' is not its first child text node.

>>> root.xpath("//*[child::*[contains(text(), 'one')]]")
# if parent: expected tr and div,
# if not parent expected table or div, still missing one
[<Element tr at 0x2821f30>]

This faces the same limitation as the previous expression.
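If you prefer to keep the predicate style, one way around the limitation is to apply contains() to each text node separately, i.e. to put the test inside a predicate on text(). A small sketch, again assuming a shortened copy of the question's HTML:

```python
import lxml.html

# Assumption: a shortened version of the question's document.
html = """<html><body>
<table>
<tr><td>1c</td><td>6c this one</td></tr>
</table>
<div><div>
<table>
<tr><td>1C</td><td>6C</td></tr>
</table>this one
</div></div>
</body></html>"""

root = lxml.html.document_fromstring(html)

# Only the FIRST text child of each element is tested.
first_only = [e.tag for e in root.xpath("//*[contains(text(), 'one')]")]

# EVERY text child is tested individually, so the div's trailing text counts.
any_text = [e.tag for e in root.xpath("//*[text()[contains(., 'one')]]")]

print(first_only)
print(any_text)
```

The second expression selects the same elements as //text()[contains(., 'one')]/parent::*, just written as a filter on the element instead of a step up from the text node.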

Have you tried the last solution that @dirkk proposed?

//text()[contains(., 'one')]/parent::*

That should avoid your problem with passing multiple nodes as the first argument to contains().
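As for the regexp part of the question: XPath 1.0 itself has no regular expressions, but lxml supports the EXSLT regular-expressions extension functions, so the same text()-then-parent pattern can probably be used with re:test() in place of contains(). A sketch, assuming a shortened copy of the question's HTML and the pattern from the question:

```python
import lxml.html

# Assumption: a shortened version of the question's document.
html = """<html><body>
<table>
<tr><td>1c</td><td>6c this one</td></tr>
</table>
<div><div>
<table>
<tr><td>1C</td><td>6C</td></tr>
</table>this one
</div></div>
</body></html>"""

# EXSLT regular-expressions namespace, bound here to the prefix "re".
NS = {"re": "http://exslt.org/regular-expressions"}

root = lxml.html.document_fromstring(html)

# Match each text node against the regex, then step up to its parent element.
parents = root.xpath(
    "//text()[re:test(., 'th[ei]s? .ne')]/parent::*",
    namespaces=NS,
)

for element in parents:
    print(element.tag)
```

This only works with lxml; xml.etree.ElementTree's limited XPath support has no extension functions, so there you would have to iterate and apply a compiled re pattern in Python.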
