Remove elements from tree based on list of terms

I'm trying to capture some text from a webpage (whose URL is passed when running the script), but it's buried in a paragraph tag with no other attributes assigned. I can collect the contents of every paragraph tag, but I want to remove from the tree any element that contains one of a list of keywords.

I get the following error:

    tree.remove(elem)
    TypeError: Argument 'element' has incorrect type (expected lxml.etree._Element, got _ElementStringResult)

I understand that what I am getting back when I try to iterate through the tree is the wrong type, but how do I get the element instead?

Sample Code:

    #!/usr/bin/python

    import sys

    import requests
    from lxml import html

    url = sys.argv[1]  # URL passed on the command line
    page = requests.get(url)
    tree = html.fromstring(page.content)

    terms = ['keyword1','keyword2','keyword3','keyword4','keyword5','keyword6','keyword7']
    # text() yields string results, not elements
    paragraphs = tree.xpath('//p/text()')
    for elem in paragraphs:
        if any(term in elem for term in terms):
            tree.remove(elem)  # raises the TypeError shown above

In your code, elem is an _ElementStringResult, which has the instance method getparent. Its parent is the Element object for one of the <p> nodes.

An element is removed from the tree through the remove method of its own parent:

element.getparent().remove(element)

I do not believe there is a more direct way and I don't have a good answer to why there isn't a removeself method.
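To make the getparent behaviour concrete, here is a minimal sketch that lists what each XPath text() result reports about its origin. The helper name describe_text_results and the sample markup are illustrative assumptions, not part of the question. It relies on the is_text and is_tail properties that lxml attaches to these string results: text that follows a child element is that child's tail, so getparent returns the child rather than the <p>.

```python
from lxml import html


def describe_text_results(tree, xpath='//p/text()'):
    """Return (text, is_text, is_tail, tag of getparent()) per string result."""
    return [
        (str(t).strip(), t.is_text, t.is_tail, t.getparent().tag)
        for t in tree.xpath(xpath)
    ]


# Hypothetical markup: one paragraph with an inline child element.
doc = html.fromstring('<div><p>keyword1 <b>bold</b> trailing</p></div>')
for row in describe_text_results(doc):
    print(row)
```

If a tail result matches a keyword, elem.getparent() is the inline child, so the two-step removal would take out the child's parent only by going up one level too few or too many depending on nesting. Checking is_tail before removing is one way to guard against that.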

Using this example HTML:

content = '''
<root>
    <p> nothing1 </p>
    <p> keyword1 </p>
    <p> nothing2 </p>
    <p> nothing3 </p>
    <p> keyword4 </p>
</root>
'''

You can see this in action in your code with:

from lxml import html

tree = html.fromstring(content)

terms = ['keyword1','keyword2','keyword3','keyword4','keyword5','keyword6','keyword7']
paragraphs = tree.xpath('//p/text()')
for elem in paragraphs:
    if any(term in elem for term in terms):
        # elem is a string result; its parent is the enclosing <p> element
        actual_element = elem.getparent()
        # an element is removed through its own parent
        actual_element.getparent().remove(actual_element)

# iterate over the remaining children of the root
# (getchildren() is deprecated in favour of direct iteration)
for child in tree:
    print('<{tag}>{text}</{tag}>'.format(tag=child.tag, text=child.text))

# Output:
# <p> nothing1 </p>
# <p> nothing2 </p>
# <p> nothing3 </p>

From the comments, it seems this code isn't working for you. If so, please provide more information about the structure of your HTML.
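One possible cause, offered only as a guess since the real page isn't shown: if the paragraphs contain inline children such as <b> or <a>, a //p/text() query also returns tail text whose getparent is the inline child, and a paragraph can match more than once. The sketch below sidesteps both cases by selecting the <p> elements themselves and testing their full text with text_content(). The function name remove_paragraphs_with_terms and the sample markup are hypothetical.

```python
from lxml import html


def remove_paragraphs_with_terms(tree, terms):
    """Remove every <p> whose full text contains any term; return the count."""
    removed = 0
    # xpath() returns a list, so removing elements does not disturb the loop.
    for p in tree.xpath('//p'):
        parent = p.getparent()
        if parent is None:
            continue  # p is the root itself, so there is nothing to remove it from
        # text_content() includes the text of nested elements such as <b>.
        if any(term in p.text_content() for term in terms):
            parent.remove(p)
            removed += 1
    return removed


# Hypothetical sample: the keyword sits after an inline element.
sample = html.fromstring(
    '<div><p>nothing1</p><p>intro <b>bold</b> keyword1</p><p>nothing2</p></div>'
)
remove_paragraphs_with_terms(sample, ['keyword1', 'keyword4'])
print([p.text_content() for p in sample.xpath('//p')])
```

Note that remove() also discards any tail text that follows the removed element. Elements parsed with lxml.html additionally offer drop_tree(), which removes the element but keeps its tail, if that matters for your page.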
