Find text using lxml etree

I'm trying to get the text that follows one tag using lxml etree.

<div class="litem__type">
            <div>
                Robbp
            </div>


                    <div>Estimation</div>

                 +487 (0)639 14485653


                                •
                                <a href="mailto:herbrich@gmail.com">
                                    Email Address
                                </a>



                    •
                    <a class="external" href="http://www.google.com">
                        Homepage
                    </a>


        </div>

The problem is that I can't locate it reliably because these snippets vary a lot. In some cases the first and second div are missing entirely. As you can see, the telephone number is not in its own div.

I suppose it would be possible to extract the telephone number using BeautifulSoup's contents, but I'm trying to use the lxml module's xpath.

Do you have any ideas? (The email link is sometimes absent.)

EDIT: The best idea is probably to use a regex, but I don't know how to tell it to extract just the text between two <div></div> elements.

You should avoid using regex to parse XML/HTML wherever possible: a pattern breaks easily when the markup changes (nesting, attributes, whitespace), whereas an element tree gives you the document structure directly.

The text after element A's closing tag, but before element B's opening tag, is called element A's tail text. To select this tail text using lxml etree you could do the following:

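from lxml import etree

content = '''
<div class="litem__type">
    <div>Robbp</div>
    <div>Estimation</div>
    +487 (0)639 14485653
    <a href="mailto:herbrich@gmail.com">Email Address</a>
    <a class="external" href="http://www.google.com">Homepage</a>
</div>'''

# Parse the snippet; the root element is the outer <div>.
tree = etree.XML(content)
# The phone number is the tail of the second inner <div>:
# the text between its closing tag and the next opening tag.
phone_number = tree.xpath('div[2]')[0].tail.strip()
print(phone_number)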

Output

+487 (0)639 14485653

The strip() function is used here to remove whitespace on either side of the tail text.
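The question mentions that the inner div elements are sometimes missing, in which case 'div[2]' finds nothing. One possible way around that, offered only as a sketch, is to ask XPath for all text nodes that are direct children of the outer div and keep the first one that looks like a phone number. The names PHONE_RE and find_phone are hypothetical, and the pattern is an assumption about what the numbers look like, so adjust it to your data.

```python
import re
from lxml import etree

# Assumed phone format: optional "+", a digit, then digits, spaces,
# parentheses or hyphens, ending in a digit. Adjust to your data.
PHONE_RE = re.compile(r'\+?\d[\d\s()\-]{5,}\d')

def find_phone(container):
    """Return the first phone-like string among the direct text nodes
    of `container`, or None if there is none."""
    # 'text()' yields the element's own text and the tails of its children.
    for text in container.xpath('text()'):
        match = PHONE_RE.search(text)
        if match:
            return match.group()
    return None

with_divs = etree.XML(
    '<div class="litem__type"><div>Robbp</div><div>Estimation</div>'
    ' +487 (0)639 14485653 '
    '<a href="mailto:herbrich@gmail.com">Email Address</a></div>'
)
without_divs = etree.XML(
    '<div class="litem__type"> +487 (0)639 14485653 '
    '<a class="external" href="http://www.google.com">Homepage</a></div>'
)

print(find_phone(with_divs))
print(find_phone(without_divs))
```

Because the lookup no longer depends on the position of a particular div, the same call works whether or not the name and "Estimation" elements are present.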

You can iterate over the div elements and read the text that follows each one from its tail attribute.

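from lxml import etree

tree = etree.parse("filename.xml")
# Visit every <div> in the document and look at the text that follows it.
for node in tree.xpath('//div'):
    # tail is None when no text follows the closing tag
    if node.tail and node.tail.strip():
        # you can check here if it is a phone number
        print(node.tail.strip())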
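If the input is real-world HTML rather than well-formed XML, the lxml.html parser may be a better fit. The sketch below is an assumption about how the page looks, based on the snippet in the question (including the "•" separators). It walks the direct children of the outer div, treats a missing tail as empty, and strips whitespace and bullets before keeping what is left.

```python
from lxml import html

snippet = '''
<div class="litem__type">
    <div>Robbp</div>
    <div>Estimation</div>
    +487 (0)639 14485653
    •
    <a href="mailto:herbrich@gmail.com">Email Address</a>
    •
    <a class="external" href="http://www.google.com">Homepage</a>
</div>'''

# For a fragment with a single top-level element, fromstring returns
# that element (here the outer <div>).
root = html.fromstring(snippet)

tails = []
for node in root:
    # .tail is None when nothing follows the closing tag, so fall back to ''.
    # Strip whitespace and the bullet separators used on the page.
    tail = (node.tail or '').strip(' \n\t•')
    if tail:
        tails.append(tail)

print(tails)
```

Any tail that survives the filtering is a candidate for the phone number. A further check, such as a regular expression on the extracted string, could confirm it.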
