Previous and next sibling issue - Python - BS4

Question

Sorry to bother you with such a simple question, but I'm losing my mind over it.

I'm trying to get one particular piece of information from the following HTML. In this case, I want the XXXX (a piece of text, to be more specific).

    <div id="links">
        <h3 id="financial">
            Financial S<span class="linktype">Commer</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Ea</a> | xxxxx<br/>
        <a href="http:" target="_blank">We</a> | xxxx<br/>
        <a href="http:" target="_blank">HQ</a> | xxxxx<br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="services">
            Services<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">To</a> | xxxx <br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="dr">
            Dr<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Eu</a> | xxxx<br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="physical">
            Phys<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Eu</a> | xxxx<br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>

I'm using BS4 to work with it:

    for x in xpto:
        titulo = x.text #to get the Name link. Worked
        link = str(x.get("href")) #To get just the link. Worked too.
        print(titulo)
        print(link)

My issue is how to get just the XXXXX, which is a kind of description of the link. As you can see, it's not inside the 'a' tag, but after the "|" character and, I think, before the "br/" (which, btw, I don't understand: why is there a "br/" if there is no "br" before it to open it? Is that normal?)

I tried working with previous and next sibling.

    for x in xpto:
        desc = x.parent.find_next_sibling('a')
        desc2 = x.parent.find_previous_sibling('b')
        print(desc)
        print(desc2)

Both give me back 'None' as the result. Does anyone know what is happening?

Update

I want to combine this with the other loop. Something like this:

    for x in xpto:
        titulo = x.text #to get the Name link. Worked
        link = str(x.get("href")) #To get just the link. Worked too.
        desc = x.parent.find_next_sibling('a')
        print(titulo)
        print(desc)
        print(link)

I built the xpto object like this:

    xpto = links.find_all(['h3', 'a']) # which works for the title and link.

To be able to get the desc value, I think I should change xpto to something like this:

    xpto = links.find_all(['h3', 'a'], a.next.next.strip(' |')) # it would include the description, and afterwards I would be able to do the loop. But I have no idea how to do such a complex findAll.

Sorry, guys. Web scraping is really hard!

Thank you for your help =D

btw: python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) Macbook Sierra 10.12.6

Answer 1

You could just use next twice, and then strip off the part of the text you don't want. For example:

    from bs4 import BeautifulSoup

    html = """
        <div id="links">
            <h3 id="financial">
                Financial S<span class="linktype">Commer</span>
            </h3>
            <hr/>
            <a href="http:" target="_blank">Ea</a> | xxxxx<br/>
            <a href="http:" target="_blank">We</a> | xxxx<br/>
            <a href="http:" target="_blank">HQ</a> | xxxxx<br/>
            <div class="up"><a href="#top" title="Back to top">Λ</a></div>
            <h3 id="services">
                Services<span class="linktype">Commercial Links</span>
            </h3>
            <hr/>
            <a href="http:" target="_blank">To</a> | xxxx <br/>
            <a href="http:" target="_blank">On</a> | xxxx <br/>
            <div class="up"><a href="#top" title="Back to top">Λ</a></div>
            <h3 id="dr">
                Dr<span class="linktype">Commercial Links</span>
            </h3>
            <hr/>
            <a href="http:" target="_blank">Eu</a> | xxxx<br/>
            <a href="http:" target="_blank">On</a> | xxxx <br/>
            <div class="up"><a href="#top" title="Back to top">Λ</a></div>
            <h3 id="physical">
                Phys<span class="linktype">Commercial Links</span>
            </h3>
            <hr/>
            <a href="http:" target="_blank">Eu</a> | xxxx<br/>
            <a href="http:" target="_blank">On</a> | xxxx <br/>
            <div class="up"><a href="#top" title="Back to top">Λ</a></div>"""

    soup = BeautifulSoup(html, "html.parser")
    div = soup.find('div', id='links')

    for el in div.find_all(['a', 'h3']):
        if el.name == 'a':
            if 'target' in el.attrs:        # Only 'a' tags with target
                print("link text '{}', link '{}', desc '{}'".format(el.text, el['href'], el.next.next.strip(' |\n')))
        else:
            el.span.clear()     # Remove 'Commercial Links' (if not needed)
            print("h3_title '{}'".format(el.get_text(strip=True)))

This would display:

    h3_title 'Financial S'
    link text 'Ea', link 'http:', desc 'xxxxx'
    link text 'We', link 'http:', desc 'xxxx'
    link text 'HQ', link 'http:', desc 'xxxxx'
    h3_title 'Services'
    link text 'To', link 'http:', desc 'xxxx'
    link text 'On', link 'http:', desc 'xxxx'
    h3_title 'Dr'
    link text 'Eu', link 'http:', desc 'xxxx'
    link text 'On', link 'http:', desc 'xxxx'
    h3_title 'Phys'
    link text 'Eu', link 'http:', desc 'xxxx'
    link text 'On', link 'http:', desc 'xxxx'
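
As for why the attempt in the question printed None: x.parent is the enclosing div, so find_next_sibling('a') looks at the siblings of that div rather than of the link, and the description itself is a bare text node, not a tag, so a search by tag name cannot match it. Here is a minimal sketch of reading that text node directly with next_sibling. It assumes the description always comes immediately after the closing </a>; the trimmed-down HTML and the helper name describe_links are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened stand-in for the page in the question (illustrative only).
html = """
<div id="links">
    <h3 id="financial">Financial S<span class="linktype">Commer</span></h3>
    <hr/>
    <a href="http:" target="_blank">Ea</a> | xxxxx<br/>
    <a href="http:" target="_blank">We</a> | xxxx <br/>
    <div class="up"><a href="#top" title="Back to top">top</a></div>
</div>
"""

def describe_links(markup):
    """Return (text, href, description) for each link that has a target attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    div = soup.find("div", id="links")
    rows = []
    for a in div.find_all("a", target=True):
        node = a.next_sibling  # node right after </a>: text, a tag, or None
        # Only text nodes can be stripped; anything else means "no description".
        desc = node.strip(" |\n") if isinstance(node, NavigableString) else ""
        rows.append((a.get_text(), a["href"], desc))
    return rows

rows = describe_links(html)
for row in rows:
    print(row)

# Reproducing the question's problem: the parent of a link is the div,
# and that div has no <a> siblings to find.
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", target=True)
print(first.parent.name)
print(first.parent.find_next_sibling("a"))
```

The isinstance check is only there so that a link followed directly by another tag does not raise an error; whether an empty description is the right fallback depends on the page.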

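<br /> is sometimes seen; it is the self-closing form used in XHTML documents, though plain <br> is more usual in HTML. Either way, br is a void element: it never has content, so there is no separate opening or closing tag to look for.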
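
To see that both spellings parse to the same thing, here is a small sketch, assuming the built-in "html.parser" backend used above; the two sample strings are invented for the check.

```python
from bs4 import BeautifulSoup

# Parse the same fragment with each spelling of the line break
# and record what the <br> tag looks like in the resulting tree.
results = []
for markup in ("xxxx<br/>yyyy", "xxxx<br>yyyy"):
    soup = BeautifulSoup(markup, "html.parser")
    br = soup.find("br")
    # A void element has no children; the following text is its sibling.
    results.append((br.name, len(br.contents), br.next_sibling))
    print(repr(markup), "->", br.name, "children:", len(br.contents))
```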
