Sorry to bother you with such a simple question, but I'm losing my mind over it.
I'm trying to get one particular piece of information from the following HTML. In this case, I want the XXXX (a piece of text, to be more specific):
<div id="links">
<h3 id="financial">
Financial S<span class="linktype">Commer</span>
</h3>
<hr/>
<a href="http:" target="_blank">Ea</a> | xxxxx<br/>
<a href="http:" target="_blank">We</a> | xxxx<br/>
<a href="http:" target="_blank">HQ</a> | xxxxx<br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="services">
Services<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">To</a> | xxxx <br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="dr">
Dr<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">Eu</a> | xxxx<br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="physical">
Phys<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">Eu</a> | xxxx<br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
I'm using BS4 to work with it:
for x in xpto:
    titulo = x.text  # to get the Name link. Worked
    link = str(x.get("href"))  # To get just the link. Worked too.
    print(titulo)
    print(link)
My issue is how to get just the XXXXX, which is a kind of description of the link. As you can see, it's not inside the 'a' tag, but after the "|" character and, I think, before the "br/" (which, by the way, I don't understand: why is there a "br/" if there is no "br" before it to open it? Is that normal?)
I tried working with previous and next sibling.
for x in xpto:
    desc = x.parent.find_next_sibling('a')
    desc2 = x.parent.find_previous_sibling('b')
    print(desc)
    print(desc2)
Both give me back 'None' as the result. Does anyone know what is happening?
I want to combine this with the other loop. Something like this:
for x in xpto:
    titulo = x.text  # to get the Name link. Worked
    link = str(x.get("href"))  # To get just the link. Worked too.
    desc = x.parent.find_next_sibling('a')
    print(titulo)
    print(desc)
    print(link)
I built the xpto object like this:
xpto = links.find_all(['h3', 'a'])  # which works for the title and link
To be able to get the desc value, I think I should change xpto to something like this:
xpto = links.find_all(['h3', 'a'], a.next.next.strip(' |'))  # it would include the description, and then I would be able to do the loop. But I have no idea how to write such a complex findAll.
Sorry, guys. Web scraping is really hard!
Thank you for your help =D
btw: python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) Macbook Sierra 10.12.6
You could just use next twice, and then strip off the part of the text you don't want. For example:
from bs4 import BeautifulSoup
html = """
<div id="links">
<h3 id="financial">
Financial S<span class="linktype">Commer</span>
</h3>
<hr/>
<a href="http:" target="_blank">Ea</a> | xxxxx<br/>
<a href="http:" target="_blank">We</a> | xxxx<br/>
<a href="http:" target="_blank">HQ</a> | xxxxx<br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="services">
Services<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">To</a> | xxxx <br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="dr">
Dr<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">Eu</a> | xxxx<br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="physical">
Phys<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">Eu</a> | xxxx<br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find('div', id='links')
for el in div.find_all(['a', 'h3']):
    if el.name == 'a':
        if 'target' in el.attrs:  # Only 'a' tags with target
            print("link text '{}', link '{}', desc '{}'".format(el.text, el['href'], el.next.next.strip(' |\n')))
    else:
        el.span.clear()  # Remove 'Commercial Links' (if not needed)
        print("h3_title '{}'".format(el.get_text(strip=True)))
This would display:
h3_title 'Financial S'
link text 'Ea', link 'http:', desc 'xxxxx'
link text 'We', link 'http:', desc 'xxxx'
link text 'HQ', link 'http:', desc 'xxxxx'
h3_title 'Services'
link text 'To', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
h3_title 'Dr'
link text 'Eu', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
h3_title 'Phys'
link text 'Eu', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
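As for why the sibling searches in the question came back as None: for the links of interest, x.parent is the enclosing div, so find_next_sibling('a') searches among that div's siblings rather than the link's own. The description is a bare text node sitting directly after each 'a' tag, so next_sibling on the tag itself should reach it. Below is a minimal sketch of that idea, assuming the markup keeps the same shape as in the question; the helper name link_description and the shortened sample are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened sample with the same shape as the markup in the question.
sample = """
<div id="links">
<h3 id="financial">
Financial S<span class="linktype">Commer</span>
</h3>
<hr/>
<a href="http:" target="_blank">Ea</a> | xxxxx<br/>
<a href="http:" target="_blank">We</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
</div>
"""


def link_description(a_tag):
    """Return the text that directly follows an <a> tag, or '' if there is none.

    Hypothetical helper for this example, not a BeautifulSoup function.
    """
    node = a_tag.next_sibling
    if isinstance(node, NavigableString):
        # Drop the " | " separator and surrounding whitespace.
        return node.strip(" |\n")
    return ""


soup = BeautifulSoup(sample, "html.parser")
links = soup.find("div", id="links")

rows = []
# attrs={"target": True} keeps only <a> tags that have a target attribute,
# which skips the "Back to top" links.
for a in links.find_all("a", attrs={"target": True}):
    rows.append((a.get_text(strip=True), a["href"], link_description(a)))

for row in rows:
    print(row)
```

Collecting tuples instead of printing inside the loop makes it easy to write the rows to a CSV file or a list of dicts later. If a link has no trailing text, the helper returns an empty string rather than raising.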
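As for the br question: br is a void element, so it never has a separate closing tag and nothing needs to "open" it. The <br /> form is sometimes seen; it is used with XHTML documents, though <br> is more usual.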
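To see that the spelling of the tag makes no difference to the parsed result, here is a small sketch that parses three variants with html.parser and compares what comes out. The sample strings are invented for the example.

```python
from bs4 import BeautifulSoup

# Three spellings of the same line break.
variants = ["one<br>two", "one<br/>two", "one<br />two"]
parsed = [BeautifulSoup(v, "html.parser") for v in variants]

# Count the br tags found in each variant.
br_counts = [len(doc.find_all("br")) for doc in parsed]

# Join the text nodes with "|" to show where the break falls.
texts = [doc.get_text("|") for doc in parsed]

print(br_counts)
print(texts)
```

In each case the text after the break ends up as a sibling of the br tag, not a child of it, which is why the description in the question can be read as the text node between the 'a' tag and the br.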