From the following html
html='<tr><th scope="row">Born</th><td><span style="display:none"> (<span class="bday">1994-01-28</span>) </span>28 January 1994<span class="noprint ForceAgeToShow"> (age 23)</span><sup class="reference" id="cite_ref-buenamusica_1-0"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/><a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, <a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'
I wanted to get
['Medellin','Colombia']
So far I've got the following code
from bs4 import BeautifulSoup
soup3=BeautifulSoup(html,'html.parser')
spans=soup3.findAll('tr')
[el.text for el in soup3.find_all('a')]
Which produces
['[1]', 'Medellín', 'Colombia']
However, the first item comes from the link inside the sup element (class "reference"), and I don't want it.
Could you provide clues?
I don't want to pick the 2nd and 3rd positions of the list, since I don't know whether other HTML snippets would also have the first item ([1]).
For this pattern of code:
<tr>
<th scope="row">Born</th>
<td>
<span style="display:none"> (<span class="bday">1994-01-28</span>) </span>
28 January 1994
<span class="noprint ForceAgeToShow"> (age 23)</span>
<sup class="reference" id="cite_ref-buenamusica_1-0">
<a href="#cite_note-buenamusica-1">[1]</a>
</sup>
<br/>
<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>,
<a href="/wiki/Colombia" title="Colombia">Colombia</a>
</td>
</tr>
You could try to use a more specific selector, for example:
soup3=BeautifulSoup(html,'html.parser')
spans=soup3.select('tr>td>a')
[el.text for el in spans]
or
soup3=BeautifulSoup(html,'html.parser')
rows=soup3.select('tr')
# select() returns a list of rows, so query each row for its direct td > a links
[el.text for row in rows for el in row.select('td > a')]
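To see why the more specific selector drops the footnote link, it may help to compare the descendant combinator (a space) with the child combinator (>). The following is only a small sketch: the shortened html string and the variable names all_links and direct_links are made up for illustration and are not part of the original snippet.

```python
from bs4 import BeautifulSoup

# Shortened version of the row from the question (illustrative only).
html = (
    '<tr><th scope="row">Born</th><td>'
    '<sup class="reference"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/>'
    '<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, '
    '<a href="/wiki/Colombia" title="Colombia">Colombia</a>'
    '</td></tr>'
)
soup = BeautifulSoup(html, 'html.parser')

# Descendant combinator (space): every <a> anywhere below <td>,
# including the footnote link nested inside <sup>.
all_links = [a.text for a in soup.select('td a')]

# Child combinator (>): only <a> elements that are direct children of <td>.
direct_links = [a.text for a in soup.select('td > a')]

print(all_links)     # ['[1]', 'Medellín', 'Colombia']
print(direct_links)  # ['Medellín', 'Colombia']
```

The footnote link sits one level deeper (td > sup > a), so 'td > a' never matches it, whether or not a given row contains a footnote.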
The information you are interested in also seems to be present in the title attribute. You could read it instead of text and discard the entries where it is None.
from bs4 import BeautifulSoup
html='<tr><th scope="row">Born</th><td><span style="display:none"> (<span class="bday">1994-01-28</span>) </span>28 January 1994<span class="noprint ForceAgeToShow"> (age 23)</span><sup class="reference" id="cite_ref-buenamusica_1-0"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/><a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, <a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'
soup3=BeautifulSoup(html,'html.parser')
# Keep only links that carry a title attribute; the footnote link has none.
[el.get('title') for el in soup3.find_all('a') if el.get('title') is not None]
# ['Medellín', 'Colombia']
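If some pages have place links without a title attribute, another option might be to skip any link that sits inside a sup element. This is a different technique from the two above, sketched under the assumption that footnote links are always wrapped in sup; the shortened html string and the name places are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the row from the question (illustrative only).
html = (
    '<tr><th scope="row">Born</th><td>'
    '<sup class="reference"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/>'
    '<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, '
    '<a href="/wiki/Colombia" title="Colombia">Colombia</a>'
    '</td></tr>'
)
soup = BeautifulSoup(html, 'html.parser')

# find_parent('sup') returns None when the link has no <sup> ancestor,
# so only links outside footnote markers are kept.
places = [a.get_text() for a in soup.find_all('a') if a.find_parent('sup') is None]

print(places)  # ['Medellín', 'Colombia']
```

Because the check looks at ancestors rather than list positions, it should behave the same for rows with and without a footnote marker.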