
Extract specific tags from HTML using Python Beautiful Soup

From the following HTML:

html='<tr><th scope="row">Born</th><td><span style="display:none"> (<span class="bday">1994-01-28</span>) </span>28 January 1994<span class="noprint ForceAgeToShow"> (age 23)</span><sup class="reference" id="cite_ref-buenamusica_1-0"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/><a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, <a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'

I wanted to get

['Medellin','Colombia']

So far I've got the following code

from bs4 import BeautifulSoup

soup3=BeautifulSoup(html,'html.parser')
spans=soup3.find_all('tr')
[el.text for el in soup3.find_all('a')]

Which produces

['[1]', 'Medellín', 'Colombia']

However, the first item comes from the link inside the <sup class="reference"> element, and I don't want it.

Could you provide clues?

I don't want to pick out the 2nd and 3rd positions of the list by index, since I don't know whether other HTML snippets will contain the first item ([1]).

For this pattern of code:

<tr>
    <th scope="row">Born</th>
    <td>
        <span style="display:none"> (<span class="bday">1994-01-28</span>) </span>
        28 January 1994
        <span class="noprint ForceAgeToShow"> (age 23)</span>
        <sup class="reference" id="cite_ref-buenamusica_1-0">
            <a href="#cite_note-buenamusica-1">[1]</a>
        </sup>
        <br/>
        <a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>,
        <a href="/wiki/Colombia" title="Colombia">Colombia</a>
    </td>
</tr>

You could try to use a more specific selector, for example:

soup3=BeautifulSoup(html,'html.parser')
spans=soup3.select('tr>td>a')
[el.text for el in spans]

or

soup3=BeautifulSoup(html,'html.parser')
rows=soup3.select('tr')
# select() returns a list of rows, so run the CSS selector on each row
[el.text for row in rows for el in row.select('td>a')]
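The selectors above work because the citation link is nested inside a <sup> element, while the place links are direct children of the <td>. As a rough sketch of the same idea without CSS selectors, one could keep only the links that have no <sup> ancestor. The variable names here are illustrative, and the HTML is a shortened copy of the row from the question:

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question (assumption: same structure).
html = (
    '<tr><th scope="row">Born</th><td>28 January 1994'
    '<sup class="reference" id="cite_ref-buenamusica_1-0">'
    '<a href="#cite_note-buenamusica-1">[1]</a></sup><br/>'
    '<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, '
    '<a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'
)

soup = BeautifulSoup(html, 'html.parser')

# Keep a link only if none of its ancestors is a <sup> element.
places = [a.get_text() for a in soup.find_all('a') if a.find_parent('sup') is None]
print(places)  # ['Medellín', 'Colombia']
```

This does not depend on the position of the citation in the list, so it should also behave sensibly on rows that have no [1] link at all.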

The information you are interested in also seems to be present in the title attribute. You could read it instead of text and discard the entries where it is None.

from bs4 import BeautifulSoup

html='<tr><th scope="row">Born</th><td><span style="display:none"> (<span class="bday">1994-01-28</span>) </span>28 January 1994<span class="noprint ForceAgeToShow"> (age 23)</span><sup class="reference" id="cite_ref-buenamusica_1-0"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/><a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, <a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'

soup3=BeautifulSoup(html,'html.parser')
[el.get('title') for el in soup3.find_all('a') if el.get('title') is not None]
# ['Medellín', 'Colombia']
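The two ideas (direct children of <td>, and links that carry a title attribute) can probably be combined in a single CSS selector. The sketch below wraps that in a small helper; the function name linked_titles and the two sample rows are made up for illustration, and it assumes the other pages follow the same row layout as the one in the question:

```python
from bs4 import BeautifulSoup

def linked_titles(row_html):
    """Return the title of every <a> that is a direct child of a <td> and has a title."""
    soup = BeautifulSoup(row_html, 'html.parser')
    return [a['title'] for a in soup.select('td > a[title]')]

# Hypothetical sample rows: one with a citation link, one without.
with_citation = (
    '<tr><th scope="row">Born</th><td>28 January 1994'
    '<sup class="reference"><a href="#cite_note-1">[1]</a></sup><br/>'
    '<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, '
    '<a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'
)
without_citation = (
    '<tr><th scope="row">Born</th><td>28 January 1994<br/>'
    '<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, '
    '<a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'
)

print(linked_titles(with_citation))     # ['Medellín', 'Colombia']
print(linked_titles(without_citation))  # ['Medellín', 'Colombia']
```

Both rows give the same result, so no list index has to be hard-coded for the [1] entry.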

