
Extract specific tags from HTML using Python Beautiful Soup

From the following HTML:

html='<tr><th scope="row">Born</th><td><span style="display:none"> (<span class="bday">1994-01-28</span>) </span>28 January 1994<span class="noprint ForceAgeToShow"> (age 23)</span><sup class="reference" id="cite_ref-buenamusica_1-0"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/><a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, <a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'

I want to get:

['Medellín', 'Colombia']

So far I have the following code:

from bs4 import BeautifulSoup

soup3 = BeautifulSoup(html, 'html.parser')
# Collect the text of every <a> tag in the row
[el.text for el in soup3.find_all('a')]

which produces:

['[1]', 'Medellín', 'Colombia']

However, the first item comes from the link inside the <sup class="reference"> tag, and I don't want it.

Could you give me some pointers?

I don't want to pick the 2nd and 3rd positions of the list by index, since I don't know whether other HTML pages will have the first item ([1]).

For markup with this structure:

<tr>
    <th scope="row">Born</th>
    <td>
        <span style="display:none"> (<span class="bday">1994-01-28</span>) </span>
        28 January 1994
        <span class="noprint ForceAgeToShow"> (age 23)</span>
        <sup class="reference" id="cite_ref-buenamusica_1-0">
            <a href="#cite_note-buenamusica-1">[1]</a>
        </sup>
        <br/>
        <a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>,
        <a href="/wiki/Colombia" title="Colombia">Colombia</a>
    </td>
</tr>

You could try a more specific selector, for example:

soup3 = BeautifulSoup(html, 'html.parser')
# '>' matches direct children only, so the <a> nested inside <sup> is skipped
links = soup3.select('tr > td > a')
[el.text for el in links]

or

soup3 = BeautifulSoup(html, 'html.parser')
# select() returns a list of rows, so apply the narrower selector to each row
rows = soup3.select('tr')
[el.text for row in rows for el in row.select('td > a')]
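
To see why the narrower selector helps, here is a minimal sketch comparing the descendant combinator (a space) with the child combinator (`>`). It assumes a shortened copy of the row from the question with the same nesting; the variable names `all_links` and `direct_links` are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the row from the question (same nesting of <sup> and <a>).
html = (
    '<tr><th scope="row">Born</th><td>28 January 1994'
    '<sup class="reference"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/>'
    '<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, '
    '<a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'
)

soup = BeautifulSoup(html, 'html.parser')

# Descendant combinator (space): any <a> nested anywhere inside <td>,
# including the footnote link wrapped in <sup>.
all_links = [a.get_text() for a in soup.select('td a')]

# Child combinator (>): only <a> tags that are direct children of <td>.
direct_links = [a.get_text() for a in soup.select('td > a')]

print(all_links)     # ['[1]', 'Medellín', 'Colombia']
print(direct_links)  # ['Medellín', 'Colombia']
```

This only works as long as the links you want sit directly under the <td>; if other pages wrap them in another tag, the selector would need adjusting.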

The information you are interested in also seems to be present in the title attribute. You could read that instead of text and discard the entries where it is None.

from bs4 import BeautifulSoup

html='<tr><th scope="row">Born</th><td><span style="display:none"> (<span class="bday">1994-01-28</span>) </span>28 January 1994<span class="noprint ForceAgeToShow"> (age 23)</span><sup class="reference" id="cite_ref-buenamusica_1-0"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/><a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, <a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'

soup3 = BeautifulSoup(html, 'html.parser')
# Keep only the links that carry a title attribute; the footnote link has none
[el.get('title') for el in soup3.find_all('a') if el.get('title') is not None]
# ['Medellín', 'Colombia']
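
If some of the links you want might lack a title attribute, another option is to skip any link that sits inside a <sup> tag. The sketch below is one possible way to do that, not the only one; the helper name `links_outside_references` is hypothetical, and it assumes footnote markers are always wrapped in <sup> as in the question's HTML.

```python
from bs4 import BeautifulSoup


def links_outside_references(html):
    """Return the text of every <a> that has no <sup> ancestor."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        a.get_text()
        for a in soup.find_all('a')
        if a.find_parent('sup') is None  # skip footnote markers such as [1]
    ]


# Shortened version of the row from the question (same nesting).
html = (
    '<tr><th scope="row">Born</th><td>28 January 1994'
    '<sup class="reference"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/>'
    '<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, '
    '<a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'
)

print(links_outside_references(html))  # ['Medellín', 'Colombia']
```

Because the filter looks at ancestors rather than list positions, it gives the same result whether or not a given row contains a footnote link.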
