Extracting specific tags from HTML using Python Beautiful Soup

Question

From the following HTML:

html='<tr><th scope="row">Born</th><td><span style="display:none"> (<span class="bday">1994-01-28</span>) </span>28 January 1994<span class="noprint ForceAgeToShow"> (age 23)</span><sup class="reference" id="cite_ref-buenamusica_1-0"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/><a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, <a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'

I want to get:

['Medellín', 'Colombia']

So far I have the following code:

from bs4 import BeautifulSoup

soup3=BeautifulSoup(html,'html.parser')
spans=soup3.findAll('tr')
[el.text for el in soup3.find_all('a')]

Which produces:

['[1]', 'Medellín', 'Colombia']

However, the first item comes from the link inside the <sup class="reference"> element, and I don't want it.

Could you provide any hints?

I don't want to pick the 2nd and 3rd positions of the list by index, since I don't know whether other HTML snippets will have the first item ([1]).

Answer 1

For this HTML structure:

<tr>
    <th scope="row">Born</th>
    <td>
        <span style="display:none"> (<span class="bday">1994-01-28</span>) </span>
        28 January 1994
        <span class="noprint ForceAgeToShow"> (age 23)</span>
        <sup class="reference" id="cite_ref-buenamusica_1-0">
            <a href="#cite_note-buenamusica-1">[1]</a>
        </sup>
        <br/>
        <a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>,
        <a href="/wiki/Colombia" title="Colombia">Colombia</a>
    </td>
</tr>

You could try a more specific selector that matches only the <a> tags that are direct children of <td>, for example:

soup3=BeautifulSoup(html,'html.parser')
spans=soup3.select('tr>td>a')
[el.text for el in spans]

or

soup3=BeautifulSoup(html,'html.parser')
rows=soup3.select('tr')
# select() returns a list of rows, so apply the CSS selector to each row in turn
[el.text for row in rows for el in row.select('td>a')]
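
To see why the child combinator helps, here is a minimal sketch comparing it with a plain descendant selector. It assumes a shortened copy of the HTML from the question; the variable names `direct_links` and `all_links` are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question (assumption: same structure).
html = (
    '<tr><th scope="row">Born</th><td>'
    '<sup class="reference"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/>'
    '<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, '
    '<a href="/wiki/Colombia" title="Colombia">Colombia</a>'
    '</td></tr>'
)

soup = BeautifulSoup(html, 'html.parser')

# 'td > a' matches only <a> tags that are direct children of <td>;
# the citation link is a child of <sup>, so it is skipped.
direct_links = [a.get_text() for a in soup.select('td > a')]

# Descendant selector for comparison: matches every <a> anywhere under <td>.
all_links = [a.get_text() for a in soup.select('td a')]

print(direct_links)
print(all_links)
```

This only works as long as the wanted links sit directly inside the <td>; if a page wraps them in another element, the selector would need adjusting.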

Answer 2

The information you are interested in also appears in the title attribute. You could read it instead of text and discard the entries where it is None.

from bs4 import BeautifulSoup

html='<tr><th scope="row">Born</th><td><span style="display:none"> (<span class="bday">1994-01-28</span>) </span>28 January 1994<span class="noprint ForceAgeToShow"> (age 23)</span><sup class="reference" id="cite_ref-buenamusica_1-0"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/><a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, <a href="/wiki/Colombia" title="Colombia">Colombia</a></td></tr>'

soup3=BeautifulSoup(html,'html.parser')
# Keep only links that have a title attribute; the citation link [1] has none
[el.get('title') for el in soup3.find_all('a') if el.get('title') is not None]
# ['Medellín', 'Colombia']
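
Two possible variations on the same idea, as a rough sketch rather than a definitive recipe: passing `title=True` to `find_all` keeps only tags that carry a title attribute, and `find_parent('sup')` can be used to skip any link nested inside a <sup>. The HTML is a shortened copy of the question's row, and the names `titles` and `names` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question (assumption: same structure).
html = (
    '<tr><th scope="row">Born</th><td>'
    '<sup class="reference"><a href="#cite_note-buenamusica-1">[1]</a></sup><br/>'
    '<a href="/wiki/Medell%C3%ADn" title="Medellín">Medellín</a>, '
    '<a href="/wiki/Colombia" title="Colombia">Colombia</a>'
    '</td></tr>'
)

soup = BeautifulSoup(html, 'html.parser')

# title=True matches only <a> tags that have a title attribute at all.
titles = [a['title'] for a in soup.find_all('a', title=True)]

# Alternative: keep the link text, skipping any <a> nested inside a <sup>.
names = [a.get_text() for a in soup.find_all('a')
         if a.find_parent('sup') is None]

print(titles)
print(names)
```

The second form does not depend on the links having a title attribute, which may matter if other pages omit it.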

Answer 1
1 2018-01-08 21:52:04

Answer 2
0 Accepted 2018-01-08 21:57:02