Python BeautifulSoup4: break out of a loop when a tag is found

Question

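I am having trouble breaking out of a for loop while walking through HTML with bs4. I want to save a list that is split up by headings. The HTML looks like this, although the real page contains more markup between the tags I need: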

<h2>List One</h2>
<td class="title">
    <a title="Title One">This is Title One</a>
</td>
<td class="title">
    <a title="Title Two">This is Title Two</a>
</td>
<h2>List Two</h2>
<td class="title">
    <a title="Title Three">This is Title Three</a>
</td>
<td class="title">
    <a title="Title Four">This is Title Four</a>
</td>

I want to print the result like this:

List One
This is Title One
This is Title Two
List Two
This is Title Three
This is Title Four

This is how far I have got with my script:

import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('some website')
soup = BeautifulSoup(html, "lxml")

quote1 = soup.h2
print quote1.text

quote2 = quote1.find_next_sibling('h2')
print quote2.text

for quotes in soup.findAll('h2'):
    if quotes.find(text=True) == quote2.text:
        break
    if quotes.find(text=True) == quote1.text:
        for anchor in soup.findAll('td', {'class':'title'}):
            print anchor.text
            print quotes.text

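I am trying to break out of the loop when "quote2" (List Two) is found. But the script prints the content of every td and ignores the next h2 tag. So how do I break the for loop at the next h2 tag?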

Answer 1

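I think the problem is in your HTML syntax. According to https://validator.w3.org, mixing "td" and "h3" (or any heading tag in general) is not valid. Also, implementing a list with tables is most likely not good practice.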

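If you can modify the input file, you could build the list you seem to need with "ul" and "li" tags (the first "li" in each "ul" holding the heading) or, if you need to use a table, simply put your heading inside a "td" tag, or more cleanly inside a "th" tag: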

<table>
<tr>
    <th>Your title</th>
</tr>
<tr>
    <td>Your data</td>
</tr>
</table>
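
With the markup restructured this way, each list lives in its own table and the loop no longer needs a break at all. The following is only a minimal sketch, written for Python 3 and assuming one table per list with the heading in "th" and the entries in "td" cells of class "title". The sample markup and the function name lists_from_tables are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input, restructured as suggested above:
# one table per list, heading in <th>, entries in <td class="title">.
html = """
<table>
<tr><th>List One</th></tr>
<tr><td class="title"><a title="Title One">This is Title One</a></td></tr>
<tr><td class="title"><a title="Title Two">This is Title Two</a></td></tr>
</table>
<table>
<tr><th>List Two</th></tr>
<tr><td class="title"><a title="Title Three">This is Title Three</a></td></tr>
<tr><td class="title"><a title="Title Four">This is Title Four</a></td></tr>
</table>
"""

def lists_from_tables(markup):
    """Return a list of (heading, [titles]) pairs, one per table."""
    soup = BeautifulSoup(markup, "html.parser")
    result = []
    for table in soup.find_all("table"):
        heading = table.find("th").get_text(strip=True)
        # Searching inside the table keeps each list's cells separate.
        titles = [td.get_text(strip=True)
                  for td in table.find_all("td", class_="title")]
        result.append((heading, titles))
    return result

for heading, titles in lists_from_tables(html):
    print(heading)
    for title in titles:
        print(title)
```

Because find_all is called on each table rather than on the whole soup, the cells of one list cannot leak into another.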

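If the input is not under your control, your script could still run a search and replace on the input text to put the headings into table cells or list items.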
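
As an alternative to rewriting the input text, the grouping can also be done on the unmodified markup by visiting the h2 and td tags in document order and starting a new group at every h2. This is a different technique from the search and replace described above, and only a sketch for Python 3: it uses the "html.parser" backend, which keeps the stray td tags as they are (other parsers may repair the invalid markup differently), and the function name group_titles is made up for illustration.

```python
from bs4 import BeautifulSoup

# The markup from the question, unchanged.
html = """
<h2>List One</h2>
<td class="title">
    <a title="Title One">This is Title One</a>
</td>
<td class="title">
    <a title="Title Two">This is Title Two</a>
</td>
<h2>List Two</h2>
<td class="title">
    <a title="Title Three">This is Title Three</a>
</td>
<td class="title">
    <a title="Title Four">This is Title Four</a>
</td>
"""

def group_titles(markup):
    """Return a list of (heading, [titles]) pairs in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    groups = []
    # Passing a list of names makes find_all return both kinds of tag
    # in the order they appear in the document.
    for tag in soup.find_all(["h2", "td"]):
        if tag.name == "h2":
            # Every heading starts a new group.
            groups.append((tag.get_text(strip=True), []))
        elif "title" in tag.get("class", []) and groups:
            # A title cell belongs to the most recent heading.
            groups[-1][1].append(tag.get_text(strip=True))
    return groups

for heading, titles in group_titles(html):
    print(heading)
    for title in titles:
        print(title)
```

The original script printed every cell because soup.findAll('td', ...) searches the whole document each time. Here each cell is visited once and attached to the heading that precedes it.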
