Scrape text after blockquote with bs4

Question

I have something like this in my HTML:

<p align="left"><strong><tt>
        some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
        some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>

My Python code:

import requests
from bs4 import BeautifulSoup

# `site` holds the URL of the page being scraped
page = requests.get(site)
soup = BeautifulSoup(page.content, 'html.parser')
rounds = soup.find('p', align="left")
matches_links = rounds.find_all('a')

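I get all the links, along with some comments and text, but nothing after </blockquote></blockquote>. These two blockquote tags are not visible in the page source; I only see them in soup when I debug my Python code. soup contains all of the HTML, but in rounds the code ends with <tt>text after comment</tt></p>.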

Is there any way to get "link i want" and "text i want"?

Answer 1

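If you look at the HTML code, you will see that there is a </p> just before </blockquote></blockquote>. The <p> element ends there, so your variable rounds does not contain the link you want. Search for the next <a> after this <p> tag instead: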

from bs4 import BeautifulSoup


txt = '''
<p align="left"><strong><tt>
        some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
        some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>
'''

soup = BeautifulSoup(txt, 'html.parser')

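# `~` is the CSS general sibling combinator: it matches <a> elements that
# follow the <p> as siblings; select_one() returns the first such match
matched_link = soup.select_one('p[align="left"] ~ a')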
print(matched_link)

Prints:

<a href="link i want"><tt>text i want</tt></a>
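
To pull out the URL and the visible text separately, one option is to start from the <p> element and step to its next <a> sibling. The following is only a minimal sketch: it uses a shortened version of the markup above (the `html` string is an assumption made for illustration, not the real page), and it assumes the wanted link is the first <a> sibling after the <p>, as in the sample.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (assumption for illustration)
html = '''
<p align="left"><tt>some text:</tt><a href="some link"><tt>some other text</tt></a>
<tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a>
'''

soup = BeautifulSoup(html, 'html.parser')
rounds = soup.find('p', align='left')

# Links inside the <p> itself: the wanted link is not among them,
# because the <p> is closed before the stray </blockquote> tags
inside = [a.get('href') for a in rounds.find_all('a')]

# First <a> that follows the <p> as a sibling
link = rounds.find_next_sibling('a')
if link is not None:
    href = link.get('href')            # value of the href attribute
    text = link.get_text(strip=True)   # text of the nested <tt>
    print(href, '|', text)
```

If the real page has more links after the <p>, `rounds.find_next_siblings('a')` returns all of them as a list, at the cost of having to filter for the one you need.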
