Webscraping with BeautifulSoup FindAll

I want to download the hrefs of the four articles right above NEED TO KNOW on the following website:

http://www.marketwatch.com/

but I cannot identify them uniquely with FindAll. The following approaches give me those articles, but also a bunch of others that fit the same criteria:

# Attempt 1: match every anchor with class "link"
trend_articles = soup1.findAll("a", {"class": "link"})
hrefs = [article["href"] for article in trend_articles]

# Attempt 2: match every div with class "content--secondary"
trend_articles = soup1.findAll("div", {"class": "content--secondary"})
hrefs = [article.a["href"] for article in trend_articles]

Does someone have a suggestion for how I can get those four, and only those four, articles?

This seems to work for me:

from bs4 import BeautifulSoup
import requests

page = requests.get("http://www.marketwatch.com/").content
soup = BeautifulSoup(page, 'lxml')

# Anchor on the secondary header, then take the sibling div that holds the trending list
header_secondary = soup.find('header', {'class': 'header--secondary'})
trend_articles = header_secondary.find_next_siblings('div', {'class': 'group group--list '})[0].findAll('a')

# article.contents[0] is the link text; use article['href'] if you want the links themselves
trend_articles = [article.contents[0] for article in trend_articles]
print(trend_articles)
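The key idea in the answer is to anchor on a uniquely identifiable neighbor (the secondary header) and navigate to its sibling, rather than matching classes shared by many elements. Since the live MarketWatch markup changes over time, here is a minimal sketch of the same sibling-navigation technique on a static HTML snippet (the snippet and its class names mirror the answer's assumptions; they are illustrative, not the current page):

```python
from bs4 import BeautifulSoup

# Static stand-in for the page structure the answer relies on: a secondary
# header followed by a sibling div that contains exactly the wanted links.
html = """
<a class="link" href="/elsewhere">Unrelated link with the shared class</a>
<header class="header--secondary"><h2>Trending</h2></header>
<div class="group group--list">
  <ul>
    <li><a class="link" href="/story/one">Article one</a></li>
    <li><a class="link" href="/story/two">Article two</a></li>
    <li><a class="link" href="/story/three">Article three</a></li>
    <li><a class="link" href="/story/four">Article four</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Anchor on the unique header, then step to the sibling list container.
header = soup.find("header", {"class": "header--secondary"})
group = header.find_next_sibling("div", class_="group--list")

# Extract the hrefs (what the question actually asked for), not the titles.
hrefs = [a["href"] for a in group.find_all("a")]
print(hrefs)
```

Note that the unrelated `a.link` at the top is not picked up, because the search is scoped to the sibling `div` rather than the whole document; that scoping is what makes the selection unique.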
