
Web scraping with Python and Beautiful Soup

I am practicing building web scrapers. The one I am working on now goes to a site, scrapes the links for the various cities listed there, then follows each city link and scrapes the links for all the properties in that city.

I'm using the following code:

import requests

from bs4 import BeautifulSoup

main_url = "http://www.chapter-living.com/"

# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title")  # Bottom page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")]  # Links to cities

If I print out city_tags I get the HTML I want. However, when I print cities_links I get AttributeError: 'ResultSet' object has no attribute 'find_all'.

I gather from other questions on here that this error occurs because city_tags returns None, but that can't be the case if it prints the desired HTML, can it? I have noticed that the HTML is wrapped in [] - does this make a difference?

Well, city_tags is a bs4.element.ResultSet (essentially a list) of tags, and you are calling find_all on it. You probably want to call find_all on every element of the ResultSet or, in this specific case, just retrieve each element's href attribute:

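from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

main_url = "http://www.chapter-living.com/"

# Getting individual cities url
response = requests.get(main_url)  # named response so the stdlib re module is not shadowed
soup = BeautifulSoup(response.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title")  # Bottom page not loaded dynamically
# Iterate over the ResultSet directly; urljoin copes with both relative and absolute hrefs
cities_links = [urljoin(main_url, tag["href"]) for tag in city_tags]  # Links to cities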
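For the other case mentioned above, where find_all really is needed on every element, a minimal sketch might look like the following. The HTML snippet, the "city" class and the hrefs are made up for illustration and are not taken from the real site; the point is only that find_all belongs to each Tag, not to the ResultSet that holds them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page (structure is an assumption).
html = """
<ul>
  <li class="city"><a href="/london">London</a><a href="/london/map">Map</a></li>
  <li class="city"><a href="/leeds">Leeds</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
city_items = soup.find_all("li", class_="city")  # a ResultSet, which behaves like a list

# Calling find_all on the ResultSet itself fails, just as in the question.
try:
    city_items.find_all("a")
except AttributeError as exc:
    print("AttributeError:", exc)

# Calling find_all on each Tag inside the ResultSet works.
nested_links = [a["href"] for item in city_items for a in item.find_all("a")]
print(nested_links)  # ['/london', '/london/map', '/leeds']
```

This also explains the [] seen when printing city_tags: a ResultSet prints like a list because it is one.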

As the error says, city_tags is a ResultSet, which is a list of nodes and has no find_all method. You either have to loop through the set and apply find_all to each individual node or, in your case, I think you can simply extract the href attribute from each node:

[tag['href'] for tag in city_tags]

#['https://www.chapter-living.com/blog/',
# 'https://www.chapter-living.com/testimonials/',
# 'https://www.chapter-living.com/events/']
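The question also describes a second stage: following each city link and collecting the property links on that page. A rough sketch of that two-level crawl is below. It rests on several assumptions: the helper names extract_links and crawl, the "property-link" class, and the example.com pages are all hypothetical, and the real site's markup would need to be inspected to pick the right selectors. The fetch argument is any callable that returns HTML for a URL, so the logic can be tried offline before pointing it at requests.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url, css_class):
    """Return absolute URLs of every <a> tag that has the given class and an href."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a", class_=css_class, href=True)
    # urljoin leaves absolute hrefs untouched and resolves relative ones against base_url
    return [urljoin(base_url, tag["href"]) for tag in tags]


def crawl(main_url, fetch, city_class="nav-title", property_class="property-link"):
    """Map each city URL to the property URLs found on that city's page.

    property_class is a placeholder; the real class name is an assumption.
    """
    city_links = extract_links(fetch(main_url), main_url, city_class)
    return {
        city: extract_links(fetch(city), city, property_class)
        for city in city_links
    }


# Offline demo with made-up pages instead of real HTTP requests.
pages = {
    "http://example.com/": '<a class="nav-title" href="/london/">London</a>',
    "http://example.com/london/": '<a class="property-link" href="/london/kings-cross/">KX</a>',
}
result = crawl("http://example.com/", pages.__getitem__)
print(result)
# {'http://example.com/london/': ['http://example.com/london/kings-cross/']}
```

Against a live site, fetch could be something like lambda url: requests.get(url, timeout=10).text, ideally with a short pause between requests.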
