Strange BeautifulSoup soup.findAll error: not working within functions
I'm trying to build a very simple scraper to harvest links as part of a crawler project. I set up the following function to do the scraping:
import requests as rq
from bs4 import BeautifulSoup

def getHomepageLinks(page):
    homepageLinks = []
    response = rq.get(page)
    text = response.text
    soup = BeautifulSoup(text)
    for a in soup.findAll('a'):
        homepageLinks.append(a['href'])
    return homepageLinks
I saved this file as "scraper2.py". When I try to run the code, I get the following error:
>>> import scraper2 as sc
>>> sc.getHomepageLinks('http://washingtonpost.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scraper2.py", line 9, in getHomepageLinks
    for a in soup.findAll('a'):
TypeError: 'NoneType' object is not callable
Now for the strange part: if I debug by running the same steps directly in the interpreter and printing the links, it works fine:
>>> response = rq.get('http://washingtonpost.com')
>>> text = response.text
>>> soup = BeautifulSoup(text)
>>> for a in soup.findAll('a'):
...     print(a['href'])
...
https://www.washingtonpost.com
#
#
http://www.washingtonpost.com/politics/
https://www.washingtonpost.com/opinions/
http://www.washingtonpost.com/sports/
http://www.washingtonpost.com/local/
http://www.washingtonpost.com/national/
http://www.washingtonpost.com/world/
...
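If I'm reading the error message correctly, the problem is with soup.findAll, but only when findAll is called inside a function. I'm sure I've spelled it correctly (not findall or Findall, which is the cause of many of the similar errors reported here), and I've tried the lxml fix suggested in an earlier post, but it didn't solve the problem. Does anyone have any ideas?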
Try replacing your for loop with the following:
for a in soup.findAll('a'):
    url = a.get("href")
    if url is not None:
        homepageLinks.append(url)
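To see how that loop behaves on anchors that have no href, here is a minimal self-contained sketch. It parses an HTML string instead of fetching a page, so it runs without network access. The function name extract_links and the sample_html string are made up for illustration and are not part of the original scraper2.py; passing "html.parser" explicitly is also my assumption, since the original code does not name a parser.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag in html that has one."""
    # Assumption: use the parser bundled with Python so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a"):  # find_all is the bs4 spelling of findAll
        url = a.get("href")  # None when the tag has no href attribute
        if url is not None:
            links.append(url)
    return links


# Hypothetical sample markup: the second anchor has no href.
sample_html = (
    '<a href="/politics/">Politics</a>'
    '<a name="top">Top</a>'
    '<a href="#">Skip</a>'
)
print(extract_links(sample_html))  # ['/politics/', '#']
```

The difference from the code in the question is that a['href'] raises KeyError on an anchor without an href, while a.get("href") returns None, which the loop then skips.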