BeautifulSoup "AttributeError: 'NoneType' object has no attribute 'text'"
I was scraping Google's weather search results with bs4, and Python can't find a <span> tag even though it is there. How can I solve this problem?
I tried to find this <span> by its class and by its id, but both attempts failed.
<div id="wob_dcp">
<span class="vk_gy vk_sh" id="wob_dc">Clear with periodic clouds</span>
</div>
Above is the HTML I was trying to scrape from the page:
response = requests.get('https://www.google.com/search?hl=ja&ei=coGHXPWEIouUr7wPo9ixoAg&q=%EC%9D%BC%EB%B3%B8+%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E+%EB%82%B4%EC%9D%BC+%EB%82%A0%EC%94%A8&oq=%EC%9D%BC%EB%B3%B8+%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E+%EB%82%B4%EC%9D%BC+%EB%82%A0%EC%94%A8&gs_l=psy-ab.3...232674.234409..234575...0.0..0.251.929.0j6j1......0....1..gws-wiz.......35i39.yu0YE6lnCms')
soup = BeautifulSoup(response.content, 'html.parser')
tomorrow_weather = soup.find('span', {'id': 'wob_dc'}).text
But this code failed, with the following error:
Traceback (most recent call last):
File "C:\Users\sungn_000\Desktop\weather.py", line 23, in <module>
tomorrow_weather = soup.find('span', {'id': 'wob_dc'}).text
AttributeError: 'NoneType' object has no attribute 'text'
Please help me resolve this error.
This is because the weather section is rendered by the browser via JavaScript, so when you use requests you only get the HTML content of the page, which doesn't contain what you need. You should use, for example, selenium (or requests-html) if you want to parse a page with elements rendered by a web browser.
from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.google.com/search?hl=en&ei=coGHXPWEIouUr7wPo9ixoAg&q=%EC%9D%BC%EB%B3%B8%20%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E%20%EB%82%B4%EC%9D%BC%20%EB%82%A0%EC%94%A8&oq=%EC%9D%BC%EB%B3%B8%20%E6%A1%9C%E5%B7%9D%E5%B8%82%E7%9C%9F%E5%A3%81%E7%94%BA%E5%8F%A4%E5%9F%8E%20%EB%82%B4%EC%9D%BC%20%EB%82%A0%EC%94%A8&gs_l=psy-ab.3...232674.234409..234575...0.0..0.251.929.0j6j1......0....1..gws-wiz.......35i39.yu0YE6lnCms')
# Execute the page's JavaScript so the weather widget is actually present
response.html.render()
soup = BeautifulSoup(response.html.html, 'html.parser')
tomorrow_weather = soup.find('span', {'id': 'wob_dc'}).text
print(tomorrow_weather)
Output:
pawel@pawel-XPS-15-9570:~$ python test.py
Clear with periodic clouds
>>> from bs4 import BeautifulSoup
>>> a = '<div id="wob_dcp">\n <span class="vk_gy vk_sh" id="wob_dc">Clear with periodic clouds</span> \n</div>'
>>> soup = BeautifulSoup(a, 'html.parser')
>>> soup.find("span", id="wob_dc").text
'Clear with periodic clouds'
Try this out.
It's not rendered via JavaScript as pawelbylina mentioned, so you don't have to use requests-html or selenium: everything you need is already in the HTML, and page rendering would slow down the scraping process a lot.
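Whichever approach you take, the AttributeError itself can be avoided by checking the return value of find() before accessing .text, since find() returns None when nothing matches. A minimal sketch, with the question's HTML inlined as a string for illustration:

```python
from bs4 import BeautifulSoup

html = '<div id="wob_dcp"><span class="vk_gy vk_sh" id="wob_dc">Clear with periodic clouds</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns None when no element matches, so guard before touching .text
tag = soup.find('span', {'id': 'wob_dc'})
if tag is not None:
    print(tag.text)  # Clear with periodic clouds
else:
    print('span#wob_dc not found - inspect the HTML you actually fetched')

# A missing id hits the guard instead of raising AttributeError
missing = soup.find('span', {'id': 'no_such_id'})
print(missing is None)  # True
```

Printing the fetched HTML in the else branch is usually the fastest way to see whether Google returned a block page instead of the weather widget.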
It could be because there's no user-agent specified, so Google blocks your request and you receive a different HTML page containing some sort of error, because the default requests user-agent is python-requests. Google recognizes it and blocks the request since it's not a "real" user visit. Check what your user-agent is.
Pass the user-agent in the request headers:
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
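To see why this matters, you can inspect the headers that requests attaches by default; the User-Agent really does identify the client as python-requests:

```python
import requests

# Headers requests sends when you don't override them
default_headers = requests.utils.default_headers()
ua = default_headers['User-Agent']
print(ua)  # e.g. python-requests/2.31.0 (the version depends on your install)
```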
You're looking for this; use select_one() to grab just one element:
soup.select_one('#wob_dc').text
Have a look at the SelectorGadget Chrome extension, which lets you grab CSS selectors by clicking on the desired elements in your browser.
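As a quick sketch of the equivalence, select_one() with a CSS id selector and find() with an id argument return the same element, so either works here (the HTML string is the snippet from the question):

```python
from bs4 import BeautifulSoup

html = '<div id="wob_dcp"><span class="vk_gy vk_sh" id="wob_dc">Clear with periodic clouds</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: '#wob_dc' matches the element with id="wob_dc"
via_css = soup.select_one('#wob_dc').text
# Equivalent lookup with find()
via_find = soup.find('span', id='wob_dc').text
print(via_css)              # Clear with periodic clouds
print(via_css == via_find)  # True
```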
Code and a full example that scrapes more, in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "일본 桜川市真壁町古城 내일 날씨",
"hl": "en",
}
response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')
location = soup.select_one('#wob_loc').text
weather_condition = soup.select_one('#wob_dc').text
temperature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text
print(f'Location: {location}\n'
f'Weather condition: {weather_condition}\n'
f'Temperature: {temperature}°F\n'
f'Precipitation: {precipitation}\n'
f'Humidity: {humidity}\n'
f'Wind speed: {wind}\n'
f'Current time: {current_time}\n')
------
'''
Location: Makabecho Furushiro, Sakuragawa, Ibaraki, Japan
Weather condition: Cloudy
Temperature: 79°F
Precipitation: 40%
Humidity: 81%
Wind speed: 7 mph
Current time: Saturday
'''
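For reference, requests builds the final URL by URL-encoding the params dict into the query string; the same encoding with only the standard library looks like this:

```python
from urllib.parse import urlencode

params = {
    "q": "일본 桜川市真壁町古城 내일 날씨",
    "hl": "en",
}
# Non-ASCII values are percent-encoded; plain ASCII like hl=en passes through
query = urlencode(params)
print(f"https://www.google.com/search?{query}")
```

Letting requests do this via the params argument is less error-prone than pasting a pre-encoded URL like the one in the question.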
Alternatively, you can achieve the same thing by using the Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to think about how to bypass Google's block, or figure out why data from certain elements isn't being extracted as it should be, since that's already done for the end user. The only thing that needs to be done is to iterate over the structured JSON and grab the data you want.
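A sketch of that iteration: treating the parsed response as a plain dict and using .get() avoids a KeyError when a field is absent. The answer_box structure below is an abbreviated stand-in assumed from the keys used in the original code, not a full SerpApi response:

```python
# Abbreviated stand-in for the parsed JSON (assumed structure, for illustration)
results = {
    "answer_box": {
        "location": "Makabecho Furushiro, Sakuragawa, Ibaraki, Japan",
        "weather": "Cloudy",
        "temperature": "79",
    }
}

answer_box = results.get("answer_box", {})
# .get() returns None instead of raising KeyError for missing fields
weather = answer_box.get("weather")
precipitation = answer_box.get("precipitation")  # absent here -> None
print(weather, precipitation)
```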
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
"engine": "google",
"q": "일본 桜川市真壁町古城 내일 날씨",
"api_key": os.getenv("API_KEY"),
"hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
loc = results['answer_box']['location']
weather_date = results['answer_box']['date']
weather = results['answer_box']['weather']
temp = results['answer_box']['temperature']
precipitation = results['answer_box']['precipitation']
humidity = results['answer_box']['humidity']
wind = results['answer_box']['wind']
print(f'{loc}\n{weather_date}\n{weather}\n{temp}°F\n{precipitation}\n{humidity}\n{wind}\n')
--------
'''
Makabecho Furushiro, Sakuragawa, Ibaraki, Japan
Saturday
Cloudy
79°F
40%
81%
7 mph
'''
Disclaimer: I work for SerpApi.
I also had this problem. You should not import like this:
from bs4 import BeautifulSoup
You should import like this:
from bs4 import *
This should work.