[英]Can't parse a Google search result page using BeautifulSoup
我在 python 中使用來自 bs4 的 BeautifulSoup 解析網頁。 當我檢查 google 搜索頁面的元素時,這是第一個結果的部門:
因為它有class = 'r'
我寫了這段代碼:
import requests
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%\22scams%22+%\22frauds%22+%\22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%\22scams%22+%\22frauds%22+%\22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5')
from bs4 import BeautifulSoup
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
但是命令提示符只返回[]
可能出了什么問題以及如何糾正它?
另外, 這是網頁。
編輯 1:我通過添加標題字典相應地編輯了我的代碼,但結果是相同的[]
。 這是新代碼:
import requests
headers = {
'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22cams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5', headers = headers)
from bs4 import BeautifulSoup
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
注意:當我告訴它打印整個頁面時,沒有問題,或者當我使用list(page.children)
,它工作正常。
某些網站需要設置User-Agent
標頭以防止來自非瀏覽器的虛假請求。 但是,幸運的是有一種方法可以將標頭傳遞給請求
# Define a dictionary of http request headers
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
# Pass in the headers as a parameterized argument
requests.get(url, headers=headers)
注意:可以在此處找到用戶代理列表
>>> give_me_everything = soup.find_all('div', class_='yuRUbf')
Prints a bunch of stuff.
>>> give_me_everything_v2 = soup.select('.yuRUbf')
Prints a bunch of stuff.
請注意,您不能執行以下操作:
>>> give_me_everything = soup.find_all('div', class_='yuRUbf').text
AttributeError: You're probably treating a list of elements like a single element.
>>> for all in soup.find_all('div', class_='yuRUbf'):
print(all.text)
Prints a bunch of stuff.
代碼:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
"Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q="narendra modi" "scams" "frauds" "corruption" "modi" -lalit -nirav', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')
give_me_everything = soup.find_all('div', class_='yuRUbf')
print(give_me_everything)
或者,您可以使用來自 SerpApi 的Google Search Engine Results API做同樣的事情。 這是一個付費 API,可免費試用 5,000 次搜索。
主要區別在於,當某些東西不起作用時,您不必提供不同的解決方案,因此不必維護解析器。
集成代碼:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": 'narendra modi" "scams" "frauds" "corruption" "modi" -lalit -nirav',
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
title = result['title']
link = result['link']
displayed_link = result['displayed_link']
print(f'{title}\n{link}\n{displayed_link}\n')
----------
Opposition Corners Modi Govt On Jay Shah Issue, Rafael ...
https://www.outlookindia.com/website/story/no-confidence-vote-opposition-corners-modi-govt-on-jay-shah-issue-rafael-deals-c/313790
https://www.outlookindia.com
Modi, Rahul and Kejriwal describe one another as frauds ...
https://www.business-standard.com/article/politics/modi-rahul-and-kejriwal-describe-one-another-as-frauds-114022400019_1.html
https://www.business-standard.com
...
免責聲明,我為 SerpApi 工作。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.