How to scrape multiple Google pages with Python and BeautifulSoup
I wrote code that scrapes Google News search results, but it always scrapes only the first page. How can I write a loop that lets me scrape the first 2, 3, ... n pages?
I know that I need to add a page parameter to the url and put everything into a for loop, but I do not know how.
This code gives me the headline, snippet, and date from the first search results page:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'usa'
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)  # I know that I need to add a page parameter here, but I do not know how
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
headline_text = soup.find_all('h3', class_="r dO0Ag")
snippet_text = soup.find_all('div', class_='st')
news_date = soup.find_all('div', class_='slp')
Also, can this pagination logic for Google News be applied to, for example, Bing News or Yahoo News? I mean, can I use the same parameters, or are the urls different?
I think you need to change your url. Try the code below and see if it works.
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'usa'
page = 0
while True:
    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, page)
    print(url)
    response = requests.get(url, headers=headers, verify=False)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'html.parser')
    headline_text = soup.find_all('h3', class_="r dO0Ag")
    snippet_text = soup.find_all('div', class_='st')
    news_date = soup.find_all('div', class_='slp')
    page = page + 10
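The key idea in the loop above is that Google's `start` parameter advances by 10 per result page. A minimal sketch of that idea, which builds the paginated URLs up front and keeps only the essential query parameters (the session-specific ones from the URL above are dropped here, on the assumption that `q`, `tbm`, and `start` are what drive pagination):

```python
# Build the paginated search URLs without making any requests.
# Result page n begins at start = n * 10 (an assumption based on
# the page = page + 10 step in the loop above).
def build_news_urls(term, num_pages):
    base = 'https://www.google.com/search?q={}&tbm=nws&start={}'
    return [base.format(term, page * 10) for page in range(num_pages)]

for url in build_news_urls('usa', 3):
    print(url)  # start=0, start=10, start=20
```

Each of these URLs can then be fetched and parsed exactly as in the loop above.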
Code and a full example to test in an online IDE:
from bs4 import BeautifulSoup
import requests, urllib.parse

def paginate(url, previous_url=None):
    # Break from infinite recursion
    if url == previous_url: return

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    # First page
    yield soup

    next_page_node = soup.select_one('a#pnnext')

    # Stop when there is no next page
    if next_page_node is None: return

    next_page_url = urllib.parse.urljoin('https://www.google.com/',
                                         next_page_node['href'])

    # Pages after the first one
    yield from paginate(next_page_url, url)

def scrape():
    pages = paginate(
        "https://www.google.com/search?hl=en-US&q=coca+cola&tbm=nws")

    for soup in pages:
        print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
        print()

        for data in soup.findAll('div', class_='dbsr'):
            title = data.find('div', class_='JheGif nDgy9d').text
            link = data.a['href']

            print(f'Title: {title}')
            print(f'Link: {link}')
            print()
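The recursion above stops at the last page because `soup.select_one('a#pnnext')` returns None once Google renders no "Next" link; before that point, `urllib.parse.urljoin` resolves the link's relative `href` against the Google origin. That resolution step can be checked in isolation (the `href` value here is illustrative, not taken from a real Google response):

```python
import urllib.parse

# Illustrative relative href of the kind found on the a#pnnext node
next_href = '/search?q=coca+cola&tbm=nws&start=10'

next_page_url = urllib.parse.urljoin('https://www.google.com/', next_href)
print(next_page_url)  # https://www.google.com/search?q=coca+cola&tbm=nws&start=10
```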
Alternatively, you can achieve the same thing with the Google News Results API from SerpApi. It is a paid API with a free plan.
The difference in your case is that it supports multiple search engines, and the setup process is quick and straightforward. You do not have to maintain the parser or figure out how to bypass blocks from Google or other engines, or how to extract certain elements, since that is already done for the end user.
Code to integrate:
# https://github.com/serpapi/google-search-results-python
from serpapi import GoogleSearch
import os

def scrape():
    params = {
        "engine": "google",
        "q": "gta san andreas",
        "tbm": "nws",
        "api_key": os.getenv("API_KEY"),
    }

    search = GoogleSearch(params)
    pages = search.pagination()

    for result in pages:
        print(f"Current page: {result['serpapi_pagination']['current']}")

        for news_result in result["news_results"]:
            print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
P.S. I wrote a more detailed blog post about how to scrape Google News.
Disclaimer: I work for SerpApi.