
How to scrape multiple google pages with Python and BeautifulSoup

I wrote code that scrapes Google News search results, but it only ever scrapes the first page. How can I write a loop that lets me scrape the first 2, 3, … n pages?

I know that I need to add a page parameter to the url and put everything into a for loop, but I do not know how.

This code gives me the titles, snippets, and dates from the first search results page:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'usa'
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)  # I know I need to add a page parameter here, but I do not know how

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

headline_text = soup.find_all('h3', class_= "r dO0Ag")

snippet_text = soup.find_all('div', class_='st')

news_date = soup.find_all('div', class_='slp')

Also, can this pagination logic for google news be applied to, for example, bing news or yahoo news? That is, can I use the same parameters, or are the urls different?

I think you need to change your url. Try the code below and see whether it works.

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'usa'
page = 0

while True:
    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, page)
    print(url)

    response = requests.get(url, headers=headers, verify=False)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'html.parser')

    headline_text = soup.find_all('h3', class_='r dO0Ag')
    snippet_text = soup.find_all('div', class_='st')
    news_date = soup.find_all('div', class_='slp')

    # the `start` parameter advances by 10 per results page
    page += 10
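Stripped of the session-specific query parameters, the key point is that Google's `start` parameter advances in steps of 10, one step per results page. Here is a minimal sketch that builds the URLs for the first n pages (the fetch-and-parse loop then runs once per URL); the bare `q`/`tbm`/`start` template is an assumption that drops the session tokens:

```python
def page_urls(term, num_pages):
    # Google offsets each news results page by 10 via the `start` parameter:
    # page 1 -> start=0, page 2 -> start=10, page 3 -> start=20, ...
    base = 'https://www.google.com/search?q={}&tbm=nws&start={}'
    return [base.format(term, page * 10) for page in range(num_pages)]

urls = page_urls('usa', 3)
# feed each url to the requests/BeautifulSoup loop shown above
```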

Code and full example to test in the online IDE:

from bs4 import BeautifulSoup
import requests, urllib.parse

def paginate(url, previous_url=None):
    # Break from infinite recursion
    if url == previous_url: return

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    # First page
    yield soup

    next_page_node = soup.select_one('a#pnnext')

    # Stop when there is no next page
    if next_page_node is None: return

    next_page_url = urllib.parse.urljoin('https://www.google.com/',
                                         next_page_node['href'])

    # Pages after the first one
    yield from paginate(next_page_url, url)


def scrape():
    pages = paginate(
        "https://www.google.com/search?hl=en-US&q=coca+cola&tbm=nws")

    for soup in pages:
        print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
        print()

        for data in soup.findAll('div', class_='dbsr'):
            title = data.find('div', class_='JheGif nDgy9d').text
            link = data.a['href']

            print(f'Title: {title}')
            print(f'Link: {link}')
            print()
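The `urllib.parse.urljoin` call above matters because the `href` on the `a#pnnext` node is a relative path; joining it against the site root yields the absolute URL of the next page. A quick illustration with a hypothetical href:

```python
import urllib.parse

# hypothetical relative href as found on the "Next" link (a#pnnext)
next_href = '/search?q=coca+cola&tbm=nws&start=10'
next_url = urllib.parse.urljoin('https://www.google.com/', next_href)
print(next_url)  # https://www.google.com/search?q=coca+cola&tbm=nws&start=10
```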


Alternatively, you can achieve the same thing with the Google News Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that it supports multiple search engines, and the setup is quick and straightforward. You don't have to maintain a parser or figure out how to bypass blocks from Google or other engines, or how to extract certain elements, because that is already done for the end user.

Code to integrate:

# https://github.com/serpapi/google-search-results-python
from serpapi import GoogleSearch
import os

def scrape():
    params = {
        "engine": "google",
        "q": "gta san andreas",
        "tbm": "nws",
        "api_key": os.getenv("API_KEY"),
    }

    search = GoogleSearch(params)
    pages = search.pagination()

    for result in pages:
        print(f"Current page: {result['serpapi_pagination']['current']}")

        for news_result in result["news_results"]:
            print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

P.S. I wrote a more detailed blog post about how to scrape Google News.

Disclaimer: I work for SerpApi.
