Scraping Google search results with BeautifulSoup

Question

My goal is to scrape Google search results with BeautifulSoup. I am using Anaconda Python with IPython as the IDE console. Why do I get no output when I run the following?

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mosilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class':'st'})
        print sSpan

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('english')

Answer 1

You never add anything to linkdictionary:

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mosilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        sSpan = li.find('span', attrs={'class':'st'})

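        # Use the same name as the dictionary created above.
        # Note: these two keys are overwritten on every pass of the loop.
        linkdictionary['href'] = sLink['href']
        linkdictionary['sSpan'] = sSpan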

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('english')
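
Because the loop above assigns to the same two keys ('href' and 'sSpan') on every pass, the returned dictionary ends up holding only the last result. A minimal sketch of one way to keep every result is to key the dictionary by the link itself. The sample data and the function name `collect_links` are made up for illustration; in the real scraper the pairs would come from `sLink['href']` and `sSpan`.

```python
def collect_links(results):
    """Build a {href: snippet} dictionary from (href, snippet) pairs."""
    linkdictionary = {}
    for href, snippet in results:
        # A distinct key per result, so earlier entries are not overwritten.
        linkdictionary[href] = snippet
    return linkdictionary


# Hypothetical sample data standing in for parsed search results.
sample = [
    ("https://example.com/a", "First snippet"),
    ("https://example.com/b", "Second snippet"),
]

links = collect_links(sample)
print(len(links))  # 2
```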

Answer 2

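As Cody Bouche mentioned, the problem is that nothing is ever added to the dict(). In my opinion, you will have a hard time updating your dict unless you change the {} (dict) to a [] (list).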

Appending to a list is much simpler (note: I could be wrong here; this is just a personal opinion based on past experience).

To make it work in a simple way, change the dict to a list, {} --> [], and then use .append({}) to add each result to the list().

Code and example in the online IDE:

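import json

import requests
from bs4 import BeautifulSoup

# Example User-Agent (an assumption; any current browser string should work)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

def google_scrape(query):
    # Fetch the results page and parse it (the 'lxml' parser must be installed)
    html = requests.get(f'https://www.google.com/search?q={query}', headers=headers).text
    soup = BeautifulSoup(html, 'lxml')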

    data = []

    for container in soup.findAll('div', class_='tF2Cxc'):
        title = container.select_one('.DKV0Md').text
        link = container.find('a')['href']

        data.append({
          'title': title,
          'link': link,
        })
        print(f'{title}\n{link}')

    print(json.dumps(data, indent=2))

google_scrape('english')

# part of the outputs:
'''
English language - Wikipedia
https://en.wikipedia.org/wiki/English_language
[
  {
    "title": "English language - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/English_language"
  },
]
'''
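
The `google_scrape` function above interpolates the query into the URL as-is, so characters such as `&` or `#` in the query would change the meaning of the URL. As a hedged alternative, the query string can be built with the standard library's `urllib.parse.urlencode`, the Python 3 counterpart of the `urllib.quote_plus` call in the question. The helper name `build_search_url` and its default parameter values are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_search_url(query, num=10, lang="en"):
    """Return a Google search URL with the query safely escaped."""
    params = {"q": query, "num": num, "hl": lang}
    return "https://www.google.com/search?" + urlencode(params)


print(build_search_url("english grammar & usage"))
# https://www.google.com/search?q=english+grammar+%26+usage&num=10&hl=en
```

With requests, passing the same dictionary as `params=` to `requests.get` has the same effect.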

If you still want to use a dict(), this is one way to do it (only part of the for loop is shown):

for container in soup.findAll('div', class_='tF2Cxc'):

    data_dict = {}

    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    # creates title key and assigns title value
    data_dict['title'] = title
    # creates link key and assigns link value
    data_dict['link'] = link

    print(json.dumps(data_dict, indent = 2))

# part of the output:
'''
{
  "title": "Minecraft Official Site | Minecraft",
  "link": "https://www.minecraft.net/en-us/"
}
'''
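
Both loops assume that every container has a `.DKV0Md` element and an `a` tag. BeautifulSoup's `select_one` and `find` return `None` when nothing matches, so if Google's class names change, the code raises an exception instead of skipping the result. The sketch below shows one defensive pattern; `extract_result` is a hypothetical helper, and the stand-in objects replace real BeautifulSoup tags so the snippet runs without a network request.

```python
from types import SimpleNamespace


def extract_result(title_node, link_node):
    """Return a result dict, or None if either node is missing."""
    if title_node is None or link_node is None:
        return None
    return {"title": title_node.text, "link": link_node["href"]}


# Stand-ins for the objects returned by select_one() and find().
title = SimpleNamespace(text="English language - Wikipedia")
link = {"href": "https://en.wikipedia.org/wiki/English_language"}

print(extract_result(title, link))
print(extract_result(None, link))  # None: the container can be skipped
```

Inside the loop this would be called as `extract_result(container.select_one('.DKV0Md'), container.find('a'))`, followed by `continue` when it returns `None`.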

To get dict output right away, you can do the same thing with the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches.

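Essentially, it does the same thing as the code above, but you don't have to work out how to scrape particular elements: the end user already gets JSON output, so the only thing left to do is iterate over the JSON and pick out the data you need.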

Code to integrate:

from serpapi import GoogleSearch
import json

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": "minecraft",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  print(json.dumps(result, indent = 2, ensure_ascii = False))

# part of the json output:
'''
{
  "position": 1,
  "title": "Minecraft - Aplikasi di Google Play",
  "link": "https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=in&gl=US",
  "displayed_link": "https://play.google.com › store › apps › details › id=co...",
  "rich_snippet": {
    "top": {
      "detected_extensions": {
        "skor": 46,
        "suara": 4144655,
        "us": 749
      },
      "extensions": [
        "Skor: 4,6",
        "‎4.144.655 suara",
        "‎US$7,49",
        "‎Android",
        "‎Game"
    ]
  }
}
'''
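
Since the response is already a dictionary, picking out just the fields of interest is a plain loop. This is a minimal sketch, assuming each entry in `organic_results` carries the `position`, `title`, and `link` keys shown in the output above. The sample dictionary is hand-written from that output rather than a live API response, and `summarize_results` is a hypothetical helper.

```python
def summarize_results(results):
    """Keep only position, title and link from each organic result."""
    summary = []
    for result in results.get("organic_results", []):
        summary.append({
            "position": result.get("position"),
            "title": result.get("title"),
            "link": result.get("link"),
        })
    return summary


# Hand-written sample mirroring part of the output shown above.
sample_results = {
    "organic_results": [
        {
            "position": 1,
            "title": "Minecraft - Aplikasi di Google Play",
            "link": "https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=in&gl=US",
            "displayed_link": "https://play.google.com › store › apps › details › id=co...",
        }
    ]
}

for item in summarize_results(sample_results):
    print(item["position"], item["title"], item["link"])
```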

Disclaimer: I work for SerpApi.

Solution 1
0 2015-08-28 15:36:35

Solution 2
0 2021-06-23 11:24:08