
Web scrape Google search results using BeautifulSoup

My goal is to web scrape Google search results using BeautifulSoup. I am using Anaconda Python with IPython as the IDE console. Why don't I get any output when I run the following command?

import urllib
import urllib2
from bs4 import BeautifulSoup

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class':'st'})
        print sSpan

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('english')

You are never adding anything to `linkdictionary`:

import urllib
import urllib2
from bs4 import BeautifulSoup

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        sSpan = li.find('span', attrs={'class':'st'})

        linkdictionary['href'] = sLink['href']
        linkdictionary['sSpan'] = sSpan

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('english')

The problem, as Cody Bouche mentioned, is that nothing is ever added to the `dict()`. In my opinion, you'll have a hard time collecting your results in a dict unless you change the `{}` (dict) to a `[]` (list).

Appending to a list is much simpler (note: I could be wrong here; it's just a personal opinion from previous experience).

To make it work in a simple manner, change the dict to a list (`{}` --> `[]`) and then use `.append({})` to append each result to the list.
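A minimal sketch of the difference (the `'title'` key and sample values are just illustrations): reusing the same keys in one dict overwrites the previous entry on every loop iteration, while appending a new dict to a list keeps every result.

```python
# Reusing one dict: the same key is assigned each iteration,
# so only the last result survives.
results_dict = {}
for title in ['first result', 'second result']:
    results_dict['title'] = title  # overwrites the previous value

# Appending to a list: a new dict per result, all results kept.
results_list = []
for title in ['first result', 'second result']:
    results_list.append({'title': title})

print(results_dict)       # {'title': 'second result'}
print(len(results_list))  # 2
```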

Code and example in the online IDE:

import json
import requests
from bs4 import BeautifulSoup

# a browser-like User-Agent header is needed so Google returns regular HTML
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}

def google_scrape(query):
    html = requests.get(f'https://www.google.com/search?q={query}', headers=headers).text
    soup = BeautifulSoup(html, 'lxml')

    data = []

    for container in soup.findAll('div', class_='tF2Cxc'):
        title = container.select_one('.DKV0Md').text
        link = container.find('a')['href']

        data.append({
          'title': title,
          'link': link,
        })
        print(f'{title}\n{link}')

    print(json.dumps(data, indent=2))

google_scrape('english')

# part of the outputs:
'''
English language - Wikipedia
https://en.wikipedia.org/wiki/English_language
[
  {
    "title": "English language - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/English_language"
  },
]
'''

If you still want to collect results in a `dict()`, this is one way of approaching it (only part of the for loop is shown):

for container in soup.findAll('div', class_='tF2Cxc'):

    data_dict = {}

    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    # creates title key and assigns title value
    data_dict['title'] = title
    # creates link key and assigns link value
    data_dict['link'] = link

    print(json.dumps(data_dict, indent = 2))

# part of the output:
'''
{
  "title": "Minecraft Official Site | Minecraft",
  "link": "https://www.minecraft.net/en-us/"
}
'''

To get a dict output right away, you can do the same thing using the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Essentially, it does the same thing as the code above, but you don't have to figure out how to scrape each element; that's already done for the end user, and the response comes as JSON, so the only thing left to do is iterate over the JSON and pull out the desired output.

Code to integrate:

from serpapi import GoogleSearch
import json

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": "minecraft",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  print(json.dumps(result, indent = 2, ensure_ascii = False))

# part of the json output:
'''
{
  "position": 1,
  "title": "Minecraft - Aplikasi di Google Play",
  "link": "https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=in&gl=US",
  "displayed_link": "https://play.google.com › store › apps › details › id=co...",
  "rich_snippet": {
    "top": {
      "detected_extensions": {
        "skor": 46,
        "suara": 4144655,
        "us": 749
      },
      "extensions": [
        "Skor: 4,6",
        "‎4.144.655 suara",
        "‎US$7,49",
        "‎Android",
        "‎Game"
    ]
  }
}
'''

Disclaimer: I work for SerpApi.
