BeautifulSoup script parsing Google search results stopped working

I would like to parse Google search results with Python. Everything worked perfectly, but now I keep getting an empty list. Here is the code that used to work fine:

query = urllib.urlencode({'q': self.Tagsinput.GetValue()+footprint,'ie': 'utf-8', 'num':searchresults, 'start': '100'})
result = url + query
myopener = MyOpener()
page = myopener.open(result)
xss = page.read()
soup = BeautifulSoup.BeautifulSoup(xss)
contents = [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]

This script worked perfectly in December, but now it has stopped working.

As far as I understand, the problem is in this line:

contents = [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]

When I print contents, the program returns an empty list: []

Can anybody help, please?

The API works a whole lot better, too. It returns simple JSON that you can easily parse and manipulate.

# Python 2 code; SearchTerm is assumed to be a unicode query string defined elsewhere.
import json
import urllib

BASE_URL = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&'
url = BASE_URL + urllib.urlencode({'q': SearchTerm.encode('utf-8')})
raw_res = urllib.urlopen(url).read()
results = json.loads(raw_res)
# Take the first hit and format it as "url - title".
hit1 = results['responseData']['results'][0]
prettyresult = ' - '.join((urllib.unquote(hit1['url']), hit1['titleNoFormatting']))
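
For readers on Python 3, the parsing step of the snippet above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name `format_first_hit` and the sample payload are made up, and the payload mirrors only the fields the snippet reads (`responseData`, `results`, `url`, `titleNoFormatting`).

```python
import json
from urllib.parse import unquote


def format_first_hit(raw_res):
    """Return 'url - title' for the first result, or None if there is none."""
    data = json.loads(raw_res)
    # Tolerate a missing or null responseData instead of raising.
    hits = (data.get('responseData') or {}).get('results') or []
    if not hits:
        return None
    hit = hits[0]
    return ' - '.join((unquote(hit['url']), hit['titleNoFormatting']))


# Hypothetical payload with the same shape as the fields used above.
sample = json.dumps({
    'responseData': {
        'results': [
            {'url': 'https://example.com/a%20b', 'titleNoFormatting': 'Example page'},
        ]
    }
})
print(format_first_hit(sample))
```

Returning None for an empty result set makes the "no hits" case explicit instead of failing with an IndexError on `[0]`.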

At the time of writing this answer, you don't have to parse the <script> tag (for the most part) to get the output from Google Search. This can be achieved with the beautifulsoup, requests, and lxml libraries.

Code to get the title and link (the answer also links an example in an online IDE):

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=minecraft', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# Each organic result sat in a div with class tF2Cxc when this answer was written.
for container in soup.find_all('div', class_='tF2Cxc'):
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    print(f'{title}\n{link}')

# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
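
Class names such as `tF2Cxc` and `DKV0Md` can change whenever Google updates its markup, which is the same kind of change that makes the `'class': 'l'` selector in the question return an empty list. A slightly more defensive version of the loop is sketched below under stated assumptions: the helper name `extract_results` and the sample HTML are invented for illustration, and the built-in `html.parser` backend is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup


def extract_results(html, container_class='tF2Cxc', title_selector='.DKV0Md'):
    """Return (title, link) pairs, skipping containers that lack either part."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for container in soup.find_all('div', class_=container_class):
        title_tag = container.select_one(title_selector)
        link_tag = container.find('a', href=True)
        # select_one/find return None when nothing matches, so check before use.
        if title_tag is None or link_tag is None:
            continue
        results.append((title_tag.get_text(strip=True), link_tag['href']))
    return results


# Made-up markup that imitates the structure the answer relies on.
sample_html = '''
<div class="tF2Cxc"><a href="https://www.minecraft.net/en-us/">
  <h3 class="DKV0Md">Minecraft Official Site | Minecraft</h3></a></div>
<div class="tF2Cxc"><a href="https://classic.minecraft.net/">no title here</a></div>
<div class="other"><a href="https://example.com/">ignored</a></div>
'''
for title, link in extract_results(sample_html):
    print(f'{title}\n{link}')
```

An empty list from this helper is a hint that the selectors no longer match the page, rather than that the search returned nothing.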

Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches. Check out the Playground.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"), # read the key from the API_KEY environment variable
  "engine": "google",
  "q": "minecraft",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  title = result['title']
  link = result['link']
  print(f'{title}\n{link}')

# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
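
If a response happens to lack the `organic_results` key, indexing `results['organic_results']` raises a KeyError. A small sketch of a more tolerant loop follows; it works on a plain dict so it runs without an API key. The helper name `titles_and_links` and the sample dict are hypothetical and mimic only the `title` and `link` fields used above.

```python
def titles_and_links(results):
    """Yield (title, link) pairs from a result dict shaped like the one above."""
    # .get() with a default avoids a KeyError when the key is absent.
    for result in results.get('organic_results', []):
        title = result.get('title')
        link = result.get('link')
        if title and link:
            yield title, link


# Hypothetical stand-in for search.get_dict(); not real API output.
sample_results = {
    'organic_results': [
        {'title': 'Minecraft Official Site | Minecraft',
         'link': 'https://www.minecraft.net/en-us/'},
        {'title': 'Entry without a link'},
    ]
}
for title, link in titles_and_links(sample_results):
    print(f'{title}\n{link}')
```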

Disclaimer: I work for SerpApi.
