Using linkGrabber to get 'href' from Google search in Python

OK, so all I want to do is get the very first link from the first Google search results page. I tried to use BeautifulSoup, but it didn't work out at all; I couldn't find a way to get the link. I then tried linkGrabber, so now I get all the URLs in the Google search (I have limited the results to only 1 per page). My code is:

import urllib.parse

import linkGrabber

# Read the movie name and URL-encode it for use in the query string
movie = input('Give movie name:  ')
query = urllib.parse.quote_plus(movie)
imdb_s = '+imdb+review'
n = 1  # number of results per page
g_s = 'https://www.google.com/search?q=' + query + imdb_s + '&num=' + str(n)
links = linkGrabber.Links(g_s)
gb = links.find(pretty=True)
print(gb)

However, when I print, I get about 15 links that come from Google itself and that I do not want to use. I want to focus on one specific href and get only that. Can anyone please help me?

You can use the Google search library (I think it's pip install google). This library also relies on Beautiful Soup, but it is built to return only search results. The problem is that the page Google returns when you search has ads and a bunch of other links that aren't the actual search results.

You can also change your query to "site:imdb.com+" to search only on IMDb.
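
As a minimal sketch of that idea, the query string can be assembled with urllib.parse so that the site: operator is encoded correctly. The helper name build_search_url is hypothetical and only illustrates the URL construction; it does not fetch anything.

```python
from urllib.parse import urlencode


def build_search_url(movie: str, num: int = 1) -> str:
    """Build a Google search URL restricted to imdb.com (hypothetical helper)."""
    # 'site:imdb.com' limits results to IMDb; urlencode escapes ':' and spaces
    params = {"q": f"site:imdb.com {movie} review", "num": num}
    return "https://www.google.com/search?" + urlencode(params)


print(build_search_url("The Matrix"))
```

Using urlencode rather than string concatenation avoids having to call quote_plus on each part by hand.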

That said, I've stopped using that for my googling needs because it's against Google's terms of service. I'm not moralizing, but the reality is that I can't get much reliability, as Google keeps sniffing out bots and throwing reCAPTCHAs at them.

The correct way to do it would be to use Google's Custom Search API, which is also good for returning only the info you need, and it's free for 100 searches per day.
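
A rough sketch of what that could look like with only the standard library, assuming you have already created an API key and a search engine ID. The environment variable names GOOGLE_API_KEY and GOOGLE_CSE_ID and both helper functions are assumptions made for this example, not part of the API.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/customsearch/v1"


def build_request_url(query: str, api_key: str, engine_id: str) -> str:
    # 'key' is the API key, 'cx' the search engine ID, 'num' caps the result count
    params = {"key": api_key, "cx": engine_id, "q": query, "num": 1}
    return API_URL + "?" + urlencode(params)


def first_link(response: dict):
    # Results are listed under 'items'; the key is absent when nothing matched
    items = response.get("items", [])
    return items[0]["link"] if items else None


if __name__ == "__main__":
    # Assumed environment variable names; set them to your own credentials
    api_key = os.getenv("GOOGLE_API_KEY")
    engine_id = os.getenv("GOOGLE_CSE_ID")
    if api_key and engine_id:
        url = build_request_url("minecraft imdb review", api_key, engine_id)
        with urlopen(url) as resp:
            print(first_link(json.load(resp)))
```

Because the response is plain JSON, there is no HTML to parse and no ad or navigation links to filter out.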

To get the very first link, you can use the bs4 select_one() method.

It didn't work because you didn't specify a user-agent in the request headers. Without one, the requests library sends its default user-agent, python-requests, so Google can tell that the request comes from a script. Passing a browser-like user-agent makes the request look like a real user visit:

headers = {
  "User-Agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
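
To see the difference without contacting Google, a small sketch like the following prepares a request but never sends it. The shortened user-agent string is a placeholder for the example.

```python
import requests

# The User-Agent that requests sends when none is supplied
session = requests.Session()
default_ua = session.headers["User-Agent"]
print(default_ua)  # starts with "python-requests/"

# Placeholder browser-like user-agent, shortened for the example
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Prepare (but do not send) a request to inspect the outgoing headers
request = requests.Request(
    "GET", "https://www.google.com/search?q=minecraft", headers=headers
)
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
```

The headers passed with the request take precedence over the session defaults, which is why the second print shows the browser-like string.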

Code and example in the online IDE:

import requests
from bs4 import BeautifulSoup  # the 'lxml' parser must be installed (pip install lxml)

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=minecraft', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# Each organic result sits in a div with class 'tF2Cxc' (class names may change over time)
for container in soup.find_all('div', class_='tF2Cxc'):
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    print(f'{title}\n{link}')

# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
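
Since the question asks only for the first link, here is a minimal sketch of select_one() on its own. The HTML below is a hand-written stand-in that reuses the class names from the code above; real Google markup differs and changes, so treat the selector as an assumption.

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking the structure used above (not real Google output)
html = """
<div class="tF2Cxc"><a href="https://www.minecraft.net/en-us/"><h3 class="DKV0Md">Minecraft Official Site | Minecraft</h3></a></div>
<div class="tF2Cxc"><a href="https://classic.minecraft.net/"><h3 class="DKV0Md">Minecraft Classic</h3></a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first element matching the CSS selector, or None
first = soup.select_one("div.tF2Cxc a[href]")
first_link = first["href"] if first else None
print(first_link)
```

Checking for None before indexing avoids a TypeError when the selector matches nothing, for example when Google serves a CAPTCHA page instead of results.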

Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

The main difference is that you don't have to think about why Google is blocking you or why a certain selector gives the wrong output even though it shouldn't. That is already handled for the end user, who gets JSON output.

Check out the Playground.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"), # environment for API_KEY
  "engine": "google",
  "q": "minecraft",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  title = result['title']
  link = result['link']
  print(f'{title}\n{link}')

# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
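
To take only the first result from that dictionary instead of looping, something like the sketch below could be used. The helper first_organic_link is hypothetical, and the sample dict only imitates the shape of the output shown above; it is not a real API response.

```python
def first_organic_link(results: dict):
    """Return the link of the first organic result, or None (hypothetical helper)."""
    organic = results.get("organic_results", [])
    return organic[0]["link"] if organic else None


# Sample data shaped like the output above, not a real API response
sample = {
    "organic_results": [
        {"title": "Minecraft Official Site | Minecraft",
         "link": "https://www.minecraft.net/en-us/"},
        {"title": "Minecraft Classic",
         "link": "https://classic.minecraft.net/"},
    ]
}
print(first_organic_link(sample))
```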

Disclaimer: I work for SerpApi.
