Unable to extract links from a Google search using BeautifulSoup in Python

Question

I want to extract the links from the page returned by a Google search:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.google.com/search?q=machine+learning')
soup = BeautifulSoup(response.text, 'html.parser')

soup.find_all('div', class_='r')

But it gives me an empty list, [].

Is there a way to do this?

Answer 1

If you use Selenium, you should get the expected output.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

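# Note: passing the driver path positionally works only on older Selenium releases
driver = webdriver.Chrome("path of the chrome driver")
driver.get("https://www.google.com/search?q=machine+learning")
# Wait up to 20 seconds for the result containers to appear
elements = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.r'))
)
for ele in elements:
    # find_element(By.XPATH, ...) replaces find_element_by_xpath, removed in newer Selenium 4 releases
    print(ele.find_element(By.XPATH, "./a").get_attribute('href'))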

Output:

https://www.expertsystem.com/machine-learning-definition/
https://www.geeksforgeeks.org/top-5-best-programming-languages-for-artificial-intelligence-field/
https://www.geeksforgeeks.org/difference-between-machine-learning-and-artificial-intelligence/
http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
https://machinelearningmastery.com/start-here/
https://en.wikipedia.org/wiki/Machine_learning
https://www.sas.com/en_gb/insights/analytics/machine-learning.html
https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12
https://www.coursera.org/learn/machine-learning
https://www.expertsystem.com/machine-learning-definition/
https://searchenterpriseai.techtarget.com/definition/machine-learning-ML
https://emerj.com/ai-glossary-terms/what-is-machine-learning/
https://www.geeksforgeeks.org/machine-learning/

Answer 2

Try this:

import requests
from bs4 import BeautifulSoup

search = input("Search:")
results = 100  # valid options: 10, 20, 30, 40, 50, and 100
# Passing params lets requests URL-encode the query
page = requests.get("https://www.google.com/search", params={"q": search, "num": results})
soup = BeautifulSoup(page.content, "html5lib")  # requires the html5lib package
for link in soup.find_all("a", href=True):  # skip anchors without an href
    link_href = link["href"]
    # keep only redirect links of the form /url?q=<target>&sa=U...
    if "url?q=" in link_href and "webcache" not in link_href:
        print(link_href.split("?q=")[1].split("&sa=U")[0])
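
The string splitting above depends on the href having the exact form /url?q=<target>&sa=U. As a hedged alternative, the standard library's urllib.parse can decode the same kind of redirect link. This is only a sketch: the helper name target_from_redirect and the sample hrefs are made up for illustration and are not part of the original answer.

```python
from urllib.parse import urlparse, parse_qs


def target_from_redirect(href):
    """Return the destination URL from a '/url?q=...' redirect href, or None."""
    if not href:
        return None
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    # parse_qs also percent-decodes the value, unlike plain str.split
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


# Made-up hrefs in the shape the code above expects
samples = [
    "/url?q=https://en.wikipedia.org/wiki/Machine_learning&sa=U&ved=abc",
    "/search?q=machine+learning&start=10",
    None,
]
for href in samples:
    print(target_from_redirect(href))
```

Anchors without an href, or with an href that is not a redirect, simply yield None, so the caller can filter them out without a try/except.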

Answer 3

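There is no need for Selenium, as KunduK suggested (Answer 1), nor to make things as complicated as Kagathara's suggestion (Answer 2) for a task like this.

The problem is that no user-agent is specified. The default user-agent sent by requests is python-requests, so Google blocks the request and returns completely different HTML with different selectors.

Pass a user-agent in the request headers: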

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)
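
To see the difference without sending anything to Google, a request can be prepared and inspected locally. The following is a minimal sketch, assuming the requests library is installed; the query "machine learning" is taken from the question, and nothing here goes over the network.

```python
import requests

# What requests identifies itself as when no header is supplied
print(requests.utils.default_user_agent())

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Build the request without sending it, then inspect what would go out
session = requests.Session()
prepared = session.prepare_request(
    requests.Request(
        "GET",
        "https://www.google.com/search",
        params={"q": "machine learning"},
        headers=headers,
    )
)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

The first line printed starts with python-requests/, which is the value a server sees when no custom header is passed; the last line shows the browser-like string replacing it.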

Extracting the links is then straightforward:

# container with needed data
for result in soup.select('.tF2Cxc'):
  # extracting links from container and grabbing href attribute
  link = result.select_one('.yuRUbf a')['href']

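Have a look at the SelectorGadget Chrome extension, which lets you grab CSS selectors by clicking on the desired element in your browser. A CSS selectors reference is also worth consulting.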
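
Google's class names change over time, so a selector that matches today may return nothing later. The sketch below runs the same selectors against a hand-written HTML snippet (an assumption that only imitates the class names above, not real Google markup) and guards against select_one returning None.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that only imitates the class names used above
sample_html = """
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/first"><h3>First</h3></a></div>
</div>
<div class="tF2Cxc">
  <div class="other">no link container here</div>
</div>
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/second"><h3>Second</h3></a></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

links = []
for result in soup.select(".tF2Cxc"):
    anchor = result.select_one(".yuRUbf a")
    # select_one returns None when nothing matches, so guard before indexing
    if anchor is not None and anchor.has_attr("href"):
        links.append(anchor["href"])

print(links)
```

Without the guard, the second container would raise a TypeError, because indexing None with ['href'] is not allowed.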

Code and full example:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah",  # query
  "hl": "en",         # language
  "num": "10"         # number of results
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']
  print(link)  # the output below lists only the links

-----
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://tenor.com/search/fus-ro-dah-gifs
https://marketplace.xbox.com/en-US/Product/Skyrim-Fus-Ro-Dah/00001000-b646-c203-c05e-7534425307e6
'''

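Alternatively, you can use the Google Results API from SerpApi to do this. It is a paid API with a free plan.

The difference in your case is that the work of extracting the data and getting around blocks is already done for the end user. All that really remains is to iterate over the structured JSON and pick out the data you want.

Code to integrate: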

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['link'])

-------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://tenor.com/search/fus-ro-dah-gifs
https://marketplace.xbox.com/en-US/Product/Skyrim-Fus-Ro-Dah/00001000-b646-c203-c05e-7534425307e6
'''
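
Iterating over the structured result can be made tolerant of missing keys. The sketch below uses a hand-made dictionary shaped like the organic_results list above; the position and title fields and the helper name collect_links are illustrative assumptions, not documented API output.

```python
# Hand-made dictionary imitating the shape used above; not real API output
results = {
    "organic_results": [
        {"position": 1, "title": "First", "link": "https://example.com/first"},
        {"position": 2, "title": "No link field"},
        {"position": 3, "title": "Third", "link": "https://example.com/third"},
    ]
}


def collect_links(results):
    """Return every 'link' under 'organic_results', skipping entries without one."""
    links = []
    # .get with a default avoids a KeyError when the key is absent
    for result in results.get("organic_results", []):
        link = result.get("link")
        if link:
            links.append(link)
    return links


print(collect_links(results))
print(collect_links({}))
```

An empty response then yields an empty list instead of an exception, which is easier to handle in a larger script.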

Disclaimer: I work for SerpApi.

3 answers

Answer 1
1 accepted 2019-07-26 11:52:00

Answer 2
1 2019-07-26 12:21:43

Answer 3
0 2021-09-06 15:14:45