Retrieving links from a Google search using BeautifulSoup in Python
I want to extract the links from the results page of a Google search:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.google.com/search?q=machine+learning')
soup = BeautifulSoup(response.text, 'html.parser')
soup.find_all('div', class_='r')
But it gives me an empty list [].
Is there a way to do this?
If you use Selenium, you should get the expected output.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
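driver = webdriver.Chrome()  # Selenium 4.6+ locates chromedriver itself; older releases take the driver path here
driver.get("https://www.google.com/search?q=machine+learning")
# wait up to 20 seconds for the result containers to appear
elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.r')))
for ele in elements:
    # find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4 removed
    print(ele.find_element(By.XPATH, "./a").get_attribute('href'))
driver.quit()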
Output:
https://www.expertsystem.com/machine-learning-definition/
https://www.geeksforgeeks.org/top-5-best-programming-languages-for-artificial-intelligence-field/
https://www.geeksforgeeks.org/difference-between-machine-learning-and-artificial-intelligence/
http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
https://machinelearningmastery.com/start-here/
https://en.wikipedia.org/wiki/Machine_learning
https://www.sas.com/en_gb/insights/analytics/machine-learning.html
https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12
https://www.coursera.org/learn/machine-learning
https://www.expertsystem.com/machine-learning-definition/
https://searchenterpriseai.techtarget.com/definition/machine-learning-ML
https://emerj.com/ai-glossary-terms/what-is-machine-learning/
https://www.geeksforgeeks.org/machine-learning/
Try this:
import requests
from bs4 import BeautifulSoup

search = input("Search:")
results = 100  # valid options 10, 20, 30, 40, 50, and 100
# passing params lets requests URL-encode the query (spaces, &, and so on)
page = requests.get("https://www.google.com/search", params={"q": search, "num": results})
soup = BeautifulSoup(page.content, "html5lib")  # requires the html5lib package
for link in soup.find_all("a"):
    link_href = link.get('href') or ''  # some <a> tags have no href
    # result links are wrapped as /url?q=<target>&sa=U...; skip cached copies
    if "url?q=" in link_href and "webcache" not in link_href:
        print(link_href.split("?q=")[1].split("&sa=U")[0])
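The string splitting in the snippet above depends on the exact order of the redirect's query parameters. As a minimal sketch, assuming the hrefs have the form /url?q=<target>&sa=U... that the snippet filters for, the same extraction can be done with the standard library's URL parser, which also decodes percent-encoded targets. The helper name target_from_redirect and the sample hrefs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def target_from_redirect(href):
    """Return the destination of a '/url?q=...' redirect href, or None."""
    if not href:
        return None
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None  # not a redirect-style result link
    values = parse_qs(parsed.query).get("q")
    return values[0] if values else None


# Hypothetical hrefs, shaped like the ones the snippet above filters for.
samples = [
    "/url?q=https://en.wikipedia.org/wiki/Machine_learning&sa=U&ved=abc",
    "/url?q=https://example.com/search%3Fa%3D1%26b%3D2&sa=U",
    "/search?q=machine+learning",
    None,
]
targets = [target_from_redirect(h) for h in samples]
print(targets)
```

Hrefs that are missing or that do not point at /url come back as None, so the caller can filter them out instead of hitting an exception.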
There is no need for Selenium, as KunduK suggested, or for making things as complicated as Kagathara suggested, for a task like this.
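The problem is that no user-agent is specified. The default requests user-agent is python-requests, so Google blocks the request and you receive completely different HTML with different selectors. Read more about request headers.
Pass a user-agent in the request headers: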
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
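To check which user-agent would actually be sent, the request can be prepared without sending it. This is a minimal sketch, assuming only that the requests package is installed. It reuses the user-agent string from the answer, and the query value is just an example. Nothing here contacts Google.

```python
import requests

# User-agent string taken from the answer above.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    )
}

session = requests.Session()
# What requests sends when no header is supplied: "python-requests/<version>".
default_ua = session.headers["User-Agent"]

# Build the request but do not send it; this only merges headers and encodes params.
prepared = session.prepare_request(
    requests.Request(
        "GET",
        "https://www.google.com/search",
        params={"q": "machine learning"},
        headers=headers,
    )
)

print(default_ua)
print(prepared.headers["User-Agent"])
print(prepared.url)
```

The header supplied on the request overrides the session default, and header names are matched case-insensitively, so 'User-agent' and 'User-Agent' behave the same.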
Extracting the links is straightforward:
# container with needed data
for result in soup.select('.tF2Cxc'):
    # extracting links from container and grabbing href attribute
    link = result.select_one('.yuRUbf a')['href']
Check out the SelectorGadget Chrome extension to grab CSS selectors by clicking the desired element in your browser. See also the CSS selectors reference.
from bs4 import BeautifulSoup
import requests, lxml  # lxml is imported only to confirm the parser is installed

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "fus ro dah",  # query
    "hl": "en",         # language
    "num": "10"         # number of results
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(link)
-----
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://tenor.com/search/fus-ro-dah-gifs
https://marketplace.xbox.com/en-US/Product/Skyrim-Fus-Ro-Dah/00001000-b646-c203-c05e-7534425307e6
'''
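The selectors can be tried offline before pointing them at a live page. The sketch below runs them against a small static snippet. The markup is hypothetical, and only the class names .tF2Cxc, .yuRUbf and .DKV0Md come from the answer. It also skips containers where select_one finds nothing, which is what happens when Google returns different HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names are taken from the answer above.
SAMPLE_HTML = """
<div class="tF2Cxc">
  <div class="yuRUbf">
    <a href="https://example.com/a"><h3 class="DKV0Md">First</h3></a>
  </div>
</div>
<div class="tF2Cxc">
  <div class="yuRUbf"><span>no anchor here</span></div>
</div>
"""


def extract_results(html):
    """Return (title, link) pairs, skipping containers without a usable link."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for result in soup.select(".tF2Cxc"):
        anchor = result.select_one(".yuRUbf a")
        title = result.select_one(".DKV0Md")
        if anchor is None or not anchor.get("href"):
            continue  # select_one returned None or the tag has no href
        found.append((title.get_text(strip=True) if title else None, anchor["href"]))
    return found


results = extract_results(SAMPLE_HTML)
print(results)
```

An empty list here means the containers were not found at all, which matches the symptom described in the question.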
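Alternatively, you can use the Google Results API from SerpApi. It is a paid API with a free plan.
The difference in your case is that the work of extracting the data and getting around blocks is already done for the end user. All that is really left is to iterate over the structured JSON and pick out the data you want.
Code to integrate: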
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['link'])
-------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://tenor.com/search/fus-ro-dah-gifs
https://marketplace.xbox.com/en-US/Product/Skyrim-Fus-Ro-Dah/00001000-b646-c203-c05e-7534425307e6
'''
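Iterating over the structured JSON can be sketched without an API key. The sample below is a hypothetical payload, not a real SerpApi response. Only the "organic_results" and "link" keys come from the code above, and the "position" field is invented for illustration. The helper tolerates a missing list or a missing link rather than raising KeyError.

```python
import json

# Hypothetical payload; "position" is an illustrative field, not a documented one.
SAMPLE = """
{
  "organic_results": [
    {"position": 1, "link": "https://example.com/a"},
    {"position": 2},
    {"position": 3, "link": "https://example.com/c"}
  ]
}
"""


def organic_links(payload):
    """Collect the 'link' of every organic result that has one."""
    return [
        item["link"]
        for item in payload.get("organic_results", [])
        if item.get("link")
    ]


links = organic_links(json.loads(SAMPLE))
print(links)
```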
Disclaimer: I work for SerpApi.