
Using Selenium and Beautiful Soup together

I am scraping a Google Scholar profile page. Right now I have Python code using the Beautiful Soup library that collects data from the page:

import requests
from bs4 import BeautifulSoup

url = "https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en"

# Fetch and parse the page once (the original "while True:" never terminated)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Each article is one table row with the class "gsc_a_tr"
research_article = soup.find_all('tr', {'class': 'gsc_a_tr'})

for research in research_article:
    title = research.find('a', {'class': 'gsc_a_at'}).text
    authors = research.find('div', {'class': 'gs_gray'}).text

    print('Title:', title, '\n', '\nAuthors:', authors)

I also have Python code using the Selenium library that automates the profile page to click the "Show more" button:

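from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# executable_path is the Selenium 3 style; Selenium 4 expects a Service object instead
driver = webdriver.Chrome(executable_path="/Applications/chromedriver84")
driver.get(url)

# Wait up to 10s until the element is loaded on the page, locating it by id.
# Clicking after the wait (rather than in a "finally" block) avoids a NameError
# when the wait times out and "element" was never assigned.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'gsc_bpf_more'))
)
element.click()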

How can I combine these two code blocks so that I can click the "Show more" button and scrape the entire page? Thanks in advance!

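This approach is very similar to Andrej Kesely's answer. It differs in the if statement and the code style, and it also offers an alternative solution.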

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "user": "VjJm3zYAAAAJ", # user-id
    "hl": "en",             # language
    "gl": "us",             # country to search from
    "cstart": 0,            # articles page. 0 is the first page
    "pagesize": "100"       # articles per page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

articles_is_present = True

while articles_is_present:
    # timeout to stop waiting for response after 30 sec
    html = requests.post("https://scholar.google.com/citations", params=params, headers=headers, timeout=30) 
    soup = BeautifulSoup(html.text, "lxml")

    for index, article in enumerate(soup.select("#gsc_a_b .gsc_a_t"), start=1):
        article_title = article.select_one(".gsc_a_at").text
        article_link = f'https://scholar.google.com{article.select_one(".gsc_a_at")["href"]}'
        article_authors = article.select_one(".gsc_a_at+ .gs_gray").text
        article_publication = article.select_one(".gs_gray+ .gs_gray").text

        print(f"article #{int(params['cstart']) + index}",
              article_title,
              article_link,
              article_authors,
              article_publication, sep="\n")

    # this selector is checking for the .class that contains: "There are no articles in this profile."
    # example link: https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en&cstart=500&pagesize=100
    if soup.select_one(".gsc_a_e"):
        articles_is_present = False
    else:
        params["cstart"] += 100  # paginate to the next page


# output:
'''
article #1
Hyper-heuristics: A survey of the state of the art
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:5N-NJrZHaHcC
EK Burke, M Gendreau, M Hyde, G Kendall, G Ochoa, E Özcan, R Qu
Journal of the Operational Research Society 64 (12), 1695-1724, 2013

article #2
Hyper-heuristics: An emerging direction in modern search technology
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:u5HHmVD_uO8C
E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
Handbook of metaheuristics, 457-474, 2003
...
article #428
A Library of Vehicle Routing Problems
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:bnK-pcrLprsC
T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese

article #429
This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for …
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:D_sINldO8mEC
S Louis, G Kendall
'''
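
To make the pagination in the loop above more concrete, here is a small sketch of the URLs it ends up requesting as cstart advances. It uses only the standard library. The helper name profile_page_url is made up for illustration and is not part of requests or of the original answer.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/citations"


def profile_page_url(user: str, cstart: int, pagesize: int = 100) -> str:
    """Build the URL for one page of a profile (hypothetical helper)."""
    query = {"user": user, "hl": "en", "cstart": cstart, "pagesize": pagesize}
    return f"{BASE_URL}?{urlencode(query)}"


# URLs for the first three pages of 100 articles each
for cstart in range(0, 300, 100):
    print(profile_page_url("VjJm3zYAAAAJ", cstart))
```

With cstart=500 this produces the same example link mentioned in the comment inside the loop, which is the page where the "no articles" element appears for this profile.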

Alternatively, you can achieve this with the Google Scholar Author API from SerpApi. It is a paid API with a free plan.

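Essentially, it does almost the same thing, except that you don't have to work out how to scale the number of requests or find a good proxy/CAPTCHA provider, since that is handled for the user, nor how to extract data from JavaScript without browser automation.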

Example code to scrape all of an author's articles:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "hl": "en",
    "author_id": "VjJm3zYAAAAJ",
    "start": "0",
    "num": "100"
}

search = GoogleSearch(params)

articles_is_present = True

while articles_is_present:
    results = search.get_dict()

    for index, article in enumerate(results["articles"], start=1):
        title = article["title"]
        link = article["link"]
        authors = article["authors"]
        publication = article.get("publication")
        citation_id = article["citation_id"]

        print(f"article #{int(params['start']) + index}", 
              title, 
              link, 
              authors, 
              publication, 
              citation_id, sep="\n")


    if "next" in results.get("serpapi_pagination", {}):
        # split the "next" URL into its query parameters and update the search params to point at the next page
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        articles_is_present = False

# output:
'''
article #1
Hyper-heuristics: A survey of the state of the art
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:5N-NJrZHaHcC
EK Burke, M Gendreau, M Hyde, G Kendall, G Ochoa, E Özcan, R Qu
Journal of the Operational Research Society 64 (12), 1695-1724, 2013
VjJm3zYAAAAJ:5N-NJrZHaHcC

article #2
Hyper-heuristics: An emerging direction in modern search technology
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:u5HHmVD_uO8C
E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
Handbook of metaheuristics, 457-474, 2003
VjJm3zYAAAAJ:u5HHmVD_uO8C
...
article #428
A Library of Vehicle Routing Problems
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:bnK-pcrLprsC
T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese
None
VjJm3zYAAAAJ:bnK-pcrLprsC

article #429
This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for …
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:D_sINldO8mEC
S Louis, G Kendall
None
VjJm3zYAAAAJ:D_sINldO8mEC
'''
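
The pagination step in the loop above packs three calls into one line, so here is a sketch of what it does in isolation. The next_url value below is an illustrative stand-in. In the real script it comes from results["serpapi_pagination"]["next"], and its exact shape is an assumption here.

```python
from urllib.parse import urlsplit, parse_qsl

# Illustrative "next" URL (assumed shape, not taken from a real response)
next_url = (
    "https://serpapi.com/search.json"
    "?author_id=VjJm3zYAAAAJ&engine=google_scholar_author&hl=en&num=100&start=100"
)

# urlsplit() isolates the query string, parse_qsl() turns it into (key, value) pairs,
# and dict() makes those pairs usable with params_dict.update()
query_string = urlsplit(next_url).query
next_params = dict(parse_qsl(query_string))

print(next_params)
```

Note that every value comes back as a string, which matches the string values ("0", "100") used in the params dict of the script above.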

If you want to scrape organic and cite results from all available pages, I have a dedicated blog post at SerpApi: Scrape historic Google Scholar results using Python.

Disclaimer: I work for SerpApi.

This script will print all titles and authors from the page:

import re
import requests
from bs4 import BeautifulSoup


url = 'https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en'
api_url = 'https://scholar.google.com/citations?user={user}&hl=en&cstart={start}&pagesize={pagesize}'
user_id = re.search(r'user=(.*?)&', url).group(1)

start = 0
while True:
    soup = BeautifulSoup( requests.post(api_url.format(user=user_id, start=start, pagesize=100)).content, 'html.parser' )

    research_article = soup.find_all('tr',{'class':'gsc_a_tr'})

    for i, research in enumerate(research_article, 1):
        title = research.find('a',{'class':'gsc_a_at'})
        authors = research.find('div',{'class':'gs_gray'})

        print('{:04d} {:<80} {}'.format(start+i, title.text, authors.text))

    if len(research_article) != 100:
        break

    start += 100

Prints:

0001 Hyper-heuristics: A Survey of the State of the Art                               EK Burke, M Hyde, G Kendall, G Ochoa, E Ozcan, R Qu
0002 Hyper-heuristics: An emerging direction in modern search technology              E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
0003 Search methodologies: introductory tutorials in optimization and decision support techniques E Burke, EK Burke, G Kendall
0004 A tabu-search hyperheuristic for timetabling and rostering                       EK Burke, G Kendall, E Soubeiga
0005 A hyperheuristic approach to scheduling a sales summit                           P Cowling, G Kendall, E Soubeiga
0006 A classification of hyper-heuristic approaches                                   EK Burker, M Hyde, G Kendall, G Ochoa, E Özcan, JR Woodward
0007 Genetic algorithms                                                               K Sastry, D Goldberg, G Kendall

...

0431 Solution Methodologies for generating robust Airline Schedules                   F Bian, E Burke, S Jain, G Kendall, GM Koole, J Mulder, MCE Paelinck, ...
0432 A Triple objective function with a chebychev dynamic point specification approach to optimise the surface mount placement machine M Ayob, G Kendall
0433 A Library of Vehicle Routing Problems                                            T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese
0434 This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for … S Louis, G Kendall
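
Both answers sidestep Selenium by paginating with cstart and pagesize. If browser automation is still preferred, the two snippets from the question could be combined roughly as follows: click "Show more" until nothing is left, then hand driver.page_source to Beautiful Soup. This is only a sketch under assumptions: that the button becomes disabled once all articles are loaded, that a fixed pause is enough for each batch to render, and that Selenium can locate a Chrome driver by itself. The function names parse_articles, load_full_profile and scrape_profile are made up for illustration.

```python
import time

from bs4 import BeautifulSoup


def parse_articles(html: str) -> list:
    """Extract (title, authors) pairs using the same selectors as the question."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for row in soup.find_all("tr", {"class": "gsc_a_tr"}):
        title = row.find("a", {"class": "gsc_a_at"})
        authors = row.find("div", {"class": "gs_gray"})
        if title and authors:  # skip rows that lack either element
            articles.append((title.text, authors.text))
    return articles


def load_full_profile(url: str, pause: float = 1.0, max_clicks: int = 50) -> str:
    """Click "Show more" until it is disabled, then return the rendered HTML."""
    # Imported here so parse_articles() can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Selenium can find a Chrome driver
    try:
        driver.get(url)
        button = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "gsc_bpf_more"))
        )
        for _ in range(max_clicks):
            if not button.is_enabled():  # assumption: disabled means no more articles
                break
            button.click()
            time.sleep(pause)  # crude wait for the next batch of rows to render
        return driver.page_source
    finally:
        driver.quit()


def scrape_profile(url: str) -> list:
    """Load the fully expanded profile in a browser and parse it."""
    return parse_articles(load_full_profile(url))
```

Calling scrape_profile("https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en") would then return the list of (title, authors) tuples. Keeping the parsing separate from the browser step means the selectors can be checked against saved HTML without launching Chrome.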

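One detail of the requests-based script is worth a note: the regular expression user=(.*?)& only matches when another parameter follows user in the URL. A sketch of an order-independent alternative using the standard library is shown here. The helper name extract_user_id is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def extract_user_id(profile_url: str) -> str:
    """Return the value of the "user" query parameter, wherever it appears."""
    query = parse_qs(urlsplit(profile_url).query)
    return query["user"][0]  # raises KeyError if the URL has no "user" parameter


print(extract_user_id("https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en"))
print(extract_user_id("https://scholar.google.com/citations?hl=en&user=VjJm3zYAAAAJ"))
```

Both calls return the same id, whereas the regular expression would find no match for the second URL.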
