Using Selenium and Beautiful Soup together

Question

I'm scraping a Google Scholar profile page. Right now I have Python code that uses the Beautiful Soup library to collect data from the page:

import requests
from bs4 import BeautifulSoup

url = "https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en"

# Fetch the profile page once and parse it
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Each article is one table row with class "gsc_a_tr"
research_article = soup.find_all('tr', {'class': 'gsc_a_tr'})

for research in research_article:
    title = research.find('a', {'class': 'gsc_a_at'}).text
    authors = research.find('div', {'class': 'gs_gray'}).text

    print('Title:', title, '\n', '\nAuthors:', authors)

I also have Python code that uses the Selenium library to open the profile page and click the 'show more' button:

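from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 4+: pass the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service("/Applications/chromedriver84"))
driver.get(url)

# Wait up to 10s until the 'show more' button (located by id) is on the page.
# Click only after the wait succeeds: clicking in a `finally` block raises
# NameError if the wait times out before `element` is assigned.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'gsc_bpf_more'))
)
element.click()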

How can I combine these two blocks of code so that I can click the 'show more' button and then scrape the entire page? Thanks in advance!

Answer 1

This approach is very similar to Andrej Kesely's answer. It differs in the if statement that ends the loop and in code style, and it also provides an alternative solution.

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "user": "VjJm3zYAAAAJ", # user-id
    "hl": "en",             # language
    "gl": "us",             # country to search from
    "cstart": 0,            # articles page. 0 is the first page
    "pagesize": "100"       # articles per page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

articles_is_present = True

while articles_is_present:
    # timeout to stop waiting for response after 30 sec
    html = requests.post("https://scholar.google.com/citations", params=params, headers=headers, timeout=30) 
    soup = BeautifulSoup(html.text, "lxml")

    for index, article in enumerate(soup.select("#gsc_a_b .gsc_a_t"), start=1):
        article_title = article.select_one(".gsc_a_at").text
        article_link = f'https://scholar.google.com{article.select_one(".gsc_a_at")["href"]}'
        article_authors = article.select_one(".gsc_a_at+ .gs_gray").text
        article_publication = article.select_one(".gs_gray+ .gs_gray").text

        print(f"article #{int(params['cstart']) + index}",
              article_title,
              article_link,
              article_authors,
              article_publication, sep="\n")

    # this selector is checking for the .class that contains: "There are no articles in this profile."
    # example link: https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en&cstart=500&pagesize=100
    if soup.select_one(".gsc_a_e"):
        articles_is_present = False
    else:
        params["cstart"] += 100  # paginate to the next page


# output:
'''
article #1
Hyper-heuristics: A survey of the state of the art
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:5N-NJrZHaHcC
EK Burke, M Gendreau, M Hyde, G Kendall, G Ochoa, E Özcan, R Qu
Journal of the Operational Research Society 64 (12), 1695-1724, 2013

article #2
Hyper-heuristics: An emerging direction in modern search technology
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:u5HHmVD_uO8C
E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
Handbook of metaheuristics, 457-474, 2003
...
article #428
A Library of Vehicle Routing Problems
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:bnK-pcrLprsC
T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese

article #429
This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for …
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:D_sINldO8mEC
S Louis, G Kendall
'''
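For reference, here is a minimal sketch of how the params dict above turns into the paginated URLs. It uses only urllib.parse from the standard library and makes no network requests. The helper name page_url is made up for illustration and is not part of requests or of the script above.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/citations"


def page_url(user_id, cstart, pagesize=100):
    """Build the profile URL for one page of articles (hypothetical helper)."""
    query = {"user": user_id, "hl": "en", "cstart": cstart, "pagesize": pagesize}
    return f"{BASE_URL}?{urlencode(query)}"


# First three pages of the profile used above: cstart advances by pagesize.
for cstart in range(0, 300, 100):
    print(page_url("VjJm3zYAAAAJ", cstart))
```

Each pass through the while loop requests the next of these URLs, and the loop ends once the returned page contains the .gsc_a_e element instead of article rows.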

Alternatively, you can achieve it using the Google Scholar Author API from SerpApi. It's a paid API with a free plan.

Essentially it does almost the same thing, except that you don't have to think about how to scale the number of requests, how to find a good proxy or CAPTCHA provider (that is handled for you), or how to scrape data rendered by JavaScript without browser automation.

Example code to scrape all author articles:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "hl": "en",
    "author_id": "VjJm3zYAAAAJ",
    "start": "0",
    "num": "100"
}

search = GoogleSearch(params)

articles_is_present = True

while articles_is_present:
    results = search.get_dict()

    for index, article in enumerate(results["articles"], start=1):
        title = article["title"]
        link = article["link"]
        authors = article["authors"]
        publication = article.get("publication")
        citation_id = article["citation_id"]

        print(f"article #{int(params['start']) + index}", 
              title, 
              link, 
              authors, 
              publication, 
              citation_id, sep="\n")


    if "next" in results.get("serpapi_pagination", []):
        # split URL in parts as a dict() and update search "params" variable to a new page
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        articles_is_present = False

# output:
'''
article #1
Hyper-heuristics: A survey of the state of the art
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:5N-NJrZHaHcC
EK Burke, M Gendreau, M Hyde, G Kendall, G Ochoa, E Özcan, R Qu
Journal of the Operational Research Society 64 (12), 1695-1724, 2013
VjJm3zYAAAAJ:5N-NJrZHaHcC

article #2
Hyper-heuristics: An emerging direction in modern search technology
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:u5HHmVD_uO8C
E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
Handbook of metaheuristics, 457-474, 2003
VjJm3zYAAAAJ:u5HHmVD_uO8C
...
article #428
A Library of Vehicle Routing Problems
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:bnK-pcrLprsC
T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese
None
VjJm3zYAAAAJ:bnK-pcrLprsC

article #429
This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for …
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:D_sINldO8mEC
S Louis, G Kendall
None
VjJm3zYAAAAJ:D_sINldO8mEC
'''
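The pagination line in the loop is dense, so here is what it does step by step. This is a sketch using only the standard library. The next_url value is a made-up example of the shape of such a link, not a real response from the API.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical example of a "next" link; a real one would come from
# results["serpapi_pagination"]["next"].
next_url = (
    "https://serpapi.com/search.json"
    "?author_id=VjJm3zYAAAAJ&engine=google_scholar_author&hl=en&num=100&start=100"
)

params = {
    "engine": "google_scholar_author",
    "author_id": "VjJm3zYAAAAJ",
    "hl": "en",
    "start": "0",
    "num": "100",
}

query_string = urlsplit(next_url).query      # everything after the "?"
next_params = dict(parse_qsl(query_string))  # (key, value) pairs turned into a dict
params.update(next_params)                   # overwrite "start" with the next offset

print(params["start"])
```

Keys that are already in params and not in the link (such as api_key in the real script) are left untouched by update().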

If you would like to scrape organic and cite results from all available pages, there's a dedicated blog post of mine: Scrape historic Google Scholar results using Python at SerpApi.

Disclaimer: I work for SerpApi.

Answer 2

This script will print all titles and authors from the page:

import re
import requests
from bs4 import BeautifulSoup


url = 'https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en'
api_url = 'https://scholar.google.com/citations?user={user}&hl=en&cstart={start}&pagesize={pagesize}'
user_id = re.search(r'user=(.*?)&', url).group(1)

start = 0
while True:
    soup = BeautifulSoup( requests.post(api_url.format(user=user_id, start=start, pagesize=100)).content, 'html.parser' )

    research_article = soup.find_all('tr',{'class':'gsc_a_tr'})

    for i, research in enumerate(research_article, 1):
        title = research.find('a',{'class':'gsc_a_at'})
        authors = research.find('div',{'class':'gs_gray'})

        print('{:04d} {:<80} {}'.format(start+i, title.text, authors.text))

    if len(research_article) != 100:
        break

    start += 100

Prints:

0001 Hyper-heuristics: A Survey of the State of the Art                               EK Burke, M Hyde, G Kendall, G Ochoa, E Ozcan, R Qu
0002 Hyper-heuristics: An emerging direction in modern search technology              E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
0003 Search methodologies: introductory tutorials in optimization and decision support techniques E Burke, EK Burke, G Kendall
0004 A tabu-search hyperheuristic for timetabling and rostering                       EK Burke, G Kendall, E Soubeiga
0005 A hyperheuristic approach to scheduling a sales summit                           P Cowling, G Kendall, E Soubeiga
0006 A classification of hyper-heuristic approaches                                   EK Burker, M Hyde, G Kendall, G Ochoa, E Özcan, JR Woodward
0007 Genetic algorithms                                                               K Sastry, D Goldberg, G Kendall

...

0431 Solution Methodologies for generating robust Airline Schedules                   F Bian, E Burke, S Jain, G Kendall, GM Koole, J Mulder, MCE Paelinck, ...
0432 A Triple objective function with a chebychev dynamic point specification approach to optimise the surface mount placement machine M Ayob, G Kendall
0433 A Library of Vehicle Routing Problems                                            T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese
0434 This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for … S Louis, G Kendall
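
If you would rather keep Selenium, as in the question, the two snippets can be joined by letting Selenium click the button and then passing driver.page_source to BeautifulSoup. This is only a rough sketch. It assumes the 'show more' button (id gsc_bpf_more) becomes disabled once all rows are loaded, which I have not verified against the live page, and the helper names expand_all and parse_articles are mine.

```python
import time

from bs4 import BeautifulSoup


def expand_all(driver, button_id="gsc_bpf_more", pause=1.0, max_clicks=50):
    """Click the 'show more' button until it is disabled (assumed end signal).

    Returns the number of clicks performed; max_clicks is a safety cap.
    """
    clicks = 0
    while clicks < max_clicks:
        button = driver.find_element("id", button_id)  # "id" is the value of By.ID
        if not button.is_enabled():
            break
        button.click()
        clicks += 1
        time.sleep(pause)  # crude wait for the new rows to be added
    return clicks


def parse_articles(html):
    """Return (title, authors) pairs from the profile's article table."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for row in soup.find_all("tr", {"class": "gsc_a_tr"}):
        title = row.find("a", {"class": "gsc_a_at"})
        authors = row.find("div", {"class": "gs_gray"})
        if title and authors:
            articles.append((title.text, authors.text))
    return articles
```

With a driver already opened on the profile URL, call expand_all(driver), then parse_articles(driver.page_source), and finish with driver.quit(). The fixed pause is the simplest possible wait; an explicit WebDriverWait on the row count would likely be more reliable.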

Answer 1 | 1 | 2022-02-08 14:24:09

Answer 2 | 0 | accepted | 2020-08-09 07:49:43
