Using Selenium and Beautiful Soup together

I'm scraping a Google Scholar profile page, and right now I have Python code that uses the Beautiful Soup library to collect data from the page:

import requests
from bs4 import BeautifulSoup

url = "https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en"

# Fetch the profile page once and parse it
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')

# Each article is a table row with the class "gsc_a_tr"
research_article = soup.find_all('tr', {'class': 'gsc_a_tr'})

for research in research_article:
    title = research.find('a', {'class': 'gsc_a_at'}).text
    authors = research.find('div', {'class': 'gs_gray'}).text

    print('Title:', title, '\n', '\nAuthors:', authors)

I also have Python code that uses the Selenium library to automate the profile page and click the 'show more' button:

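from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# executable_path is the Selenium 3 style; Selenium 4 takes the driver path via a Service object
driver = webdriver.Chrome(executable_path="/Applications/chromedriver84")
driver.get(url)

# Wait up to 10s until the element, located by id, is loaded on the page.
# Clicking only after a successful wait avoids a NameError when the wait times out.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'gsc_bpf_more'))
)
element.click()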

How can I combine these two blocks of code so that I can click the 'show more' button and scrape the entire page? Thanks in advance!
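One way the two snippets could be joined is to let Selenium expand the page and then hand `driver.page_source` to Beautiful Soup. The following is only a rough sketch under several assumptions: `parse_articles` and `scrape_profile` are hypothetical helper names, a matching chromedriver is discoverable by Selenium, the 'show more' button becomes disabled once every row has loaded, and a fixed pause is long enough for new rows to render.

```python
import time

from bs4 import BeautifulSoup


def parse_articles(html):
    """Return (title, authors) pairs found in a profile page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for row in soup.find_all("tr", {"class": "gsc_a_tr"}):
        title = row.find("a", {"class": "gsc_a_at"})
        authors = row.find("div", {"class": "gs_gray"})
        if title is None or authors is None:
            continue  # skip rows that do not look like article rows
        articles.append((title.text, authors.text))
    return articles


def scrape_profile(url, max_clicks=50, pause=2.0):
    """Open the page in Chrome, expand it, then parse the rendered HTML."""
    # Imported here so parse_articles stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes chromedriver can be found
    try:
        driver.get(url)
        button = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "gsc_bpf_more"))
        )
        # Assumption: the button is disabled once all rows are shown.
        clicks = 0
        while button.is_enabled() and clicks < max_clicks:
            button.click()
            clicks += 1
            time.sleep(pause)  # crude wait for the new rows to render
        return parse_articles(driver.page_source)
    finally:
        driver.quit()
```

Calling `scrape_profile(url)` with the URL from the question would then return the list of pairs. The `max_clicks` cap is there only so the loop cannot run forever if the disabled-button assumption turns out to be wrong.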

The approach in this answer is very similar to Andrej Kesely's answer; it differs in the if statement and code style, and it also provides an alternative solution.

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "user": "VjJm3zYAAAAJ", # user-id
    "hl": "en",             # language
    "gl": "us",             # country to search from
    "cstart": 0,            # articles page. 0 is the first page
    "pagesize": "100"       # articles per page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

articles_is_present = True

while articles_is_present:
    # timeout to stop waiting for response after 30 sec
    html = requests.post("https://scholar.google.com/citations", params=params, headers=headers, timeout=30) 
    soup = BeautifulSoup(html.text, "lxml")

    for index, article in enumerate(soup.select("#gsc_a_b .gsc_a_t"), start=1):
        article_title = article.select_one(".gsc_a_at").text
        article_link = f'https://scholar.google.com{article.select_one(".gsc_a_at")["href"]}'
        article_authors = article.select_one(".gsc_a_at+ .gs_gray").text
        article_publication = article.select_one(".gs_gray+ .gs_gray").text

        print(f"article #{int(params['cstart']) + index}",
              article_title,
              article_link,
              article_authors,
              article_publication, sep="\n")

    # this selector is checking for the .class that contains: "There are no articles in this profile."
    # example link: https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en&cstart=500&pagesize=100
    if soup.select_one(".gsc_a_e"):
        articles_is_present = False
    else:
        params["cstart"] += 100  # paginate to the next page


# output:
'''
article #1
Hyper-heuristics: A survey of the state of the art
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:5N-NJrZHaHcC
EK Burke, M Gendreau, M Hyde, G Kendall, G Ochoa, E Özcan, R Qu
Journal of the Operational Research Society 64 (12), 1695-1724, 2013

article #2
Hyper-heuristics: An emerging direction in modern search technology
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:u5HHmVD_uO8C
E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
Handbook of metaheuristics, 457-474, 2003
...
article #428
A Library of Vehicle Routing Problems
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:bnK-pcrLprsC
T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese

article #429
This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for …
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:D_sINldO8mEC
S Louis, G Kendall
'''
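The loop above walks through the profile by raising `cstart` in steps of `pagesize`. As a small sketch of which URLs that produces, the block below builds them with the standard library's `urlencode`. `page_urls` is a hypothetical helper, and it assumes the article count is known in advance (429, taken from the output above). The real loop does not know the total, so it requests one extra page and stops when it sees `.gsc_a_e`.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/citations"


def page_urls(user_id, total_articles, pagesize=100):
    """Yield one profile URL per page needed to cover total_articles rows."""
    for cstart in range(0, total_articles, pagesize):
        query = urlencode({
            "user": user_id,
            "hl": "en",
            "cstart": cstart,      # offset of the first article on the page
            "pagesize": pagesize,  # number of articles per page
        })
        yield f"{BASE_URL}?{query}"


urls = list(page_urls("VjJm3zYAAAAJ", total_articles=429))
print(len(urls))  # prints 5
print(urls[-1])   # last page starts at cstart=400
```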

Alternatively, you can achieve the same result with the Google Scholar Author API from SerpApi. It's a paid API with a free plan.

Essentially, it does almost the same thing, except that you don't have to think about how to scale the number of requests, find a good proxy or CAPTCHA provider (that is handled for the user), or work out how to scrape JavaScript-rendered data without browser automation.

Example code to scrape all author articles:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "hl": "en",
    "author_id": "VjJm3zYAAAAJ",
    "start": "0",
    "num": "100"
}

search = GoogleSearch(params)

articles_is_present = True

while articles_is_present:
    results = search.get_dict()

    for index, article in enumerate(results["articles"], start=1):
        title = article["title"]
        link = article["link"]
        authors = article["authors"]
        publication = article.get("publication")
        citation_id = article["citation_id"]

        print(f"article #{int(params['start']) + index}", 
              title, 
              link, 
              authors, 
              publication, 
              citation_id, sep="\n")


    if "next" in results.get("serpapi_pagination", {}):
        # split the "next" URL's query string into a dict and update the search params for the next page
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        articles_is_present = False

# output:
'''
article #1
Hyper-heuristics: A survey of the state of the art
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:5N-NJrZHaHcC
EK Burke, M Gendreau, M Hyde, G Kendall, G Ochoa, E Özcan, R Qu
Journal of the Operational Research Society 64 (12), 1695-1724, 2013
VjJm3zYAAAAJ:5N-NJrZHaHcC

article #2
Hyper-heuristics: An emerging direction in modern search technology
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&pagesize=100&citation_for_view=VjJm3zYAAAAJ:u5HHmVD_uO8C
E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
Handbook of metaheuristics, 457-474, 2003
VjJm3zYAAAAJ:u5HHmVD_uO8C
...
article #428
A Library of Vehicle Routing Problems
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:bnK-pcrLprsC
T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese
None
VjJm3zYAAAAJ:bnK-pcrLprsC

article #429
This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for …
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VjJm3zYAAAAJ&cstart=400&pagesize=100&citation_for_view=VjJm3zYAAAAJ:D_sINldO8mEC
S Louis, G Kendall
None
VjJm3zYAAAAJ:D_sINldO8mEC
'''
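The pagination step in the script above takes the `next` URL from the response, extracts its query string, and merges the result into the search parameters. Here is a standalone sketch of that step using only the standard library. The `next_url` value is made up for illustration, so the exact shape of a real response should be treated as an assumption.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical "next" link, shaped like a paginated API URL for illustration only.
next_url = (
    "https://serpapi.com/search.json"
    "?author_id=VjJm3zYAAAAJ&engine=google_scholar_author&hl=en&num=100&start=100"
)

params = {
    "engine": "google_scholar_author",
    "hl": "en",
    "author_id": "VjJm3zYAAAAJ",
    "start": "0",
    "num": "100",
}

# urlsplit isolates the query string; parse_qsl turns it into (key, value) pairs.
next_params = dict(parse_qsl(urlsplit(next_url).query))
params.update(next_params)

print(params["start"])  # prints 100
```

Every value produced by `parse_qsl` is a string, which is why the script wraps `params['start']` in `int()` before adding the index.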

If you would like to scrape organic and cite results from all available pages, I have a dedicated blog post at SerpApi: Scrape historic Google Scholar results using Python.

Disclaimer: I work for SerpApi.

This script will print all titles and authors from the page:

import re
import requests
from bs4 import BeautifulSoup


url = 'https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en'
api_url = 'https://scholar.google.com/citations?user={user}&hl=en&cstart={start}&pagesize={pagesize}'
user_id = re.search(r'user=(.*?)&', url).group(1)

start = 0
while True:
    soup = BeautifulSoup( requests.post(api_url.format(user=user_id, start=start, pagesize=100)).content, 'html.parser' )

    research_article = soup.find_all('tr',{'class':'gsc_a_tr'})

    for i, research in enumerate(research_article, 1):
        title = research.find('a',{'class':'gsc_a_at'})
        authors = research.find('div',{'class':'gs_gray'})

        print('{:04d} {:<80} {}'.format(start+i, title.text, authors.text))

    if len(research_article) != 100:
        break

    start += 100

Prints:

0001 Hyper-heuristics: A Survey of the State of the Art                               EK Burke, M Hyde, G Kendall, G Ochoa, E Ozcan, R Qu
0002 Hyper-heuristics: An emerging direction in modern search technology              E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
0003 Search methodologies: introductory tutorials in optimization and decision support techniques E Burke, EK Burke, G Kendall
0004 A tabu-search hyperheuristic for timetabling and rostering                       EK Burke, G Kendall, E Soubeiga
0005 A hyperheuristic approach to scheduling a sales summit                           P Cowling, G Kendall, E Soubeiga
0006 A classification of hyper-heuristic approaches                                   EK Burker, M Hyde, G Kendall, G Ochoa, E Özcan, JR Woodward
0007 Genetic algorithms                                                               K Sastry, D Goldberg, G Kendall

...

0431 Solution Methodologies for generating robust Airline Schedules                   F Bian, E Burke, S Jain, G Kendall, GM Koole, J Mulder, MCE Paelinck, ...
0432 A Triple objective function with a chebychev dynamic point specification approach to optimise the surface mount placement machine M Ayob, G Kendall
0433 A Library of Vehicle Routing Problems                                            T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese
0434 This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for … S Louis, G Kendall
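The script above pulls the user id out of the URL with `re.search(r'user=(.*?)&', url)`, which only matches when another parameter follows `user`. A possible alternative, sketched here with the standard library's URL parser, does not depend on parameter order. `extract_user_id` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def extract_user_id(profile_url):
    """Return the value of the "user" query parameter, or None if absent."""
    query = parse_qs(urlsplit(profile_url).query)
    values = query.get("user")
    return values[0] if values else None


# Works whether "user" comes first or last in the query string.
print(extract_user_id("https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en"))
print(extract_user_id("https://scholar.google.com/citations?hl=en&user=VjJm3zYAAAAJ"))
```

Both calls print `VjJm3zYAAAAJ`. With the original regex, the second URL would produce no match and `.group(1)` would raise an `AttributeError`.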
