
WebScraping in Python with BS4 - Getting dynamically generated list

I need to scrape the list of "Best Coding Bootcamps" on this page: https://www.switchup.org/rankings/best-coding-bootcamps

My assignment says this should be possible with Beautiful Soup (and not with Selenium). However, when I try, the resulting HTML doesn't contain the list of bootcamps, only what appears to be an empty element of class:

My question is: do you think this content can be retrieved with Beautiful Soup alone, without resorting to Selenium? If Selenium is necessary, what would be some simple code to do so?

The code so far is very simple:

from bs4 import BeautifulSoup
import requests
import time

url = "https://www.switchup.org/rankings/best-coding-bootcamps"

r = requests.get(url)

soup = BeautifulSoup(r.content, 'lxml')
time.sleep(5)

print(soup)

Thank you very much in advance.

You're right: the HTML returned for the URL you posted doesn't contain the list. The data is loaded through AJAX from another URL.

If you inspect the Network tab in the Firefox/Chrome developer tools, you can find this URL (the data is in JSON format):

import requests
from bs4 import BeautifulSoup

url = 'https://www.switchup.org/chimera/v1/bootcamp-list?mainTemplate=bootcamp-list%2Frankings&path=%2Frankings%2Fbest-coding-bootcamps&isDataTarget=false&featuredSchools=0&logoTag=logo&logoSize=original&numSchools=0&perPage=0&rankType=BootcampRankings&rankYear=2020&recentReview=true&reviewLength=50&numLocations=5&numSubjects=5&numCourses=5&sortOn=name&withReviews=false'

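# The endpoint returns JSON, so decode the response body directly
data = requests.get(url).json()

for i, bootcamp in enumerate(data['content']['bootcamps'], 1):
    # Each description is parsed with BeautifulSoup to get its plain text
    soup = BeautifulSoup(bootcamp['description'], 'html.parser')
    print('{}. {}'.format(i, bootcamp['name']))
    print(soup.get_text(strip=True))
    print('-' * 80)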

Prints:

1. Le Wagon
Le Wagon is an intensive international coding bootcamp geared toward career changers and entrepreneurs who want to gain coding skills. Participants complete 450 hours of coding in 9 weeks full-time or 24 weeks part-time, which includes building their own web app. After completing the program, students join an international alumni network of 6,000+ for career support and community.
--------------------------------------------------------------------------------
2. App Academy
App Academy teaches participants everything they need to know about software engineering in just 12 weeks. Their full-time bootcamps have helped over 2,000 graduates find jobs at more than 850 companies. Their deferred tuition plan means participants pay for the program only after they’ve landed their first web development job.
--------------------------------------------------------------------------------
3. Ironhack
Ironhack offers two full-time bootcamps focused on web design, a 26-week program in web development and a nine-week program in user experience and user interface design. Students can access extensive career development services post-graduation including portfolio building and interview practice; scholarships are available for underrepresented populations and veterans.
--------------------------------------------------------------------------------

...and so on.
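
As a side note, the long query string in the URL above is easier to read and tweak when it is decoded into a dictionary. The following is only a sketch: the names `API_BASE`, `PARAMS` and `build_url` are made up for illustration, and the parameter values are simply the ones decoded from the URL found in the Network tab. What each parameter means on the server side is not documented here, so treat any change to them as an experiment.

```python
from urllib.parse import urlencode

# Hypothetical names; the base URL and values are decoded from the URL above.
API_BASE = "https://www.switchup.org/chimera/v1/bootcamp-list"

PARAMS = {
    "mainTemplate": "bootcamp-list/rankings",
    "path": "/rankings/best-coding-bootcamps",
    "isDataTarget": "false",
    "featuredSchools": 0,
    "logoTag": "logo",
    "logoSize": "original",
    "numSchools": 0,
    "perPage": 0,
    "rankType": "BootcampRankings",
    "rankYear": 2020,
    "recentReview": "true",
    "reviewLength": 50,
    "numLocations": 5,
    "numSubjects": 5,
    "numCourses": 5,
    "sortOn": "name",
    "withReviews": "false",
}


def build_url(base=API_BASE, params=PARAMS):
    """Rebuild the request URL by percent-encoding the parameters."""
    # urlencode() escapes "/" as %2F, matching the URL seen in the Network tab.
    return base + "?" + urlencode(params)


print(build_url())
```

The resulting string can be passed to `requests.get(...)` exactly as in the code above. Alternatively, `requests.get(API_BASE, params=PARAMS)` lets requests do the encoding itself.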
