
How to scrape a list of titles from a webpage?

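I am trying to scrape the list of courses available on the Udacity website:

https://www.udacity.com/courses/all

The page has a list of courses. I am trying to get the course names, i.e. all the aria-labels.

I tried to get them as follows, but I get no output: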

soup = BeautifulSoup(r.text, "html.parser")
name = soup.find_all("class", class_= "card_container__25DrK")

[screenshot]

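The problem with building the soup from only the initial HTML is that the site, reasonably, does not load everything at once. It adds the remaining courses dynamically to keep the initial page load time down. To work around this, you can use something like Selenium from Python.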

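Then we use a CSS selector to select h2 elements whose class attribute contains "card_title" (I looked at the markup for the courses on that site, and that is what it looks like).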
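To make the selector concrete, here is a minimal sketch run against a small hand-written HTML fragment. The class names `card_title__x1`, `card_title__x2` and `page_heading` are hypothetical stand-ins for the generated class names on the real page; only the `card_title` substring comes from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the hashed suffixes are made up for illustration.
SAMPLE_HTML = """
<div>
  <a class="card_container__abc"><h2 class="card_title__x1">Data Engineer</h2></a>
  <a class="card_container__abc"><h2 class="card_title__x2">SQL</h2></a>
  <h2 class="page_heading">All Courses</h2>
</div>
"""


def select_titles(html):
    """Return the text of every h2 whose class attribute contains 'card_title'."""
    soup = BeautifulSoup(html, "html.parser")
    # [class*="..."] is a substring match on the class attribute, so it
    # still matches if the generated suffix changes.
    return [h2.get_text(strip=True) for h2 in soup.select('h2[class*="card_title"]')]


titles = select_titles(SAMPLE_HTML)
print(titles)
```

The unrelated `h2` is skipped because its class does not contain the substring.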

You need to download a driver for Selenium. I am using Chrome on Windows, so I downloaded chromedriver.exe from the list of available drivers (ChromeDriver 104.0.5112.79) to get the latest stable version.

Sample code:

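from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--headless')

# I'm using Chrome in this example, you can search online for more on
# how Selenium works. This path points to where I downloaded the driver.
# Selenium 4 takes the driver path through a Service object; the older
# executable_path keyword on webdriver.Chrome is deprecated there.
service = Service(executable_path=r'C:\Users\User\Downloads\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=service, options=options)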
browser.get("https://www.udacity.com/courses/all")

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

# match h2 elements with a class containing "card_title"
for course in soup.select('h2[class*="card_title"]'):
    course_name = course.get_text()
    # do something with course_name, e.g add it to a list
    print(course_name)

browser.quit()

Output:

Data Engineer
Business Analytics
Product Manager
Programming for Data Science with Python
Introduction to Programming
Data Scientist
Data Analyst
C++
React
Blockchain Developer
Self-Driving Car Engineer
Machine Learning DevOps Engineer
Deep Learning
SQL
Front End Web Developer
Full Stack Web Developer
Java Programming
Digital Marketing
Artificial Intelligence for Trading
Data Structures and Algorithms
UX Designer
Java Developer
AWS Machine Learning Engineer
Intermediate Python
AI Programming with Python
Growth Product Manager
Intro to Self-Driving Cars
Cloud DevOps Engineer
Robotics Software Engineer
Deep Reinforcement Learning
Data Architect
Android Kotlin Developer
Computer Vision
Data Analysis and Visualization with Microsoft Power BI
Natural Language Processing
Cloud Developer
Zero Trust Security
Data Streaming
AI Product Manager
Introduction to Cybersecurity
iOS Developer
Data Engineering with Microsoft Azure
Intro to Machine Learning with TensorFlow
AWS Cloud Architect
Full Stack JavaScript Developer
Digital Project Management
Cloud Native Application Architecture
Intro to Machine Learning with PyTorch
Data Product Manager
Flying Car and Autonomous Flight Engineer
Sensor Fusion Engineer
Ethical Hacker
Predictive Analytics For Business
Intermediate JavaScript
Android Basics
Artificial Intelligence
Agile Software Development
Marketing Analytics
Data Visualization
Cloud DevOps using Microsoft Azure
Digital Freelancer
AI for Healthcare
Hybrid Cloud Engineer
Data Science for Business Leaders
AI for Business Leaders
Privacy Engineer
Site Reliability Engineer
Security Engineer
Cloud Developer using Microsoft Azure
Cloud Architect using Microsoft Azure
Machine Learning Engineer for Microsoft Azure
Security Architect
AI Engineer using Microsoft Azure
Data Privacy
Security Analyst
Enterprise Security
Intel® Edge AI for IoT Developers
Cloud Computing for Business Leaders
Programming for Data Science with R
RPA Developer with UiPath
Cybersecurity for Business Leaders
Intro to Information Security
Cyber-Physical Systems Security
Network Security
Getting Started with Google Workspace
Rapid Prototyping
Creating an Analytical Dataset
Problem Solving with Advanced Analytics
Classification Models
Product Design
Segmentation and Clustering
Time Series Forecasting
App Marketing
App Monetization
A/B Testing for Business Analysts
How to Build a Startup
Get Your Startup Started
Managing Remote Teams with Upwork
Google Cloud Digital Leader Training
Cloud Native Fundamentals
Hybrid Cloud Fundamentals
Intro to Data Analysis
SQL for Data Analysis
Database Systems Concepts & Design
Intro to Inferential Statistics
Spark
Data Analysis and Visualization
Cyber-Physical Systems Design & Analysis
Differential Equations in Action
Self-Driving Fundamentals: Featuring Apollo
AWS Machine Learning Foundations Course
Introduction to Machine Learning using Microsoft Azure
AI Fundamentals
Linear Algebra Refresher Course
Machine Learning: Unsupervised Learning
Big Data Analytics in Healthcare
Intel® Edge AI Fundamentals with OpenVINO™
Artificial Intelligence
Secure and Private AI
Model Building and Validation
Data Visualization and D3.js
Machine Learning for Trading
Machine Learning
Intro to Hadoop and MapReduce
Real-Time Analytics with Apache Storm
A/B Testing
Data Analysis with R
Knowledge-Based AI: Cognitive Systems
Introduction to TensorFlow Lite
Introduction to Computer Vision
Intro to TensorFlow for Deep Learning
Eigenvectors and Eigenvalues
Intro to Artificial Intelligence
Artificial Intelligence for Robotics
Intro to Deep Learning with PyTorch
AWS DeepRacer
Reinforcement Learning
Introduction to Machine Learning Course
Product Manager Interview Preparation
Microsoft Power Platform
Web Tooling & Automation
Front End Frameworks
Responsive Web Design Fundamentals
How to Install Android Studio
Android Basics: Multiscreen Apps
Website Performance Optimization
iOS Networking with Swift
JavaScript Design Patterns
Android Basics: User Input
Android Performance
Responsive Images
Xcode Debugging
Gradle for Android and Java
Build Native Mobile Apps with Flutter
JavaScript Promises
UIKit Fundamentals
Android Basics: User Interface
Client-Server Communication
What is Programming?
Building High Conversion Web Forms
Advanced Android App Development
Software Architecture & Design
Authentication & Authorization: OAuth
Intro to iOS App Development with Swift
Introduction to Operating Systems
Android Basics: Networking
Web Accessibility
Android Basics: Data Storage
Scalable Microservices with Kubernetes
Developing Android Apps with Kotlin
Browser Rendering Optimization
Learn Swift Programming Syntax
Offline Web Applications
Kotlin for Android Developers
UX Design for Mobile Developers
Software Development Process
Data Visualization in Tableau
Intro to Progressive Web Apps
Writing READMEs
Software Analysis & Testing
iOS Persistence and Core Data
Computer Networking
Firebase Analytics: iOS
Human-Computer Interaction
2D Game Development with libGDX
Intro to jQuery
How to create <anything> in Android
Introduction to Graduate Algorithms
Dynamic Web Applications with Sinatra
How to Make a Platformer Using libGDX
JavaScript Testing
Object-Oriented JavaScript
Localization Essentials
Compilers: Theory and Practice
HTML5 Canvas
Object Oriented Programming in Java
Designing RESTful APIs
GT - Refresher - Advanced OS
Intro to JavaScript
Grand Central Dispatch (GCD)
Continuous Integration and Deployment
Swift for Beginners
Intro to Statistics
Intro to HTML and CSS
Developing Android Apps
Introduction to Python Programming
Introduction to Virtual Reality
Objective-C for Swift Developers
Interactive 3D Graphics
Full Stack Foundations
High Performance Computer Architecture
AutoLayout
Kotlin Bootcamp for Programmers
Shell Workshop
Core ML: Machine Learning for iOS
Statistics
Intro to Theoretical Computer Science
Design of Computer Programs
Data Wrangling with MongoDB
Swift for Developers
Firebase in a Weekend: Android
Software Debugging
Deploying a Hadoop Cluster
Server-Side Swift
Networking for Web Developers
Intro to Physics
Intro to Relational Databases
ES6 - JavaScript Improved
Mobile Design and Usability for iOS
Intro to AJAX
Intro to Algorithms
The MVC Pattern in Ruby
WeChat Mini Program Development
Asynchronous JavaScript Requests
Embedded Systems
High Performance Computing
HTTP & Web Servers
Advanced Android with Kotlin
Computability, Complexity & Algorithms
Advanced Operating Systems
Passwordless Login Solutions for iOS
Version Control with Git
Firebase in a Weekend: iOS
Intro to Point & Click App Development
Deploying Applications with Heroku
Applied Cryptography
Java Programming Basics
C++ For Programmers
Intro to Backend
JavaScript and the DOM
Firebase Analytics: Android
Configuring Linux Web Servers
How to Make an iOS App
Intro to DevOps
Google Maps APIs
Passwordless Login Solutions for Android
Mobile Design and Usability for Android
iOS Design Patterns
Intro to Psychology
Engagement & Monetization | Mobile Games
Material Design for Android Developers
Craft Your Cover Letter
Refresh Your Resume
Strengthen Your LinkedIn Network & Brand
Data Science Interview Prep
Android Interview Prep
Machine Learning Interview Preparation
Front-End Interview Prep
Full-Stack Interview Prep
Data Structures & Algorithms in Swift
iOS Interview Prep
VR Interview Prep

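Method 1:

The recommended approach, the fastest and the best. It gets all the titles, and you can also adapt the code to get all the skills on Udacity.
If you watch the network traffic for the Udacity course list in Chrome developer tools, you will find that the list comes from https://www.udacity.com/data/catalog.json, so we can fetch the raw JSON and parse out the results very quickly with Python's json module.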

import requests
import json


# Get Main course content
url = "https://www.udacity.com/data/catalog.json"
response = requests.get(url)

# Load json data from the response
data_store = json.loads(response.text)

titles = []

# Get the titles
for option in data_store:
    if option['type'] == 'course':
        titles.append(option['payload']['title'])

print(titles)
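
The filtering step can be pulled into a small function and tried on sample data before hitting the network. This is only a sketch: the `SAMPLE_CATALOG` entries below are hypothetical (including the `degree` type) and merely mirror the shape the code above assumes, namely a list of objects with a `type` and a `payload` holding a `title`. The real catalog.json may contain other fields or types.

```python
import json

# Hypothetical data shaped like the entries the loop above expects.
SAMPLE_CATALOG = json.loads("""
[
  {"type": "course", "payload": {"title": "Intro to Statistics"}},
  {"type": "degree", "payload": {"title": "Data Engineer"}},
  {"type": "course", "payload": {"title": "Version Control with Git"}},
  {"type": "course", "payload": {}}
]
""")


def course_titles(catalog, wanted_type="course"):
    """Collect payload titles for entries of the given type, skipping gaps."""
    titles = []
    for entry in catalog:
        if entry.get("type") != wanted_type:
            continue
        title = entry.get("payload", {}).get("title")
        if title:  # ignore entries that carry no title
            titles.append(title)
    return titles


titles = course_titles(SAMPLE_CATALOG)
print(titles)
```

With live data, the same function could be called on the parsed response, for example `course_titles(requests.get(url).json())`.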

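Method 2:

This does not get all the titles: at the time of testing there were 272 titles, and this returns only 172.
You can scrape it with just BeautifulSoup and json. Inspect the page source and you will find the tag <script id="__NEXT_DATA__" type="application/json">, which contains all the data on their site as JSON. You can parse it into a Python dictionary and drill down to your titles. :-)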

from bs4 import BeautifulSoup
import requests
import json

# Get website content
url = "https://www.udacity.com/courses/all"
response = requests.get(url)

# Parse content with html parser and beautiful soup
soup = BeautifulSoup(response.text, "html.parser")
script = soup.find("script", id="__NEXT_DATA__")

# Load json data from script tag text scraped above
data_store = json.loads(script.text)

titles = []

# Get the data from the shared_store key in the data_store dictionary
shared_store = data_store["props"]["pageProps"]["header"]["store"]["__SHARED_STORE__"]

# There are two important keys in the shared_store (popularPrograms, schoolToPrograms)
# The titles that are first shown in the site are the contents of the `popularPrograms` key
schoolToPrograms = shared_store['schoolToPrograms']
popular = shared_store['popularPrograms']

# I want to add these first because they are the titles shown first in the website
for obj in popular:
    titles.append(obj['name'])

# These are the rest of the contents
for obj in schoolToPrograms:
    for item in obj['items']:
        titles.append(item["name"])

print(titles)
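
The same extraction can be exercised offline. The sketch below runs against a tiny hand-written page; its JSON content is hypothetical and far smaller than the real `__NEXT_DATA__` payload, but the key path matches the one used above. Dropping duplicates is an addition of this sketch, on the assumption that a program listed under `popularPrograms` may also appear under a school.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page: a stripped-down stand-in for the real markup.
SAMPLE_PAGE = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"header": {"store": {"__SHARED_STORE__": {
  "popularPrograms": [{"name": "Data Engineer"}],
  "schoolToPrograms": [
    {"items": [{"name": "SQL"}, {"name": "React"}]},
    {"items": [{"name": "Data Engineer"}]}
  ]
}}}}}}
</script>
</body></html>
"""


def titles_from_next_data(html):
    """Parse the embedded JSON and return program names without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", id="__NEXT_DATA__")
    if script is None:
        return []  # the tag is missing, e.g. because the page layout changed
    data = json.loads(script.string)
    store = data["props"]["pageProps"]["header"]["store"]["__SHARED_STORE__"]

    names = [program["name"] for program in store["popularPrograms"]]
    for school in store["schoolToPrograms"]:
        names.extend(item["name"] for item in school["items"])
    # dict.fromkeys keeps first-seen order while dropping repeats
    return list(dict.fromkeys(names))


titles = titles_from_next_data(SAMPLE_PAGE)
print(titles)
```

Returning an empty list when the script tag is absent keeps the caller from crashing on `None`, which is the first thing to break if the site changes its markup.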

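The main problem here is that BeautifulSoup by itself only does static scraping, i.e. it only sees the static HTML. You will need something like Selenium together with BeautifulSoup to scrape dynamically generated HTML.

You may find the following tutorial useful: Web Scraping with BeautifulSoup and Selenium

Also, make sure you are targeting the correct tag. For example, in your screenshot the target is an anchor tag, so your find_all should look like this: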

name = soup.find_all('a', class_='card_container__25DrK')

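However, do check the HTML your program actually retrieves to make sure you are targeting the right tag and specifying the right attribute values.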
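
One way to run that check, sketched on a hypothetical fragment (the `href` values, labels and surrounding markup are made up; only the class name comes from the question), is to compare what different `find_all` calls return. The first argument of `find_all` is a tag name, so passing `"class"` searches for `<class>` elements and finds none.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; aria-label mirrors the attribute the question asks about.
SAMPLE_HTML = """
<ul>
  <li><a class="card_container__25DrK" href="/course/a" aria-label="Data Engineer">...</a></li>
  <li><a class="card_container__25DrK" href="/course/b" aria-label="SQL">...</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Looks for <class> tags, which do not exist, so the result is empty.
wrong = soup.find_all("class", class_="card_container__25DrK")

# Looks for <a> tags carrying that class.
right = soup.find_all("a", class_="card_container__25DrK")
labels = [a.get("aria-label") for a in right]

print(len(wrong), labels)
```

If the second call also comes back empty on the real page, the elements are most likely added by JavaScript after the initial response.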

