
Python web scraping with bs4 on Patreon

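I wrote a script that checks a few blogs to see whether new posts have been added. However, when I try to do this on Patreon, I can't find the right elements with bs4.

Let's take https://www.patreon.com/cubecoders as an example.

Say I want to get the number of exclusive posts shown under the "Become a patron" section, which is 25 as of now.

This code works just fine: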

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)

Output: 25

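Now I want to get the title of the newest post, which is "New in AMP 2.0.2 - Integrated SCP/SFTP server!" as of now. Inspecting the title in my browser, I see that it is contained in a span tag with the class 'sc-1di2uql-1 vYcWR'.

However, when I run this code, I can't get the element: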

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)

Output: None

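I have already tried to get the element with XPath and with CSS selectors, but neither worked. I suspect this is because the site is rendered with JavaScript, so the elements are not available until the page has been rendered. When I render the site with Selenium first, I can see the title when I print all div tags on the page, but when I want to get just the first title, I can't access it.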
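One quick way to test the JavaScript suspicion is to check whether the class strings occur at all in the raw HTML that requests downloads. The sketch below uses a hypothetical helper, `classes_in_static_html`, and a made-up HTML snippet standing in for the real response; in practice the first argument would be `requests.get(...).text`.

```python
def classes_in_static_html(html_text, class_strings):
    """Map each class string to whether it occurs in the raw HTML text."""
    return {cls: cls in html_text for cls in class_strings}


# Made-up stand-in for the server response (assumption, not Patreon's real markup):
# the post counter is present, the post title span is not.
sample_html = '<div class="sc-AxjAm fXpRSH">25</div><div id="root"></div>'

result = classes_in_static_html(
    sample_html,
    ["sc-AxjAm fXpRSH", "sc-1di2uql-1 vYcWR"],
)
print(result)
```

If a class string is missing from the downloaded text but visible in the browser inspector, the element was most likely added by JavaScript after the page loaded, and no bs4 selector will find it in the static HTML.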

Do you know a workaround? Thanks in advance!

EDIT: In Selenium I can do this:

from selenium import webdriver
# Raw string so the backslashes in the Windows path are not treated as escapes
browser = webdriver.Chrome(r"C:\webdrivers\chromedriver.exe")
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")


def find_text(divs):
    for div in divs:
        for span in div.find_elements_by_tag_name("span"):
            if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
                return span.text

            
print(find_text(divs))
browser.close()

Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!

When I search for the span with the class 'sc-1di2uql-1 vYcWR' directly from the start, I get no result. Could it be that the find_elements method doesn't look deeper inside nested tags?

The data you see is loaded via Ajax from their API. You can use the requests module to load that data directly.

For example:

import re
import json
import requests
from bs4 import BeautifulSoup


url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': url
}


with requests.session() as s:
    html_text = s.get(url, headers=headers).text
    campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
    data = s.get(api_url, headers=headers, params={'filter[campaign_id]': campaign_id, 'filter[contains_exclusive_posts]': 'true', 'sort': '-published_at'}).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some information to screen:
    for d in data['data']:
        print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))

Prints:

New in AMP 2.0.2 - Integrated SCP/SFTP server!                         2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal!                                         2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List                                    2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system                                    2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see?                         2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation!                             2020-05-21T12:19:23.000+00:00
Another day, another video tutorial!                                   2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes!                                        2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux                          2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist?      2020-05-04T01:14:39.000+00:00
Well that was unexpected...                                            2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support!                                      2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features                                    2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers!                      2020-03-11T14:53:31.000+00:00
Preparing for Enterprise                                               2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here!                                2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress!                               2020-02-26T17:53:53.000+00:00
Wallpaper!                                                             2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many.                          2020-02-06T15:26:09.000+00:00
Time for a new module!                                                 2020-01-07T13:41:17.000+00:00
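Since the original goal was to get the newest title and to notice new posts, the JSON response can be reduced to just that. The following is a minimal sketch, assuming the response has the shape used above (`data` -> `attributes` -> `title` / `published_at`). The helpers `newest_post` and `posts_since` are hypothetical names, and `sample` is a hand-built stand-in made from three rows of the printed output, not a real API response.

```python
from datetime import datetime


def _published(post):
    """Parse a post's ISO 8601 'published_at' timestamp."""
    return datetime.fromisoformat(post["attributes"]["published_at"])


def newest_post(payload):
    """Return (title, published_at) of the most recent post, or None if empty."""
    posts = payload.get("data", [])
    if not posts:
        return None
    latest = max(posts, key=_published)
    return latest["attributes"]["title"], latest["attributes"]["published_at"]


def posts_since(payload, last_seen):
    """Return titles of posts published strictly after the timestamp last_seen."""
    cutoff = datetime.fromisoformat(last_seen)
    return [p["attributes"]["title"]
            for p in payload.get("data", [])
            if _published(p) > cutoff]


# Stand-in payload built from the output shown above
sample = {"data": [
    {"attributes": {"title": "New in AMP 2.0.2 - Integrated SCP/SFTP server!",
                    "published_at": "2020-07-17T13:28:49.000+00:00"}},
    {"attributes": {"title": "AMP Enterprise Pricing Reveal!",
                    "published_at": "2020-07-07T10:02:02.000+00:00"}},
    {"attributes": {"title": "AMP Enterprise Edition Waiting List",
                    "published_at": "2020-07-03T13:25:35.000+00:00"}},
]}

print(newest_post(sample))
print(posts_since(sample, "2020-07-05T00:00:00.000+00:00"))
```

In the script above you would pass `data` instead of `sample`, and store the newest `published_at` between runs so the next run can report only posts that appeared after it.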
