
Python web scraping with bs4 on Patreon

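I wrote a script that looks up a few blogs and checks whether a new post has been added. However, when I try to do this on Patreon, I cannot find the right element with bs4.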

Let's take https://www.patreon.com/cubecoders as an example.

Say I want to get the number of exclusive posts under the "Become a patron" section, which is 25 as of now.

This code works just fine:

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)

Output: 25

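Now I want to get the title of the newest post, which is "New in AMP 2.0.2 - Integrated SCP/SFTP server!" as of now. Inspecting the title in my browser, I see that it is contained in a span tag with the class 'sc-1di2uql-1 vYcWR'.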

However, when I run this code, I cannot fetch the element:

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)

Output: None

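I have already tried to fetch the element with XPath and with CSS selectors, without success. I think this may be because the site is rendered with JavaScript first, so I cannot access the elements before they have been rendered. When I render the site with Selenium first, I can see the title when I print all the div tags on the page, but when I want only the first title, I cannot access it.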

Do you know a workaround? Thanks in advance!
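One way to test the JavaScript-rendering hypothesis above is to check whether the class name occurs anywhere in the markup the server actually sends. The sketch below uses only the standard library; `served_html` and `rendered_html` are hypothetical stand-ins for the raw response text and for the DOM after scripts have run, not real Patreon output.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class token that appears in the markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in_html(html):
    """Return the set of class tokens used anywhere in the given HTML."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


# Hypothetical examples: an empty application shell versus a rendered page.
served_html = '<div id="root"></div><script src="app.js"></script>'
rendered_html = '<div id="root"><span class="sc-1di2uql-1 vYcWR">Title</span></div>'

print("vYcWR" in classes_in_html(served_html))    # False
print("vYcWR" in classes_in_html(rendered_html))  # True
```

If the class is missing from `requests.get(...).text` but present in the browser's inspector, the element is most likely created client-side, and no bs4 selector will find it in the raw response.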

EDIT: In Selenium I can do this:

from selenium import webdriver
browser = webdriver.Chrome(r"C:\webdrivers\chromedriver.exe")  # raw string so the backslashes are not treated as escapes
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")


def find_text(divs):
    for div in divs:
        for span in div.find_elements_by_tag_name("span"):
            if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
                return span.text

            
print(find_text(divs))
browser.close()

Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!

When I search directly for the span with the class 'sc-1di2uql-1 vYcWR' from the start, I get no result. Could it be that the find_elements method does not look deeper into nested tags?
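On the nesting question: as far as I know, both Selenium's element lookups and bs4's `find` search all descendants, so depth alone should not hide the span. The sketch below shows this for bs4 only, on a hypothetical static snippet rather than the real page; it says nothing about timing, which may well be the actual cause in a JavaScript-rendered page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the span sits three levels below the top of the document.
html = '<div><div><p><span class="sc-1di2uql-1 vYcWR">New post</span></p></div></div>'
soup = BeautifulSoup(html, "html.parser")

# Default search is recursive, so the deeply nested span is found.
deep = soup.find("span", class_="sc-1di2uql-1 vYcWR")

# With recursive=False only direct children are examined, so nothing matches.
shallow = soup.find("span", class_="sc-1di2uql-1 vYcWR", recursive=False)

# A CSS selector matches each class token independently of their order.
by_css = soup.select_one("span.sc-1di2uql-1.vYcWR")

print(deep.text)     # New post
print(shallow)       # None
print(by_css.text)   # New post
```

Passing the two class names as one string to `class_` matches the exact attribute value, so it breaks if the site reorders the tokens; the CSS selector form does not have that problem.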

The data you see is loaded via Ajax from their API. You can use the requests module to load that data directly.

For example:

import re
import json
import requests
from bs4 import BeautifulSoup


url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': url
}


with requests.session() as s:
    html_text = s.get(url, headers=headers).text
    campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
    data = s.get(api_url, headers=headers, params={'filter[campaign_id]': campaign_id, 'filter[contains_exclusive_posts]': 'true', 'sort': '-published_at'}).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some information to screen:
    for d in data['data']:
        print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))

Prints:

New in AMP 2.0.2 - Integrated SCP/SFTP server!                         2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal!                                         2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List                                    2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system                                    2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see?                         2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation!                             2020-05-21T12:19:23.000+00:00
Another day, another video tutorial!                                   2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes!                                        2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux                          2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist?      2020-05-04T01:14:39.000+00:00
Well that was unexpected...                                            2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support!                                      2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features                                    2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers!                      2020-03-11T14:53:31.000+00:00
Preparing for Enterprise                                               2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here!                                2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress!                               2020-02-26T17:53:53.000+00:00
Wallpaper!                                                             2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many.                          2020-02-06T15:26:09.000+00:00
Time for a new module!                                                 2020-01-07T13:41:17.000+00:00
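Since the original goal was to notice new posts, the response can be reduced to just the newest entry. This is a minimal sketch that assumes the payload has the `data` / `attributes` / `title` / `published_at` shape used in the code above; `newest_post` and `sample_payload` are illustrative names, and the sample holds only two entries copied from the printed output.

```python
from datetime import datetime


def newest_post(payload):
    """Return (title, published_at) of the most recent post, or None if there are none."""
    posts = []
    for item in payload.get("data", []):
        attrs = item.get("attributes", {})
        title = attrs.get("title")
        published = attrs.get("published_at")
        if title and published:
            # Parse the timestamp so that comparison is chronological, not textual.
            posts.append((datetime.fromisoformat(published), title))
    if not posts:
        return None
    published, title = max(posts)
    return title, published


# Hypothetical, trimmed payload; deliberately listed oldest first.
sample_payload = {
    "data": [
        {"attributes": {"title": "AMP Enterprise Pricing Reveal!",
                        "published_at": "2020-07-07T10:02:02.000+00:00"}},
        {"attributes": {"title": "New in AMP 2.0.2 - Integrated SCP/SFTP server!",
                        "published_at": "2020-07-17T13:28:49.000+00:00"}},
    ]
}

title, published = newest_post(sample_payload)
print(title)  # New in AMP 2.0.2 - Integrated SCP/SFTP server!
```

Storing the returned timestamp between runs and comparing it with the next result would be one way to detect that a new post has appeared.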

