Python web scraping with bs4 on Patreon

Question

I've written a script that looks up a few blogs and sees if a new post has been added. However, when I try to do this on Patreon I cannot find the right element with bs4.

Let's take https://www.patreon.com/cubecoders for example.

Say I want to get the number of exclusive posts under the 'Become a patron to' section, which would be 25 as of now.

This code works just fine:

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
number_of_exclusive_posts = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(number_of_exclusive_posts)

Output: 25

Now, I want to get the title of the newest post, which would be 'New in AMP 2.0.2 - Integrated SCP/SFTP server!' as of now. I inspect the title in my browser and see that it is contained in a span tag with the class 'sc-1di2uql-1 vYcWR'.

However, when I try to run this code I cannot fetch the element:

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)

Output: None

I've already tried to fetch the element with XPath and with a CSS selector, but neither worked. I thought it might be because the site is rendered with JavaScript first, so I cannot access the elements before they are rendered. When I use Selenium to render the site first, I can see the title when printing out all div tags on the page, but when I try to get only the very first title I can't access it.

Do you guys know a workaround maybe? Thanks in advance!

EDIT: In Selenium I can do this:

from selenium import webdriver
browser = webdriver.Chrome(r"C:\webdrivers\chromedriver.exe")  # raw string so the backslashes are not treated as escapes
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")


def find_text(divs):
    for div in divs:
        for span in div.find_elements_by_tag_name("span"):
            if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
                return span.text

            
print(find_text(divs))
browser.close()

Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!

When I search directly for the spans with class 'sc-1di2uql-1 vYcWR' from the start, it doesn't give me the result, though. Could it be that the find_elements method does not look deeper inside for nested tags?

Answer 1

The data you see is loaded via Ajax from their API, so it is not in the HTML that requests receives. You can use the requests module to load the data from the API directly.

For example:

import json
import re
import requests


url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': url
}


with requests.Session() as s:
    html_text = s.get(url, headers=headers).text
    # the campaign id appears in the page source as part of an API URL
    campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
    params = {
        'filter[campaign_id]': campaign_id,
        'filter[contains_exclusive_posts]': 'true',
        'sort': '-published_at',
    }
    data = s.get(api_url, headers=headers, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some information to screen:
    for d in data['data']:
        print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))

Prints:

New in AMP 2.0.2 - Integrated SCP/SFTP server!                         2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal!                                         2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List                                    2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system                                    2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see?                         2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation!                             2020-05-21T12:19:23.000+00:00
Another day, another video tutorial!                                   2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes!                                        2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux                          2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist?      2020-05-04T01:14:39.000+00:00
Well that was unexpected...                                            2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support!                                      2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features                                    2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers!                      2020-03-11T14:53:31.000+00:00
Preparing for Enterprise                                               2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here!                                2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress!                               2020-02-26T17:53:53.000+00:00
Wallpaper!                                                             2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many.                          2020-02-06T15:26:09.000+00:00
Time for a new module!                                                 2020-01-07T13:41:17.000+00:00
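
Since the goal in the question is to detect whether a new post has been added, here is a minimal sketch of how the parsed response could be checked. It assumes the response keeps the shape used above (a 'data' list whose items carry 'attributes' with 'title' and 'published_at'). The names sample, newest_post and has_new_post are made up for illustration, and the sample dict is a hand-written stand-in for the real response, not actual API output.

```python
from datetime import datetime

# Hypothetical stand-in for the parsed API response (only the fields used here).
sample = {
    "data": [
        {"attributes": {"title": "AMP Enterprise Pricing Reveal!",
                        "published_at": "2020-07-07T10:02:02.000+00:00"}},
        {"attributes": {"title": "New in AMP 2.0.2 - Integrated SCP/SFTP server!",
                        "published_at": "2020-07-17T13:28:49.000+00:00"}},
    ]
}


def newest_post(data):
    """Return (title, published_at) of the most recent post, or None if there are none."""
    posts = data.get("data", [])
    if not posts:
        return None
    # Compare parsed timestamps instead of relying on the order of the list.
    newest = max(posts, key=lambda p: datetime.fromisoformat(p["attributes"]["published_at"]))
    attrs = newest["attributes"]
    return attrs["title"], attrs["published_at"]


def has_new_post(data, last_seen_published_at):
    """True if the newest post is more recent than the timestamp stored on the previous run."""
    latest = newest_post(data)
    if latest is None:
        return False
    if last_seen_published_at is None:
        return True  # nothing stored yet, so any post counts as new
    return datetime.fromisoformat(latest[1]) > datetime.fromisoformat(last_seen_published_at)


print(newest_post(sample)[0])
print(has_new_post(sample, "2020-07-07T10:02:02.000+00:00"))
```

In a real script you would pass the data variable from the request above instead of sample, and store the last seen published_at value between runs (for example in a small file).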
