I've written a script that looks up a few blogs and sees if a new post has been added. However, when I try to do this on Patreon I cannot find the right element with bs4.
Let's take https://www.patreon.com/cubecoders for example.
Say I want to get the number of exclusive posts under the 'Become a patron to' section, which would be 25 as of now.
This code works just fine:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)
Output: 25
Now, I want to get the title of the newest post, which would be 'New in AMP 2.0.2 - Integrated SCP/SFTP server.' as of now. I inspect the title in my browser and see that it is contained by a span tag with the class 'sc-1di2uql-1 vYcWR'.
However, when I try to run this code I cannot fetch the element:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)
Output: None
I've already tried to fetch the element with XPath or CSS selector but couldn't do it. I thought it might be because the site is rendered first with JavaScript and thus I cannot access the elements before they are rendered correctly. When I use Selenium to render the site first I can see the title when printing out all div tags on the page but when I want to get only the very first title I can't access it.
Do you guys know a workaround maybe? Thanks in advance!
EDIT: In Selenium I can do this:
from selenium import webdriver
browser = webdriver.Chrome("C:\webdrivers\chromedriver.exe")
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")
def find_text(divs):
for div in divs:
for span in div.find_elements_by_tag_name("span"):
if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
return span.text
print(find_text(divs))
browser.close()
Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!
When I just try to search for the spans with class 'sc-1di2uql-1 vYcWR' from the start it won't give me the result though. Could it be that the find_elements method does not look deeper inside for nestled tags?
The data you see is loaded via Ajax from their API. You can use requests
module to load the data.
For example:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
'Accept-Language': 'en-US,en;q=0.5',
'Referer': url
}
with requests.session() as s:
html_text = s.get(url, headers=headers).text
campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
data = s.get(api_url, headers=headers, params={'filter[campaign_id]': campaign_id, 'filter[contains_exclusive_posts]': 'true', 'sort': '-published_at'}).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some information to screen:
for d in data['data']:
print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))
Prints:
New in AMP 2.0.2 - Integrated SCP/SFTP server! 2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal! 2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List 2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system 2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see? 2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation! 2020-05-21T12:19:23.000+00:00
Another day, another video tutorial! 2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes! 2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux 2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist? 2020-05-04T01:14:39.000+00:00
Well that was unexpected... 2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support! 2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features 2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers! 2020-03-11T14:53:31.000+00:00
Preparing for Enterprise 2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here! 2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress! 2020-02-26T17:53:53.000+00:00
Wallpaper! 2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many. 2020-02-06T15:26:09.000+00:00
Time for a new module! 2020-01-07T13:41:17.000+00:00
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.