
How to extract RSS links from website with Python

I am trying to extract all RSS feed links from some websites, provided the site offers RSS at all. Here are some websites that have RSS, followed by a list of RSS links from those websites.

website_links = ["https://www.diepresse.com/", 
"https://www.sueddeutsche.de/", 
"https://www.berliner-zeitung.de/", 
"https://www.aargauerzeitung.ch/", 
"https://www.luzernerzeitung.ch/", 
"https://www.nzz.ch/",
"https://www.spiegel.de/", 
"https://www.blick.ch/",
"https://www.berliner-zeitung.de/", 
"https://www.ostsee-zeitung.de/", 
"https://www.kleinezeitung.at/", 
"https://www.blick.ch/", 
"https://www.ksta.de/", 
"https://www.tagblatt.ch/", 
"https://www.srf.ch/", 
"https://www.derstandard.at/"]


website_rss_links = ["https://www.diepresse.com/rss/Kunst", 
"https://rss.sueddeutsche.de/rss/Kultur", 
"https://www.berliner-zeitung.de/feed.id_kultur-kunst.xml", 
"https://www.aargauerzeitung.ch/leben-kultur.rss", 
"https://www.luzernerzeitung.ch/kultur.rss", 
"https://www.nzz.ch/technologie.rss", 
"https://www.spiegel.de/kultur/literatur/index.rss", 
"https://www.luzernerzeitung.ch/wirtschaft.rss", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://www.berliner-zeitung.de/feed.id_abgeordnetenhauswahl.xml", 
"https://www.ostsee-zeitung.de/arc/outboundfeeds/rss/category/wissen/", 
"https://www.kleinezeitung.at/rss/politik", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://feed.ksta.de/feed/rss/politik/index.rss", 
"https://www.tagblatt.ch/wirtschaft.rss", 
"https://www.srf.ch/news/bnf/rss/1926", 
"https://www.derstandard.at/rss/wirtschaft"]

My approach is to extract all links and then check whether any of them contain "rss", but that is just a first step:

import requests
from bs4 import BeautifulSoup

for url in website_links:
    response = requests.get(url)
    print(response)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Collect the href of every anchor on the page
    list_of_links = [link["href"] for link in soup.select("a[href]")]
    print("Number of links", len(list_of_links))

    # Print every link whose URL contains "rss"
    for l in list_of_links:
        if "rss" in l:
            print(url)
            print(l)
    print()

I have heard that I can look for RSS links like this, but I do not know how to incorporate it into my code.

type=application/rss+xml

My goal is to end up with working RSS URLs. Maybe the problem is that I only request the front page, and I should crawl other pages to extract all RSS links, but I hope there is a faster/better way to extract RSS.

You can see that RSS links contain or end with (for example):

.rss
/rss
/rss/
rss.xml
/feed/
rss-feed

etc.
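The patterns listed above could be captured in a small helper, as a rough sketch. The name looks_like_feed_url and the regular expression are assumptions made for illustration. The check is only a heuristic: it misses feeds such as https://www.berliner-zeitung.de/feed.id_kultur-kunst.xml, and a matching URL still has to be verified.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern built from the examples in the question:
# ".rss" or "rss.xml" at the end, "/rss" or "/feed" as a path segment, or "rss-feed" anywhere.
FEED_PATTERN = re.compile(r"(\.rss$|/rss(/|$)|rss\.xml$|/feed(/|$)|rss-feed)")

def looks_like_feed_url(url):
    """Return True if the path of the URL matches one of the feed patterns."""
    path = urlparse(url).path.lower()
    return bool(FEED_PATTERN.search(path))

if __name__ == "__main__":
    for candidate in ["https://www.luzernerzeitung.ch/kultur.rss",
                      "https://www.diepresse.com/rss/Kunst",
                      "https://www.spiegel.de/kultur/"]:
        print(candidate, looks_like_feed_url(candidate))
```

Matching on the parsed path rather than the whole URL keeps query strings and host names from producing accidental hits.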

Don't reinvent the wheel: there are many curated feed directories and collections that can serve you well and give you a good starting point.

However, to follow your approach, you should first collect all the links on the page that could point to an RSS feed:

soup.select('a[href*="rss"],a[href*="/feed"],a:-soup-contains-own("RSS")')

and then verify whether each link is an actual feed or just a page that lists feeds:

soup.select('[type="application/rss+xml"],a[href*=".rss"]')

or by checking the content-type:

if 'xml' in requests.get(rss).headers.get('content-type', ''):

Note: This only points in a direction, because many patterns are used to mark such feeds (rss, feed, feed/, news, xml, ...) and servers also report the content-type differently.
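A slightly stricter content-type check might look like the sketch below. The helper name is_feed_content_type and the list of media types are assumptions, not an exhaustive standard. Comparing the media type exactly avoids one pitfall of the substring test: 'xml' also occurs in application/xhtml+xml, which is an ordinary web page.

```python
# Media types that commonly indicate a feed (assumed list, extend as needed)
FEED_MEDIA_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
    "text/xml",
}

def is_feed_content_type(header_value):
    """Return True if a Content-Type header value names an XML feed media type."""
    if not header_value:
        # Header missing or empty
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in FEED_MEDIA_TYPES

if __name__ == "__main__":
    print(is_feed_content_type("application/rss+xml; charset=UTF-8"))
    print(is_feed_content_type("application/xhtml+xml"))
```

With requests it could be called as is_feed_content_type(requests.get(rss).headers.get('content-type')).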

Example

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

website_links = ["https://www.diepresse.com/", 
"https://www.sueddeutsche.de/", 
"https://www.berliner-zeitung.de/", 
"https://www.aargauerzeitung.ch/", 
"https://www.luzernerzeitung.ch/", 
"https://www.nzz.ch/technologie/",
"https://www.spiegel.de/", 
"https://www.blick.ch/wirtschaft/"]

rss_feeds = []

def check_for_real_rss(url):
    # Fetch a candidate page and keep only the links that are served as XML
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for e in soup.select('[type="application/rss+xml"],a[href*=".rss"],a[href$="feed"]'):
        # urljoin resolves relative hrefs against the page URL and leaves absolute ones unchanged
        rss = urljoin(url, e.get('href'))
        if 'xml' in requests.get(rss).headers.get('content-type', ''):
            rss_feeds.append(rss)

for url in website_links:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # First pass: collect links that might lead to a feed or a feed overview page
    for e in soup.select('a[href*="rss"],a[href*="/feed"],a:-soup-contains-own("RSS")'):
        check_for_real_rss(urljoin(url, e.get('href')))
set(rss_feeds)

Output

{'https://rss.sueddeutsche.de/app/service/rss/alles/index.rss?output=rss',
 'https://rss.sueddeutsche.de/rss/Topthemen',
 'https://www.aargauerzeitung.ch/aargau/aarau.rss',
 'https://www.aargauerzeitung.ch/aargau/baden.rss',
 'https://www.aargauerzeitung.ch/leben-kultur.rss',
 'https://www.aargauerzeitung.ch/schweiz-welt.rss',
 'https://www.aargauerzeitung.ch/sport.rss',
 'https://www.bzbasel.ch/basel.rss',
 'https://www.grenchnertagblatt.ch/solothurn/grenchen.rss',
 'https://www.jetzt.de/alle_artikel.rss',
 'https://www.limmattalerzeitung.ch/limmattal.rss',
 'https://www.luzernerzeitung.ch/international.rss',
 'https://www.luzernerzeitung.ch/kultur.rss',
 'https://www.luzernerzeitung.ch/leben.rss',
 'https://www.luzernerzeitung.ch/leben/ratgeber.rss',...}
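Because servers report the content-type inconsistently, another option is to look at the downloaded document itself. The sketch below uses the standard library's xml.etree.ElementTree, and the helper name is_feed_document is an assumption. It treats a document as a feed if it is well-formed XML whose root element is rss (RSS 2.0), feed (Atom) or RDF (RSS 1.0).

```python
import xml.etree.ElementTree as ET

def is_feed_document(text):
    """Return True if text is well-formed XML with an rss, feed or RDF root element."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML, for example an ordinary HTML page
        return False
    # Strip a namespace prefix such as "{http://www.w3.org/2005/Atom}feed"
    tag = root.tag.rsplit("}", 1)[-1].lower()
    return tag in {"rss", "feed", "rdf"}

if __name__ == "__main__":
    sample = '<rss version="2.0"><channel><title>Demo</title></channel></rss>'
    print(is_feed_document(sample))
```

With requests, passing response.content (bytes) lets the parser honour the encoding declared in the XML itself.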

You can use BeautifulSoup to extract the RSS links:

import requests
from bs4 import BeautifulSoup

# Send an HTTP request to the website
response = requests.get("https://www.sueddeutsche.de/")

# Check the status code of the response
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all elements that contain RSS links
    links = soup.find_all("link", type="application/rss+xml")

    # Extract the RSS links from the elements and print them
    for link in links:
        rss_link = link["href"]
        print(rss_link)

Be aware that some websites block scrapers, so a request may fail or return a page without the feed links.

Search for links with type="application/rss+xml", like:

<link href="/feeds" rel="alternate" title="RSS feed" type="application/rss+xml">

<link rel="alternate" type="application/rss+xml" title="DER SPIEGEL | RSS Schlagzeilen" href="https://www.spiegel.de/schlagzeilen/index.rss">

<link rel="alternate" type="application/rss+xml" title="DER SPIEGEL | RSS Nachrichten" href="https://www.spiegel.de/index.rss">
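Tags like these can be collected and their href values resolved to absolute URLs. The sketch below deliberately uses only the standard library (html.parser and urllib.parse) instead of BeautifulSoup, so it runs without extra installs. The names FeedLinkParser and find_feed_links and the base URL https://www.example.com/news/ are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class FeedLinkParser(HTMLParser):
    """Collect the href of every <link type="application/rss+xml"> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        link_type = (attrs.get("type") or "").lower()
        if tag == "link" and link_type == "application/rss+xml":
            href = attrs.get("href")
            if href:
                # Resolve relative hrefs such as "/feeds" against the page URL
                self.feeds.append(urljoin(self.base_url, href))

def find_feed_links(html, base_url):
    """Return absolute feed URLs announced in the given HTML text."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds

if __name__ == "__main__":
    page = ('<link href="/feeds" rel="alternate" title="RSS feed" type="application/rss+xml">'
            '<link rel="alternate" type="application/rss+xml" '
            'href="https://www.spiegel.de/index.rss">')
    print(find_feed_links(page, "https://www.example.com/news/"))
```

Absolute hrefs pass through urljoin unchanged, so the same call handles both kinds of link.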
