
Building blogs RSS feeds using Django (Python)

As titled, I am trying to build a small application that will aggregate RSS from different blogs. I am testing out feedparser for this, but I am stuck trying to write a piece of code that would detect the RSS feed.

Most people would just enter www.mysite.com/blog, which is not exactly the URL of the RSS feed. Is there a way for me to detect the RSS feed? I am trying to replicate the browser behavior, where it can see the RSS URL.

Any ideas?

Browsers use RSS feed auto-discovery and Atom feed auto-discovery to find feeds on a given web page.

For example, the question lists are available via an Atom feed, which is linked in the HTML head of the associated pages with:

<link rel="alternate" type="application/atom+xml" title="Feed of questions tagged python" href="/feeds/tag/python" />

You'll need to parse out the <link rel="alternate"> tags in a given page to discover these; anything with an application/atom+xml or application/rss+xml type fits.
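To make the auto-discovery step concrete, here is a minimal sketch that uses only the standard library's html.parser. The names FeedLinkParser, find_feed_links and sample_html are made up for this illustration, and the sample page just reuses the <link> tag quoted above; treat it as one possible approach, not the way any browser actually implements it.

```python
from html.parser import HTMLParser

# MIME types that mark a <link rel="alternate"> tag as a feed
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects (type, href) pairs from feed auto-discovery <link> tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also end up here.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower().strip()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            self.feeds.append((mime, href))


def find_feed_links(html):
    """Return the feed links advertised in an HTML document (hrefs as written)."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds


# Hypothetical page containing the <link> tag from the example above
sample_html = """
<html><head>
<link rel="stylesheet" type="text/css" href="/style.css" />
<link rel="alternate" type="application/atom+xml"
      title="Feed of questions tagged python" href="/feeds/tag/python" />
</head><body></body></html>
"""

print(find_feed_links(sample_html))
```

Note that the href is returned exactly as it appears in the page, so a relative value like /feeds/tag/python still has to be resolved against the page URL before it can be fetched.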

Use something like BeautifulSoup to parse the HTML document and look for the RSS feeds. The following is a basic example and not necessarily the most efficient:

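from bs4 import BeautifulSoup

# html_doc is assumed to hold the HTML source of the page the user entered
soup = BeautifulSoup(html_doc, "html.parser")

# Match both RSS and Atom auto-discovery links
feed_links = soup.select('link[type="application/rss+xml"], link[type="application/atom+xml"]')
feed_urls = [link.get("href") for link in feed_links]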
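One thing to watch for: the href values found this way are often relative (like /feeds/tag/python), so they need to be resolved against the URL of the page they came from. A small sketch using urllib.parse.urljoin from the standard library; resolve_feed_urls, page_url and the hrefs list are invented for illustration, and page_url is assumed to already include a scheme such as http://, which a bare www.mysite.com/blog typed by a user would not.

```python
from urllib.parse import urljoin


def resolve_feed_urls(page_url, hrefs):
    """Turn href values found on a page into absolute feed URLs."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical inputs: the page the user entered and the hrefs found on it
page_url = "http://www.mysite.com/blog"
hrefs = [
    "/feeds/tag/python",                  # root-relative
    "feed.xml",                           # relative to the page's directory
    "http://feeds.example.com/blog.rss",  # already absolute, left unchanged
]

for url in resolve_feed_urls(page_url, hrefs):
    print(url)
```

The resulting absolute URLs are what you would then hand to feedparser.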

See the full BeautifulSoup documentation.

There is a great app exactly for this, called Feedjack.

But you will find yourself banging your head against the wall when the RSS feed contains fewer than 100 characters.

For full control (aggregating exactly what you need) and for websites without any RSS feeds, I would recommend Scrapy.
