Extracting content from an RSS feed using Python regex

Question

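I am trying to use regular expressions, specifically the re module, to extract the title, date and description from an RSS feed. So far I have the following code: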

    import re

    # html_code is assumed to already hold the raw text of the RSS feed
    titles = re.findall(r'<title>(.*?)</title>', html_code)
    descriptions = re.findall(r'<description>(.*?)</description>', html_code)
    dates = re.findall(r'<pubDate>(.*?)</pubDate>', html_code)

    # Skip the feed-level title, which contains the site name
    for title in titles:
        if 'The Guardian' not in title:
            print("Headline:", title)
            print()

    # Skip the feed-level description
    for description in descriptions:
        if 'Latest news and features from theguardian.com' not in description:
            print("Description:", description)
            print()

    for date in dates:
        print("Date:", date)
        print()

This code gives the following output:

Headline: Tim Bresnan denies involvement in Kevin Pietersen parody Twitter account

Description: I 100% did NOT have any password, and wasnt involved&lt;br /&gt; ECB confirms Alec Stewart reported incident in 2012 &lt;br /&gt;&lt;a href="http://www.theguardian.com/sport/2014/oct/08/kevin-pietersen-parody-twitter-account-author-denies-england-players-involved" title=""&gt; Twitter account author denies players were involved&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.theguardian.com/sport/blog/2014/oct/08/ecb-england-cricket-kevin-pietersen-tom-harrison" title=""&gt; Owen Gibson: ECB at crossroads amid fallout&lt;/a&gt;&lt;p&gt;Tim Bresnan has denied having any involvement in the controversial @KPgenius Twitter account after Kevin Pietersens autobiography claimed his former England team-mates were behind it.&lt;/p&gt;&lt;p&gt;In his book, Pietersen revealed the extent to which the account had angered and upset him, and claimed that the accounts author had told the former England wicketkeeper Alec Stewart that some of the guys in the dressing room are tweeting from it.&lt;/p&gt;&lt;p&gt;Disappointed to be implicated in the &lt;a href="https://twitter.com/hashtag/kpgenius?src=hash"&gt;#kpgenius&lt;/a&gt; account. I 100% did NOT have any password. And wasn't involved In any posting.&lt;/p&gt; &lt;a href="http://www.theguardian.com/sport/2014/oct/09/tim-bresnan-kevin-pietersen-parody-twitter"&gt;Continue reading...&lt;/a&gt;           

Date: Thu, 09 Oct 2014 11:56:43 GMT

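These results are printed for every news article. My question is: how can I clean up the description and remove all the HTML junk? I only need the basic information, without any of the tags. How can I use regular expressions to remove things such as the links and "&lt;/p&gt;"? Thanks.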

Answer 1

You can use str.replace() to replace the special HTML characters with whatever replacement you need.
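As a rough sketch of that idea (not necessarily what the answerer had in mind), the snippet below first turns the escaped entities back into literal characters with str.replace(), then strips the resulting tags with re.sub(), since the question asked for a regex. The names ENTITY_REPLACEMENTS, clean_description and sample are made up for illustration, and the entity list is an assumption that only covers the entities visible in the output above.

```python
import re

# Hypothetical list of escaped entities and their literal characters;
# extend it with whatever else shows up in your feed.
ENTITY_REPLACEMENTS = [
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&quot;", '"'),
    ("&amp;", "&"),  # handled last so "&amp;lt;" is not unescaped twice
]


def clean_description(raw):
    """Unescape entities with str.replace(), then strip tags with a regex."""
    text = raw
    for entity, char in ENTITY_REPLACEMENTS:
        text = text.replace(entity, char)
    # Turn line breaks and paragraph ends into spaces so words do not run together
    text = re.sub(r"<\s*(br|/p)\s*/?\s*>", " ", text)
    # Drop every remaining tag, keeping the text between the tags
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


# Shortened, made-up sample in the same shape as the feed output above
sample = ('I 100% did NOT have any password&lt;br /&gt;'
          '&lt;a href="http://example.com/story" title=""&gt; Continue reading...&lt;/a&gt;')
print(clean_description(sample))
```

On Python 3, html.unescape() from the standard library can stand in for the manual replacement list. A tag-stripping regex like this is only adequate for simple markup such as the descriptions shown here; it will misbehave if an attribute value contains a ">" character.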

-1 2015-10-08 20:59:34
