
Exceptions when parsing DateTime values in an RSS feed using SyndicationFeed in C#

I'm trying to parse RSS 2.0 and Atom feeds using SyndicationFeed objects, but I'm getting XmlExceptions while parsing DateTime fields such as pubDate.

2012-01-17 08:01:06

public static List<SyndicationItem> getRssData(string url)
{
    // Dispose the reader when done. SyndicationFeed.Load is the call that
    // throws the XmlException when it reaches the pubDate element.
    using (XmlReader reader = XmlReader.Create(url))
    {
        SyndicationFeed feed = SyndicationFeed.Load(reader);
        return feed.Items.ToList();
    }
}

The feed URL is http://news.163.com/special/00011K6L/rss_newstop.xml

<item id="2">
    <title>...</title>
    <link>...</link>
    <description>......</description>
    <pubDate>2012-01-17 12:09:29</pubDate> <!-- exception thrown here -->
</item>

Is there a better way to achieve this? Please help. Thanks.

There is a workaround, described under the title "RSS20FeedFormatter throws exception trying to read some DateTime formats".

To work around this problem, create a custom XML reader that recognizes different date formats. The following is an example of a custom XML reader:

XmlReader r = new MyXmlReader(url);
SyndicationFeed feed = SyndicationFeed.Load(r);
Rss20FeedFormatter rssFormatter = feed.GetRss20Formatter();
XmlTextWriter rssWriter = new XmlTextWriter("rss.xml", Encoding.UTF8);
rssWriter.Formatting = Formatting.Indented;
rssFormatter.WriteTo(rssWriter);
rssWriter.Close();

And here is the MyXmlReader class used in the code above:

class MyXmlReader : XmlTextReader
{
    private bool readingDate = false;
    const string CustomUtcDateTimeFormat = "ddd MMM dd HH:mm:ss Z yyyy"; // Wed Oct 07 08:00:07 GMT 2009

    public MyXmlReader(Stream s) : base(s) { }

    public MyXmlReader(string inputUri) : base(inputUri) { }

    public override void ReadStartElement()
    {
        if (string.Equals(base.NamespaceURI, string.Empty, StringComparison.InvariantCultureIgnoreCase) &&
            (string.Equals(base.LocalName, "lastBuildDate", StringComparison.InvariantCultureIgnoreCase) ||
            string.Equals(base.LocalName, "pubDate", StringComparison.InvariantCultureIgnoreCase)))
        {
            readingDate = true;
        }
        base.ReadStartElement();
    }

    public override void ReadEndElement()
    {
        if (readingDate)
        {
            readingDate = false;
        }
        base.ReadEndElement();
    }

    public override string ReadString()
    {
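        if (readingDate)
        {
            string dateString = base.ReadString();
            DateTime dt;
            // Try the general parser first; fall back to the custom format
            // for strings such as "Wed Oct 07 08:00:07 GMT 2009".
            if (!DateTime.TryParse(dateString, out dt))
                dt = DateTime.ParseExact(dateString, CustomUtcDateTimeFormat, CultureInfo.InvariantCulture);
            // Re-emit the date in RFC 1123 format ("R"), which the RSS formatter can read.
            return dt.ToUniversalTime().ToString("R", CultureInfo.InvariantCulture);
        }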
        else
        {
            return base.ReadString();
        }
    }
}
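
As an alternative to subclassing the reader, the date elements could be rewritten before the feed is handed to SyndicationFeed. The following is only a rough sketch of that different technique, not part of the workaround above. The names FeedDatePreprocessor, NormalizeDates and LoadItems are hypothetical, values without a time zone are assumed to be UTC, and the project is assumed to reference System.ServiceModel (or the System.ServiceModel.Syndication package).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.ServiceModel.Syndication;
using System.Xml;
using System.Xml.Linq;

public static class FeedDatePreprocessor
{
    // Hypothetical helper: rewrites every un-namespaced pubDate/lastBuildDate
    // value that DateTimeOffset can parse into RFC 1123 form ("R").
    public static void NormalizeDates(XDocument doc)
    {
        // Materialize the matches first so the tree is not modified while it is enumerated.
        List<XElement> dateElements = doc.Descendants()
            .Where(e => e.Name == "pubDate" || e.Name == "lastBuildDate")
            .ToList();

        foreach (XElement element in dateElements)
        {
            DateTimeOffset parsed;
            // Assumption: a value with no zone information is treated as UTC.
            if (DateTimeOffset.TryParse(element.Value.Trim(), CultureInfo.InvariantCulture,
                    DateTimeStyles.AssumeUniversal, out parsed))
            {
                // "R" converts the value to UTC and appends "GMT".
                element.Value = parsed.ToString("R", CultureInfo.InvariantCulture);
            }
            // Values that cannot be parsed are left untouched.
        }
    }

    // Loads the feed into memory, normalizes its dates, then parses it.
    public static List<SyndicationItem> LoadItems(string url)
    {
        XDocument doc = XDocument.Load(url);
        NormalizeDates(doc);
        using (XmlReader reader = doc.CreateReader())
        {
            return SyndicationFeed.Load(reader).Items.ToList();
        }
    }
}
```

This keeps the whole document in memory, so it suits small feeds like the one in the question better than very large ones.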

Basically, that RSS feed is invalid. If you look at the RSS 2.0 specification it states that:

All date-times in RSS conform to the Date and Time Specification of RFC 822, with the exception that the year may be expressed with two characters or four characters (four preferred).

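The string "2012-01-17 12:09:29" doesn't comply with the "Date and Time" part of RFC 822. A compliant value uses a three-letter month name and an explicit time zone, in a form such as "Tue, 17 Jan 2012 12:09:29 GMT" (the leading day name is optional).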
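
To make the difference concrete, here is a minimal sketch that converts the feed's timestamp into that form using .NET's "R" (RFC 1123) format specifier. The class and method names are hypothetical, and interpreting the zone-less input as UTC is an assumption, since the feed does not say which time zone it means.

```csharp
using System;
using System.Globalization;

public static class FeedDateSketch
{
    // Hypothetical helper: parses "yyyy-MM-dd HH:mm:ss" and re-emits it in the
    // RFC 1123 pattern, i.e. the RFC 822 layout with a four-digit year.
    public static string ToRfc822(string feedDate)
    {
        // Assumption: the input carries no zone, so it is read as UTC.
        DateTime utc = DateTime.ParseExact(
            feedDate,
            "yyyy-MM-dd HH:mm:ss",
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);

        return utc.ToString("R", CultureInfo.InvariantCulture);
    }
}
```

With these assumptions, FeedDateSketch.ToRfc822("2012-01-17 12:09:29") returns "Tue, 17 Jan 2012 12:09:29 GMT".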


 