How do I read a secure RSS feed into a SyndicationFeed without providing credentials?

For whatever reason, IBM uses https (without requiring credentials) for their RSS feeds. I'm trying to consume https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en with a .NET 4 SyndicationFeed. I can open this feed in a browser and it loads just fine. Here's the code:

        using (XmlReader xml = XmlReader.Create("https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en"))
        {
            var items = from item in SyndicationFeed.Load(xml).Items
                        select item;
        }

Here's the exception:

System.Net.WebException was unhandled by user code
Message=The remote server returned an error: (500) Internal Server Error.
Source=System
StackTrace:
   at System.Net.HttpWebRequest.GetResponse()
   at System.Xml.XmlDownloadManager.GetNonFileStream(Uri uri, ICredentials credentials, IWebProxy proxy, RequestCachePolicy cachePolicy)
   at System.Xml.XmlDownloadManager.GetStream(Uri uri, ICredentials credentials, IWebProxy proxy, RequestCachePolicy cachePolicy)
   at System.Xml.XmlUrlResolver.GetEntity(Uri absoluteUri, String role, Type ofObjectToReturn)
   at System.Xml.XmlReaderSettings.CreateReader(String inputUri, XmlParserContext inputContext)
   at System.Xml.XmlReader.Create(String inputUri, XmlReaderSettings settings, XmlParserContext inputContext)
   at System.Xml.XmlReader.Create(String inputUri)
   at EDN.Util.Test.FeedAggTest.LoadFeedInfoTest() in D:\cdn\trunk\CDN\Dev\Shared\net\EDN.Util\EDN.Util.Test\FeedAggTest.cs:line 126

How do I configure the reader to work with an https feed?

I don't think it has anything to do with security. A 500 error is a server-side error. Something in the request generated by XmlReader.Create(url) is confusing the IBM website. If it were simply a security issue, as suggested in your question, you'd expect a 401 (Unauthorized) or 403 (Forbidden) error. But you got 500, which is an application error.
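
To see what the server actually sent back along with the 500, one option is to catch the WebException and look at its Response. This is only a minimal sketch and not part of the original investigation; the class name StatusProbe and the methods Classify and Probe are hypothetical names picked for illustration.

```csharp
using System;
using System.Net;

public static class StatusProbe
{
    // Rough classification: 401/403 point at authentication or authorization,
    // 5xx at the server-side application, other 4xx at the request itself.
    public static string Classify(HttpStatusCode code)
    {
        int n = (int)code;
        if (n == 401 || n == 403) return "auth";
        if (n >= 500 && n <= 599) return "server";
        if (n >= 400) return "client";
        return "ok-or-redirect";
    }

    // Sends a bare GET, with no extra request headers, and reports the outcome.
    public static string Probe(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return (int)response.StatusCode + " " + Classify(response.StatusCode);
            }
        }
        catch (WebException ex)
        {
            // For HTTP-level failures the response is still attached to the exception.
            var failed = ex.Response as HttpWebResponse;
            if (failed == null) return "no HTTP response: " + ex.Status;
            using (failed)
            {
                // Response headers such as Set-Cookie are readable even on an error response.
                string setCookie = failed.Headers[HttpResponseHeader.SetCookie];
                return (int)failed.StatusCode + " " + Classify(failed.StatusCode)
                     + (setCookie == null ? "" : " (Set-Cookie present)");
            }
        }
    }
}
```

Calling StatusProbe.Probe(url) with the feed URL from the question would report the status class instead of letting the exception escape, which makes it easier to tell an authorization failure from an application error.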

Even so, maybe there's something the client app can do to avoid confusing the server.

I looked at the outgoing HTTP request headers, using Fiddler. For a request generated by IE, the headers look like this:

GET https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en HTTP/1.1
Accept: image/gif, image/jpeg, image/pjpeg, application/x-ms-application, application/vnd.ms-xpsdocument, application/xaml+xml, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-silverlight, application/x-shockwave-flash, application/x-silverlight-2-b2, */*
Accept-Language: en-us
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; .NET CLR 3.5.30729;)
Accept-Encoding: gzip, deflate
Host: www.ibm.com
Connection: Keep-Alive
Cookie: UnicaNIODID=Ww06gyvyPpZ-WPl6K7y; conxnsCookie=en; IBMPOLLCOOKIE=""; UnicaNIODID=QridYHCNf7M-WYM8Usr

For a request from XmlReader.Create(url), the headers look like this:

GET https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en HTTP/1.1
Host: www.ibm.com
Connection: Keep-Alive

Quite a difference. Also, the 500 response to the latter carried a Set-Cookie header, which wasn't present in the response to IE.

Based on that I theorized that it was the difference in request headers, in particular the cookie, that was confusing ibm.com.


I don't know how to convince XmlReader.Create() to embed all the request headers I wanted, including the cookie. But I know how to do that with an HttpWebRequest. So I used that.

There were a few hurdles I had to clear.

  1. I needed the persistent cookie for ibm.com. For that I had to resort to a p/invoke of the Win32 InternetGetCookie. See the PersistentCookies class attached in the user-contributed content at the bottom of the doc page for WebRequest, for how to do that. After attaching the cookie, I was no longer getting 500 errors. Hooray!

  2. But the resulting stream could not be read by XmlReader.Create(). It looked binary to me. I realized I needed to decompress the gzip or deflate content. For that I set the AutomaticDecompression property on HttpWebRequest, rather than wrapping a GZipStream or DeflateStream around the received response stream by hand. I could have avoided the need for this by not including "gzip, deflate" on the Accept-Encoding header in the outbound request. Actually, after setting the AutomaticDecompression property, that header is set implicitly in the outbound HTTP request.

  3. When I did that, I got actual text. But some of the characters came out wrong. Next I needed to use the proper text encoding in the TextReader, as indicated in the HttpWebResponse.

  4. After doing that, I got a sensible string, but the resulting decompressed rss stream caused the XmlReader to choke, with

    ReadElementString method can only be called on elements with simple or empty content. Line 11, position 25.

    I looked and found a small <script> block, at that location, within the <copyright> element in the rss document. It seems IBM is trying to get the browser to "localize" the copyright date by attaching logic that would run in the browser to format the date. Seems like overkill to me, or even a bug by IBM. But because the angle bracket within the text node of an element bothered the XmlReader, I removed the script block with a Regex replace.
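
For the cookie step in item 1, the PersistentCookies class itself isn't reproduced here. The following is a rough sketch of what such a helper might look like, assuming the commonly used p/invoke signature for InternetGetCookie in wininet.dll. The class name WinInetCookies and the methods Parse and ForUrl are hypothetical stand-ins, not the class from the WebRequest doc page, and the native call only works on Windows.

```csharp
using System;
using System.Net;
using System.Runtime.InteropServices;
using System.Text;

public static class WinInetCookies
{
    // Win32 API from wininet.dll (Windows only). Returns the cookies that
    // WinINet / Internet Explorer has stored for the given URL.
    [DllImport("wininet.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool InternetGetCookie(
        string url, string cookieName, StringBuilder cookieData, ref int size);

    // WinINet hands back "a=1; b=2", while CookieContainer.SetCookies expects
    // commas between cookies. Assumption: cookie values contain no ';' or ','.
    public static CookieContainer Parse(Uri uri, string wininetCookies)
    {
        var container = new CookieContainer();
        if (!string.IsNullOrEmpty(wininetCookies))
            container.SetCookies(uri, wininetCookies.Replace(';', ','));
        return container;
    }

    // Reads the stored cookies for a URL; returns an empty container if none are found.
    public static CookieContainer ForUrl(string url)
    {
        var uri = new Uri(url);
        int size = 256;
        var data = new StringBuilder(size);
        if (!InternetGetCookie(url, null, data, ref size))
        {
            if (size <= 0) return new CookieContainer();  // no usable size reported
            data = new StringBuilder(size);                // retry with the reported size
            if (!InternetGetCookie(url, null, data, ref size))
                return new CookieContainer();
        }
        return Parse(uri, data.ToString());
    }
}
```

With something like this in place, the request would get its cookies via hwr.CookieContainer = WinInetCookies.ForUrl(url). Keeping Parse separate from the native call means the string handling can be checked on any platform.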


After clearing those hurdles, it worked. The .NET app was able to read the RSS stream from that https url.

I didn't do any further testing - to see if varying the Accept header or the Accept-Encoding header would change the behavior. That's for you to figure out, if you care.

The resulting code is below. It's much uglier than your simple 3-liner. I don't know how to make it any simpler.

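// Namespaces needed: System, System.IO, System.Linq, System.Net, System.Text,
// System.Text.RegularExpressions, System.Xml, System.ServiceModel.Syndication,
// plus the PersistentCookies helper class mentioned in hurdle 1.
public void Run()
{
    string url = "https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en";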

    HttpWebRequest hwr = (HttpWebRequest) WebRequest.Create(url);
    // attach persistent cookies
    hwr.CookieContainer =
        PersistentCookies.GetCookieContainerForUrl(url);
    hwr.Accept = "text/xml, */*";
    hwr.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-us");
    hwr.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; .NET CLR 3.5.30729;)";
    hwr.KeepAlive = true;
    hwr.AutomaticDecompression = DecompressionMethods.Deflate |
                                 DecompressionMethods.GZip;

    using (var resp = (HttpWebResponse) hwr.GetResponse())
    {
        using(Stream s = resp.GetResponseStream())
        {            
            string cs = String.IsNullOrEmpty(resp.CharacterSet) ? "UTF-8" : resp.CharacterSet;
            Encoding e = Encoding.GetEncoding(cs);

            using (StreamReader sr = new StreamReader(s, e))
            {
                var allXml = sr.ReadToEnd();

                // remove any script blocks - they confuse XmlReader
                // (a non-greedy match per block, so every block is removed, not only the last one)
                allXml = Regex.Replace( allXml,
                                        @"<script\b[^>]*>.*?</script>",
                                        "",
                                        RegexOptions.Singleline | RegexOptions.IgnoreCase);

                using (XmlReader xmlr = XmlReader.Create(new StringReader(allXml)))
                {
                    var items = from item in SyndicationFeed.Load(xmlr).Items
                        select item;
                }
            }
        }
    }
}
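
The two text-handling steps, picking the encoding and stripping script blocks, can be tried without touching the network. Below is a small sketch along those lines; FeedText, PickEncoding and StripScripts are hypothetical names, and the fallback to UTF-8 for an unrecognized charset is my assumption rather than something the server's behavior dictates.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class FeedText
{
    // Falls back to UTF-8 when the charset is missing or not known to the runtime
    // (Encoding.GetEncoding throws ArgumentException for unknown names).
    public static Encoding PickEncoding(string charset)
    {
        if (string.IsNullOrEmpty(charset)) return Encoding.UTF8;
        try { return Encoding.GetEncoding(charset); }
        catch (ArgumentException) { return Encoding.UTF8; }
    }

    // Removes every <script ...>...</script> block, whatever its attributes.
    // The non-greedy ".*?" keeps each match confined to a single block.
    public static string StripScripts(string xml)
    {
        return Regex.Replace(xml, @"<script\b[^>]*>.*?</script>", "",
                             RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```

For instance, passing a string such as "<copyright>Copyright <script type='text/javascript'>document.write(1);</script> IBM</copyright>" through FeedText.StripScripts leaves a <copyright> element with plain text content, which is the shape ReadElementString expects.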
