
Why does my WebClient return a 404 error most of the time, but not always?

I want to get information about a Microsoft Update in my program. However, the server returns a 404 error about 80% of the time. I boiled the problematic code down to this console application:

using System;
using System.Net;

namespace WebBug
{
    class Program
    {
        static void Main(string[] args)
        {
            while (true)
            {
                try
                {
                    WebClient client = new WebClient();
                    Console.WriteLine(client.DownloadString("https://support.microsoft.com/api/content/kb/3068708"));
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.Message);
                }
                Console.ReadKey();
            }
        }
    }
}

When I run the code, I have to go through the loop a few times before I get an actual response:

The remote server returned an error: (404) Not Found.
The remote server returned an error: (404) Not Found.
The remote server returned an error: (404) Not Found.
<div kb-title title="Update for customer experience and diagnostic telemetry [...]

I can open and force-refresh (Ctrl + F5) the link in my browser as often as I want, and it loads fine every time.

The problem occurs on two different machines with two different internet connections.
I've also tested this case using the Html Agility Pack, with the same result.
The problem does not occur with other websites. (The root https://support.microsoft.com works fine 100% of the time.)

Why do I get this weird result?

Cookies. It's because of cookies.

As I started to dig into this problem I noticed that the first time I opened the site in a new browser I got a 404, but after refreshing (sometimes once, sometimes a few times) the site continued to work.

That's when I busted out Chrome's Incognito mode and the developer tools.

There wasn't anything too fishy with the network: there was a simple redirect to the https version if you loaded http.

But what I did notice was that the cookies changed. This is what I saw the first time I loaded the page:

[Screenshot: the cookies set on the first page load]

and here are the cookies after one (or a few) refreshes:

[Screenshot: additional cookies present after refreshing]

Notice how a few more cookie entries got added? The site must be trying to read those, not finding them, and "blocking" you. This might be a bot-prevention measure or just bad programming; I'm not sure.

Anyway, here's how to make your code work. This example uses HttpWebRequest/HttpWebResponse instead of WebClient.

// Requires: using System.Net; and using System.IO;

string url = "https://support.microsoft.com/api/content/kb/3068708";

// This container holds all the cookies we need to add.
// Notice the values match the ones in the screenshots above.
CookieContainer cookieJar = new CookieContainer();
cookieJar.Add(new Cookie("SMCsiteDir", "ltr", "/", ".support.microsoft.com"));
cookieJar.Add(new Cookie("SMCsiteLang", "en-US", "/", ".support.microsoft.com"));
cookieJar.Add(new Cookie("smc_f", "upr", "/", ".support.microsoft.com"));
cookieJar.Add(new Cookie("smcexpsessionticket", "100", "/", ".microsoft.com"));
cookieJar.Add(new Cookie("smcexpticket", "100", "/", ".microsoft.com"));
cookieJar.Add(new Cookie("smcflighting", "wwp", "/", ".microsoft.com"));

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
// Attach the cookie container to the request.
request.CookieContainer = cookieJar;

// Now go out to the internet and fetch back the contents,
// disposing the response when we're done with it.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
    string site = sr.ReadToEnd();
}

If you remove the request.CookieContainer = cookieJar; line, the request fails with a 404, which reproduces your issue.
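
If you'd rather keep the WebClient API from your original code, one option is to subclass it so that every request it makes carries a CookieContainer. This is a minimal sketch of a common pattern, not part of the original fix; the class and property names (CookieAwareWebClient, CookieJar) are my own:

using System;
using System.Net;

// A WebClient that attaches a shared CookieContainer to every request,
// so the cookies above can be sent with DownloadString.
public class CookieAwareWebClient : WebClient
{
    // Holds the cookies for every request this client makes.
    public CookieContainer CookieJar { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            // Attach the shared cookie container to each HTTP request.
            httpRequest.CookieContainer = CookieJar;
        }
        return request;
    }
}

You would then fill client.CookieJar with the same six cookies as above and call client.DownloadString(url) exactly as in the question.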

Most of the legwork for the code example came from this post and this post.
