C# NET.WebClient DownloadString() Issue - Page redirects

Question

I am writing a simple web spider, and it works well so far. The problem is that the site I am crawling has a nasty habit of redirecting, or of appending to the address. On some pages it adds "/about" after they load, and on others it redirects to a completely different page. The WebClient gets confused: it downloads the HTML and I start parsing the links, but many of them are relative, in the form "../../something", so they get resolved against the address I originally requested (before the redirect or the added "/about") and come out wrong. When one of those wrongly built addresses comes out of the queue, the download throws a 404 Not Found exception (surprise), and the spider crashes after a while.

Now, I could just add "/about" to every page myself but, just to keep things interesting, the website doesn't always add it...

I would appreciate any ideas. Thank you for your time and all best!

Answer 1

If you want to get the redirected URI of a page for parsing the links inside it, use a subclass of WebClient like this:

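using System;
using System.Net;

class MyWebClient : WebClient
{
    // URI that actually served the last response, i.e. after any redirects.
    private Uri _responseUri;

    public Uri ResponseUri
    {
        get { return _responseUri; }
    }

    // Runs for synchronous calls such as DownloadString().
    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);
        _responseUri = response.ResponseUri;
        return response;
    }
}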

Now use MyWebClient instead of WebClient, and resolve the relative links against ResponseUri rather than against the address you originally requested.
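
A minimal sketch of how that could look in the spider, assuming the links have already been pulled out of the HTML as plain href strings. LinkResolver, SpiderExample, and the extractHrefs delegate are hypothetical names standing in for whatever the spider already has; the subclass is repeated in compact form only so the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Same idea as the subclass above, repeated so this sketch is self-contained.
class MyWebClient : WebClient
{
    public Uri ResponseUri { get; private set; }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);
        ResponseUri = response.ResponseUri;
        return response;
    }
}

// Hypothetical helper: turns hrefs into absolute URIs relative to a base URI.
static class LinkResolver
{
    public static List<Uri> Resolve(Uri baseUri, IEnumerable<string> hrefs)
    {
        var result = new List<Uri>();
        foreach (string href in hrefs)
        {
            Uri absolute;
            // Uri.TryCreate(Uri, string, out Uri) applies "../" segments
            // against baseUri; hrefs it cannot combine are skipped.
            if (Uri.TryCreate(baseUri, href, out absolute))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}

// Hypothetical glue code: download a page, then resolve its links
// against the URI the server finally answered from.
static class SpiderExample
{
    public static List<Uri> DownloadAndResolve(
        string url, Func<string, IEnumerable<string>> extractHrefs)
    {
        using (var client = new MyWebClient())
        {
            string html = client.DownloadString(url);
            // ResponseUri is set once the response has arrived, so read it
            // after DownloadString() returns.
            return LinkResolver.Resolve(client.ResponseUri, extractHrefs(html));
        }
    }
}
```

The base URI is what makes the difference: "../../x" resolved against http://example.com/a/b/c/about gives http://example.com/a/x, while the same link resolved against http://example.com/a/b/c gives http://example.com/x, which is the kind of address that ends in a 404.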

solution1
4 ACCEPTED 2013-03-15 09:33:10
