
Getting redirected URL in Apache HttpComponents

I'm using Apache HttpComponents to GET web pages for a set of crawled URLs. Many of those URLs redirect to different URLs (e.g. because they have been processed with a URL shortener). In addition to downloading the content, I would like to resolve the final URL (i.e. the URL that actually served the downloaded content), or even better, all URLs in the redirect chain.

I have been looking through the API docs, but I could not find a place to hook in. Any hints would be greatly appreciated.

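One way is to turn off automatic redirect handling by setting the relevant parameter, then handle 3xx responses yourself and manually extract the redirect location from the response's "Location" header.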
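The manual loop described above can be sketched roughly as follows. This is only an illustration using JDK classes: the `Hop` type, the `ManualRedirects` class and the `fetch` function are hypothetical stand-ins for a real HttpComponents request made with automatic redirects disabled, and the set of status codes treated as redirects is an assumption.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class ManualRedirects {
    /** Hypothetical minimal view of a response: status code plus raw Location header (may be null). */
    static final class Hop {
        final int status;
        final String location;

        Hop(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /** Status codes treated as redirects in this sketch. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /**
     * Follows redirects by hand and returns every URI visited, in request order.
     * fetch is an assumption: it should issue one request WITHOUT following redirects.
     */
    static List<URI> follow(URI start, Function<URI, Hop> fetch, int maxRedirects) {
        List<URI> chain = new ArrayList<>();
        URI current = start;
        chain.add(current);
        for (int i = 0; i < maxRedirects; i++) {
            Hop hop = fetch.apply(current);
            if (!isRedirect(hop.status) || hop.location == null) {
                break; // final response reached (or a redirect without a Location header)
            }
            // Location may be relative, so resolve it against the current URI.
            current = current.resolve(hop.location);
            chain.add(current);
        }
        return chain;
    }
}
```

The last element of the returned list is the URL that served the content; the whole list is the redirect chain.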

Here's a full demo of how to do it using Apache HttpComponents.

Important Details

You'll need to extend DefaultRedirectStrategy like so:

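import java.net.URI;
import java.util.Deque;
import java.util.LinkedList;

import org.apache.http.HttpRequest;
import org.apache.http.HttpResponse;
import org.apache.http.ProtocolException;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.protocol.HttpContext;

class SpyStrategy extends DefaultRedirectStrategy {
    // push() adds at the head, so the most recent URI is always first.
    public final Deque<URI> history = new LinkedList<>();

    public SpyStrategy(URI uri) {
        history.push(uri);
    }

    @Override
    public HttpUriRequest getRedirect(
            HttpRequest request,
            HttpResponse response,
            HttpContext context) throws ProtocolException {
        // Let the default strategy build the redirect, then record its target.
        HttpUriRequest redirect = super.getRedirect(request, response, context);
        history.push(redirect.getURI());
        return redirect;
    }
}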

The expand method sends a HEAD request. As the client follows redirects automatically, the strategy collects each URI in the spy.history deque:

public static Deque<URI> expand(String uri) {
    try {
        HttpHead head = new HttpHead(uri);
        SpyStrategy spy = new SpyStrategy(head.getURI());
        DefaultHttpClient client = new DefaultHttpClient();
        client.setRedirectStrategy(spy);
        // FIXME: the following completely ignores HTTP errors:
        client.execute(head);
        return spy.history;
    }
    catch (IOException e) {
        throw new RuntimeException(e);
    }
}
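Because SpyStrategy records URIs with push, the deque returned by expand is newest-first: the final URL is at the head and the originally requested URL is at the tail. A small sketch of reading it, using only JDK classes (the `RedirectHistory` helper and its method names are hypothetical, not part of HttpComponents):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class RedirectHistory {
    /** The URL that was requested last, i.e. the end of the redirect chain. */
    static URI finalUri(Deque<URI> history) {
        return history.peekFirst();
    }

    /** The chain in the order the requests were made (original URL first). */
    static List<URI> inRequestOrder(Deque<URI> history) {
        List<URI> ordered = new ArrayList<>(history.size());
        Iterator<URI> it = history.descendingIterator(); // walks tail to head
        while (it.hasNext()) {
            ordered.add(it.next());
        }
        return ordered;
    }
}
```

With this, `RedirectHistory.finalUri(expand(url))` would give the resolved URL, and `inRequestOrder` the full chain from shortened to final.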

You may want to set the maximum number of redirects followed to something reasonable (instead of the default of 100), like so:

        BasicHttpParams params = new BasicHttpParams();
        params.setIntParameter(ClientPNames.MAX_REDIRECTS, 5);
        DefaultHttpClient client = new DefaultHttpClient(params);
