
JSOUP: get the HTML content from a redirected link

Consider the following URL: http://www.google.com/url?rct=j&sa=t&url=http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings&ct=ga&cd=CAIyHWU3NmVhMGQ0NWQ3MmRmY2I6Y29tOmVuOlVTOlJM&usg=AFQjCNE_8XwECqkmyPIMzcSxCDh2hP16wQ . When I pass this URL to Jsoup, the HTML content I get back is not that of the target page. But when I open this URL in a browser, it redirects to http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings .

When I pass that second URL to Jsoup, I get the expected HTML content.

How can I get the same HTML content starting from the first URL?

I have tried many options:

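        import java.net.HttpURLConnection;
        import org.jsoup.Connection.Response;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;

        // Note: with followRedirects(true), jsoup follows HTTP 3xx responses itself,
        // so the manual check below only matters when followRedirects(false) is used.
        Response response = Jsoup.connect(url).followRedirects(true).timeout(timeOut*1000).userAgent(userAgent).execute();
        int status = response.statusCode();
        if (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM || status == HttpURLConnection.HTTP_SEE_OTHER) {
            String redirectUrl = response.header("location");
            response = Jsoup.connect(redirectUrl).followRedirects(false).timeout(timeOut*1000).userAgent(userAgent).execute();
        }
        Document doc = response.parse();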

I tried many user agents, .referrer("http://google.com") and other options. I am currently using jsoup version 1.8.3.

Google returns an HTML page with a JavaScript/META redirect:

<script>window.googleJavaScriptRedirect=1</script><script>var n={navigateTo:function(b,a,d){if(b!=a&&b.google){if(b.google.r){b.google.r=0;b.location.href=d;a.location.replace("about:blank");}}else{a.location.replace(d);}}};n.navigateTo(window.parent,window,"http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings");
</script><noscript><META http-equiv="refresh" content="0;URL='http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings'"></noscript>

That is different from a redirect via HTTP headers, and since Jsoup does not interpret JavaScript, you are out of luck.

However, you can of course parse this page to get the real link. In this case that is possible even without accessing Google, since the link is part of the parameters of the original URL.
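Both approaches can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the class name `RedirectTargets` and its two methods are made up for illustration, the regular expression only covers a META refresh shaped like the one shown above, and `URLDecoder.decode(String, Charset)` assumes Java 10 or newer. The string passed to `fromMetaRefresh` would be the page body, for example what jsoup's `response.body()` returns.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class RedirectTargets {

    // Matches content="0;URL='...'" in a META refresh tag; quotes around the URL are optional.
    private static final Pattern META_REFRESH = Pattern.compile(
            "http-equiv=[\"']?refresh[\"']?[^>]*?content=[\"']\\s*\\d+\\s*;\\s*url\\s*=\\s*'?([^'\">]+)",
            Pattern.CASE_INSENSITIVE);

    // Option 1: read the "url" query parameter of the Google link, no network access needed.
    static Optional<String> fromUrlParameter(String googleUrl) {
        String query = URI.create(googleUrl).getRawQuery();
        if (query == null) {
            return Optional.empty();
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("url")) {
                // Decode percent-escapes in case the target was URL-encoded.
                return Optional.of(URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8));
            }
        }
        return Optional.empty();
    }

    // Option 2: pull the target out of the META refresh tag in the HTML that Google returned.
    static Optional<String> fromMetaRefresh(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String googleUrl = "http://www.google.com/url?rct=j&sa=t"
                + "&url=http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings"
                + "&ct=ga&cd=CAIyHWU3NmVhMGQ0NWQ3MmRmY2I6Y29tOmVuOlVTOlJM&usg=AFQjCNE_8XwECqkmyPIMzcSxCDh2hP16wQ";
        System.out.println(fromUrlParameter(googleUrl).orElse("no url parameter found"));
    }
}
```

The value returned by either method can then be passed to `Jsoup.connect(...)` in place of the Google URL. If neither method finds a target, falling back to the original URL is probably the safest choice.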
