How does the web page redirect work on this page?
I'm trying to retrieve links from this page: http://www.seas.harvard.edu/academics/areas
There is a link named "Computer Science" in the middle of the page. Its underlying link is given as "/academics/areas/computer-science".
I'm able to convert it to an absolute URL with the Java built-in URL class, obtaining "http://www.seas.harvard.edu/academics/areas/computer-science".
When I click the link in the Chrome browser, however, the absolute URL changes to "http://www.seas.harvard.edu/computer-science".
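For reference, the conversion step can be sketched roughly as below. This is a minimal sketch, not the exact code in use: the class name ResolveLink and the helper toAbsolute are made up for illustration, and it resolves through java.net.URI before converting to a URL. It only resolves the path against the page address, so it cannot know about any redirect the server issues later.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class ResolveLink {

    /** Resolves a link found on a page against that page's own address. */
    static URL toAbsolute(String pageUrl, String href) throws MalformedURLException {
        // A leading "/" in href replaces the whole path of pageUrl.
        return URI.create(pageUrl).resolve(href).toURL();
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(toAbsolute(
                "http://www.seas.harvard.edu/academics/areas",
                "/academics/areas/computer-science"));
    }
}
```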
So my question is two-fold:
I need to obtain the URL after the redirect because I want to read the source code of the page, but the URL before the redirect doesn't work for me. I'm using the JSoup library to read from the URL, so I suspect it might be a JavaScript-based redirect.
From curl --dump-header [file] [URL], the file looked like:
HTTP/1.1 301 Moved Permanently
Age: 0
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Content-Type: text/html
Date: Tue, 13 Aug 2013 13:00:12 GMT
ETag: "1376398812"
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Last-Modified: Tue, 13 Aug 2013 13:00:12 GMT
Location: http://www.seas.harvard.edu/computer-science
Server: nginx
Vary: Accept-Encoding
Via: 1.1 varnish
X-AH-Environment: prod
X-Cache: MISS
X-Drupal-Cache: MISS
X-Redirect-ID: 44
X-Varnish: 2704315535
transfer-encoding: chunked
Connection: keep-alive
As you can see, this is a 301 permanent redirect served from the server.
You can use HttpURLConnection to connect, but before connecting, call myConn.setInstanceFollowRedirects(true). Redirects are then followed, and you can get the input stream and read it.
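A rough sketch of that approach follows. To keep it runnable without touching the real site, it starts a throwaway local server (the JDK's com.sun.net.httpserver.HttpServer) that imitates the 301 from the header dump; the class name FollowRedirectDemo, the helper fetch, and the fake page body are assumptions made for illustration only. After the stream has been opened, getURL() reports the address that was actually served.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class FollowRedirectDemo {

    /** Follows redirects and returns { final URL, page source }. */
    static String[] fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setInstanceFollowRedirects(true); // must be set before connecting
        try (InputStream in = conn.getInputStream()) {
            String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            // Once connected, getURL() reflects the redirect that was followed.
            return new String[] { conn.getURL().toString(), body };
        } finally {
            conn.disconnect();
        }
    }

    /** Runs fetch() against a local stand-in for the real server. */
    static String[] demo() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        server.createContext("/academics/areas/computer-science", ex -> {
            ex.getResponseHeaders().set("Location", "/computer-science");
            ex.sendResponseHeaders(301, -1); // -1: no response body
            ex.close();
        });
        server.createContext("/computer-science", ex -> {
            byte[] page = "<html>Computer Science</html>".getBytes(StandardCharsets.UTF_8);
            ex.sendResponseHeaders(200, page.length);
            try (OutputStream out = ex.getResponseBody()) {
                out.write(page);
            }
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            return fetch("http://localhost:" + port + "/academics/areas/computer-science");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        String[] result = demo();
        System.out.println("Final URL: " + result[0]);
        System.out.println(result[1]);
    }
}
```

Against the real site, fetch would be called with the pre-redirect address instead of the local one.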
You can use HttpURLConnection to connect, but before connecting, call myConn.setInstanceFollowRedirects(false) to not follow redirects. The 301 response is then returned as-is, with the actual URL in its Location header.
The Location header can then be read from the response. One way is to iterate over an integer index after making the connection, calling getHeaderFieldKey with it and checking whether the result equals Location; if it does, call getHeaderField with the same integer to get the location. HttpURLConnection also allows retrieving a header by name through getHeaderField(String), so getHeaderField("Location") is a shorter route.
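The index-based scan described above might look something like this. It is a sketch under assumptions: the class name LocationHeaderDemo and the helper redirectTarget are invented for illustration, and a throwaway local server (the JDK's com.sun.net.httpserver.HttpServer) stands in for the real site so the example runs offline.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

public class LocationHeaderDemo {

    /** Returns the Location header of the first response, or null if absent. */
    static String redirectTarget(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // stop at the 301 itself
        try {
            // Walk the headers by index. Some implementations expose the
            // status line at index 0 with a null key, so null keys are skipped.
            for (int i = 0; conn.getHeaderField(i) != null; i++) {
                if ("Location".equalsIgnoreCase(conn.getHeaderFieldKey(i))) {
                    return conn.getHeaderField(i);
                }
            }
            return null; // conn.getHeaderField("Location") is the by-name shortcut
        } finally {
            conn.disconnect();
        }
    }

    /** Runs redirectTarget() against a local server that always answers 301. */
    static String demo() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        server.createContext("/", ex -> {
            ex.getResponseHeaders().set("Location", "/computer-science");
            ex.sendResponseHeaders(301, -1); // -1: no response body
            ex.close();
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            return redirectTarget(
                    "http://localhost:" + port + "/academics/areas/computer-science");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Redirect target: " + demo());
    }
}
```

The returned value may be relative, as it is in this local demo, in which case it has to be resolved against the request URL before being fetched.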
I used Fiddler to investigate, and for the link http://www.seas.harvard.edu/academics/areas/computer-science the site returns an HTTP 301 response code, which performs the redirect.
If you want to get the real URL, you should perform a real request to the harvard.edu web server and parse the response. (The redirect URL is located in the Location key of the HTTP header.)
Sorry about your second question; I don't have skill in Java.
This SO question may help (httpclient-4-how-to-capture-last-redirect-url).
Possibly an .htaccess and mod_rewrite redirect. Using Firefox's Console I could see the requests, including a 301 Moved Permanently message. This tells the browser to redirect to the address returned in the Location header of the response.
I can only attempt to address Q1 since I'm not a Java programmer. The source code says they're using Drupal, so I speculate that they're using Drupal's global redirect module (see the SO discussion about the Drupal redirect module). Looking at the module's documentation might shed some light on how to obtain the correct URL with Java.
There are also numerous ways within JavaScript to have URL requests automatically redirect to some base page (e.g., the CS homepage), while physically navigating the site allows the user to advance to new pages. This is standard practice in many single-page web apps. If this is the case, then @hexafraction's suggestion might help you retrieve the desired URL, though I'm unfamiliar with the Java methods (s)he is suggesting.
You can get the redirect URL with the code below by setting followRedirects to false.
You will get the source code of the redirected page if you set it to true, which is the default behavior of Jsoup.
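import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Do not follow the redirect, so the 301 response itself is returned.
Connection con = Jsoup.connect("http://www.seas.harvard.edu/academics/areas/computer-science")
        .userAgent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
        .followRedirects(false);
// The Location header holds the redirect target; it is null if followRedirects is true.
System.out.println("Redirected Url : " + con.execute().header("Location"));
// With followRedirects(false) this is the body of the 301 response, not the target page.
Document doc = con.get();
System.out.println(doc.html());
System.out.println("=================================================");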