
How does web page redirect work in this page?

I'm trying to retrieve links from this page: http://www.seas.harvard.edu/academics/areas

There is a link named "Computer Science" in the middle of the page. Its underlying href is "/academics/areas/computer-science". I'm able to convert it to an absolute URL with Java's built-in URL class, obtaining "http://www.seas.harvard.edu/academics/areas/computer-science".
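The conversion described above can be sketched as follows. This is only an illustration of resolving a relative href against the page URL with java.net.URL; the class name ResolveLink and the helper resolve are made up for the example.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveLink {
    // Resolves an href found on a page against that page's own URL.
    // (Hypothetical helper; the two-argument URL constructor does the work.)
    static String resolve(String pageUrl, String href) {
        try {
            return new URL(new URL(pageUrl), href).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://www.seas.harvard.edu/academics/areas/computer-science
        System.out.println(resolve("http://www.seas.harvard.edu/academics/areas",
                "/academics/areas/computer-science"));
    }
}
```

Note that this only builds the URL as written in the page; it does not contact the server, so it cannot know about any redirect.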

When I click the link in Chrome, however, the URL changes to "http://www.seas.harvard.edu/computer-science".

So my question is two-fold:

  1. How does the URL redirect work in this page?
  2. Is there any library or method in Java that would help me obtain the URL after redirect?

I need to obtain the URL after the redirect because I want to read the source code of the page, but the URL before the redirect doesn't work for me. I'm using the JSoup library to read from the URL, and I suspect it might be a JavaScript-based redirect.

Running curl --dump-header [file] [URL] produced a file that looked like:

HTTP/1.1 301 Moved Permanently
Age: 0
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Content-Type: text/html
Date: Tue, 13 Aug 2013 13:00:12 GMT
ETag: "1376398812"
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Last-Modified: Tue, 13 Aug 2013 13:00:12 GMT
Location: http://www.seas.harvard.edu/computer-science
Server: nginx
Vary: Accept-Encoding
Via: 1.1 varnish
X-AH-Environment: prod
X-Cache: MISS
X-Drupal-Cache: MISS
X-Redirect-ID: 44
X-Varnish: 2704315535
transfer-encoding: chunked
Connection: keep-alive

As you can see, this is a 301 permanent redirect sent by the server.

To obtain the data:

You can use HttpURLConnection to connect. Before connecting, call myConn.setInstanceFollowRedirects(true) (this is also the default). Redirects are then followed, and you can get the connection's input stream and read the page from it.
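A minimal sketch of this approach, assuming plain HttpURLConnection from the JDK. The class name FollowRedirect and the helper fetch are invented for the example. It also reads getURL() after the body has been consumed, which at that point reflects the URL the connection ended up at.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FollowRedirect {
    // Returns {final URL, body} after letting HttpURLConnection follow redirects.
    // (Hypothetical helper for illustration; assumes a UTF-8 body.)
    static String[] fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setInstanceFollowRedirects(true); // already the default; shown for clarity
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        // Once the response has been read, getURL() is the last URL that was requested.
        String finalUrl = conn.getURL().toString();
        conn.disconnect();
        return new String[] { finalUrl, body.toString() };
    }

    public static void main(String[] args) throws IOException {
        String[] page = fetch("http://www.seas.harvard.edu/academics/areas/computer-science");
        System.out.println("Final URL: " + page[0]);
        System.out.println(page[1]);
    }
}
```

One caveat: HttpURLConnection does not follow a redirect that switches protocol (for example from http to https), so in that case the 3xx response itself is what you get back.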

To obtain the URL itself:

You can use HttpURLConnection to connect, but before connecting, call myConn.setInstanceFollowRedirects(false) so that redirects are not followed. The response you get is then the 301 itself, and its Location header holds the URL you are being redirected to.

After making the connection, read that header by name with myConn.getHeaderField("Location"). Alternatively, you can iterate over an integer index, calling getHeaderFieldKey(i) until it returns Location (compare case-insensitively) and then calling getHeaderField(i) with the same index. The convenience methods getHeaderFieldDate and getHeaderFieldInt are only for headers that hold dates or numbers, which a location does not.
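As a rough sketch of reading the redirect target with redirects disabled: the example below uses getHeaderField(String), which java.net.URLConnection provides for looking up a response header by name. The class name RedirectTarget and the helper redirectTarget are made up for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectTarget {
    // Returns the Location header of a redirect response, or null if the
    // server did not answer with a redirect status. (Hypothetical helper.)
    static String redirectTarget(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setInstanceFollowRedirects(false); // keep the 3xx response instead of following it
        try {
            int status = conn.getResponseCode();
            boolean redirect = status == HttpURLConnection.HTTP_MOVED_PERM   // 301
                    || status == HttpURLConnection.HTTP_MOVED_TEMP           // 302
                    || status == HttpURLConnection.HTTP_SEE_OTHER            // 303
                    || status == 307 || status == 308;
            return redirect ? conn.getHeaderField("Location") : null;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(redirectTarget(
                "http://www.seas.harvard.edu/academics/areas/computer-science"));
    }
}
```

The value of Location may itself be a relative URL on some servers, in which case it would need to be resolved against the request URL before use.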

I used Fiddler to investigate: for the link http://www.seas.harvard.edu/academics/areas/computer-science the site returns an HTTP 301 response code, which performs the redirect.

If you want to get the real URL, you should send a real request to the harvard.edu web server and parse the response. (The redirect URL is in the Location field of the HTTP response headers.)

Sorry, I can't help with your second question; I don't have experience with Java.

This SO question may help (httpclient-4-how-to-capture-last-redirect-url).

  1. There is probably a server-side redirect rule, e.g. .htaccess with mod_rewrite. Using Firefox's Console I could see the requests: the server is sending back a 301 Moved Permanently message. This tells the browser to redirect to the address returned in the Location header of the response.
  2. The way you obtain the changed URL depends on the way you load the page:
    • If you use a ready-made library to load the page into e.g. a DOM object, that library's HTTP layer will probably follow the redirect automatically, and you can then get the URL from the URL of the loaded page. If it does not do that, you must check for status code 301 or 302; when one of those is received, the changed URL is in the Location header of the response.
    • If you have written your own code to load the response via TCP sockets, load the response as normal, but again check for the 301 and 302 status codes and proceed as described in the previous bullet.
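For the hand-written case, the status and header check could look something like the sketch below. It assumes the raw response head has already been read from the socket into a string; the class name RawRedirectParser and the helper locationOf are hypothetical, and the sample headers are taken from the curl dump in this thread.

```java
public class RawRedirectParser {
    // Returns the Location value if the raw response head is a 301 or 302,
    // otherwise null. Header names are matched case-insensitively.
    static String locationOf(String rawHead) {
        String[] lines = rawHead.split("\r?\n");
        // Status line looks like: HTTP/1.1 301 Moved Permanently
        String[] statusLine = lines[0].trim().split("\\s+");
        if (statusLine.length < 2) {
            return null;
        }
        if (!statusLine[1].equals("301") && !statusLine[1].equals("302")) {
            return null;
        }
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // blank line marks the end of the headers
            }
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Location")) {
                return line.substring(colon + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String head = "HTTP/1.1 301 Moved Permanently\r\n"
                + "Content-Type: text/html\r\n"
                + "Location: http://www.seas.harvard.edu/computer-science\r\n"
                + "Server: nginx\r\n\r\n";
        // prints http://www.seas.harvard.edu/computer-science
        System.out.println(locationOf(head));
    }
}
```

This deliberately ignores details such as folded header lines and other 3xx codes; it is only meant to show where the status code and the Location header sit in the response.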

I can only attempt to address Q1 since I'm not a Java programmer. The page source says they're using Drupal, so I speculate that they're using Drupal's Global Redirect module (there is an SO discussion about the Drupal redirect module). Looking at the module's documentation might shed some light on how to obtain the correct URL with Java.

There are also numerous ways within JavaScript to have URL requests automatically redirect to some base page (e.g. the CS homepage), while physically navigating the site allows the user to advance to new pages. This is standard practice in many single-page web apps. If this is the case, then @hexafraction's suggestion might be able to help you retrieve the desired URL, though I'm unfamiliar with the Java methods (s)he is suggesting.

You can get the redirect URL with the code below by setting followRedirects to false.

If you set it to true, which is Jsoup's default behavior, you will get the source code of the redirected page instead.

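    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class RedirectDemo {
        public static void main(String[] args) throws Exception {
            Connection con = Jsoup.connect("http://www.seas.harvard.edu/academics/areas/computer-science")
                    .userAgent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
                    .followRedirects(false); // keep the 301 response instead of following it

            // The Location header holds the redirect target (null if followRedirects is true).
            System.out.println("Redirected Url : " + con.execute().header("Location"));

            // With followRedirects(false) this is the body of the 301 response, not the target page.
            Document doc = con.get();
            System.out.println(doc.html());
        }
    }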
