I'm trying to retrieve links from this page: http://www.seas.harvard.edu/academics/areas
There is a link named "Computer Science" in the middle of the page. Its underlying link is given as "/academics/areas/computer-science". I'm able to convert it to an absolute URL with Java's built-in URL class, obtaining "http://www.seas.harvard.edu/academics/areas/computer-science".
When I click the link in the Chrome browser, however, the absolute URL changes to "http://www.seas.harvard.edu/computer-science".
So my question is two-fold:
I need to obtain the URL after the redirect because I want to read the source code of the page, but the URL before the redirect doesn't work for me. I'm using the JSoup library to read from the URL, so I suspect it might be a JavaScript-based redirect.
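For reference, the relative-to-absolute conversion described in the question can be sketched as follows. This is a minimal illustration that uses java.net.URI.resolve rather than the URL class mentioned above; the class name LinkResolver and the method toAbsolute are hypothetical names chosen for this example.

```java
import java.net.URI;

final class LinkResolver {

    /** Resolves an href found on a page against that page's URL. */
    static String toAbsolute(String pageUrl, String href) {
        // URI.resolve applies the standard relative-reference rules:
        // an href starting with "/" replaces the whole path of the base.
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.seas.harvard.edu/academics/areas";
        String href = "/academics/areas/computer-science";
        System.out.println(toAbsolute(page, href));
    }
}
```

Note that this only builds the pre-redirect URL; it does not contact the server, so it cannot reveal where the server will redirect to.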
From curl --dump-header [file] [URL] the file looked like:
HTTP/1.1 301 Moved Permanently
Age: 0
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Content-Type: text/html
Date: Tue, 13 Aug 2013 13:00:12 GMT
ETag: "1376398812"
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Last-Modified: Tue, 13 Aug 2013 13:00:12 GMT
Location: http://www.seas.harvard.edu/computer-science
Server: nginx
Vary: Accept-Encoding
Via: 1.1 varnish
X-AH-Environment: prod
X-Cache: MISS
X-Drupal-Cache: MISS
X-Redirect-ID: 44
X-Varnish: 2704315535
transfer-encoding: chunked
Connection: keep-alive
As you can see, this is a 301 permanent redirect sent by the server itself, not a JavaScript redirect.
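If a header dump like the one above needs to be inspected programmatically, a rough sketch could look like this. It assumes the dump is plain text with a status line followed by "Name: value" lines; the class name HeaderDump and both method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class HeaderDump {

    /** Parses "Name: value" lines into a map keyed by lower-cased header name. */
    static Map<String, String> parseHeaders(String dump) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : dump.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) { // the status line has no colon and is skipped
                String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(name, line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    /** Extracts the numeric code from a status line such as "HTTP/1.1 301 Moved Permanently". */
    static int statusCode(String dump) {
        String statusLine = dump.split("\\r?\\n", 2)[0];
        return Integer.parseInt(statusLine.split(" ")[1]);
    }

    public static void main(String[] args) {
        String dump = "HTTP/1.1 301 Moved Permanently\n"
                + "Location: http://www.seas.harvard.edu/computer-science\n"
                + "Server: nginx\n";
        System.out.println(statusCode(dump) + " -> " + parseHeaders(dump).get("location"));
    }
}
```

Header names are lower-cased because HTTP header names are case-insensitive, which is also why the dump can contain both "Location" and "transfer-encoding".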
To read the page after the redirect, you can use HttpURLConnection to connect, but before connecting, call myConn.setInstanceFollowRedirects(true). Redirects are then followed, and you can get the input stream and read it.
To get only the redirect URL, you can use HttpURLConnection to connect, but before connecting, call myConn.setInstanceFollowRedirects(false) to not follow redirects. The redirect target then remains available in the Location header of the response.
After making the connection, you can read the header by name with getHeaderField("Location"). Alternatively, you can iterate an integer index, calling getHeaderFieldKey for each index and checking whether it equals Location, and if it does, calling getHeaderField with the same index to get the location.
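The two approaches above can be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the class name RedirectProbe and its method names are hypothetical, and the sketch reads the header by name through getHeaderField(String), which URLConnection provides. A Location value may be relative, so it is resolved against the requested URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class RedirectProbe {

    /** Resolves a Location value (absolute or relative) against the requested URL. */
    static String resolveLocation(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location).toString();
    }

    /** Sends one request without following redirects; returns the target, or null if none. */
    static String redirectTarget(String requestedUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(requestedUrl).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // must be set before connecting
        try {
            int status = conn.getResponseCode(); // triggers the actual request
            String location = conn.getHeaderField("Location");
            boolean isRedirect = status >= 300 && status < 400;
            return (isRedirect && location != null)
                    ? resolveLocation(requestedUrl, location)
                    : null;
        } finally {
            conn.disconnect();
        }
    }

    /** Follows redirects and reports the URL the connection ended up at. */
    static String finalUrl(String requestedUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(requestedUrl).toURL().openConnection();
        conn.setInstanceFollowRedirects(true);
        try {
            conn.getResponseCode(); // sends the request and follows any redirects
            return conn.getURL().toString(); // expected to reflect the last URL fetched
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String start = "http://www.seas.harvard.edu/academics/areas/computer-science";
        System.out.println("Location target: " + redirectTarget(start));
        System.out.println("Final URL:       " + finalUrl(start));
    }
}
```

One caveat: HttpURLConnection does not automatically follow a redirect that switches protocol (for example http to https), so in that case redirectTarget would have to be called repeatedly by hand.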
I used Fiddler to investigate, and for the link http://www.seas.harvard.edu/academics/areas/computer-science the site returns an HTTP 301 response code, which performs the redirect.
If you want to get the real URL, you should perform a real request to the harvard.edu web server and parse the response. (The redirect URL is in the Location key of the HTTP header.)
Sorry about your second question; I don't have any skill in Java.
This SO question may help (httpclient-4-how-to-capture-last-redirect-url).
.htaccess and mod_rewrite redirect. Using Firefox's Console I could see the requests. As you can see below, the server is sending back a 301 Moved Permanently message. This tells the browser to redirect to the address returned in the Location header of the response.
I can only attempt to address Q1 since I'm not a Java programmer. The source code says they're using Drupal, so I speculate that they're using Drupal's global redirect module (SO discussion about Drupal redirect module here). Looking at the module's documentation might shed some light on how to obtain the correct URL with Java.
There are also numerous ways within JavaScript to have URL requests automatically redirect to some base page (e.g., the CS homepage), while physically navigating the site allows the user to advance to new pages. This is standard practice in many single-page web apps. If this is the case, then @hexafraction's suggestion might be able to help you retrieve the desired URL, though I'm unfamiliar with the Java methods (s)he is suggesting.
You can get the redirect URL with the code below by setting followRedirects to false.
You will get the source code of the redirected page if you set it to true, which is the default behavior of Jsoup.
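import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Connection con = Jsoup.connect("http://www.seas.harvard.edu/academics/areas/computer-science")
        .userAgent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
        .followRedirects(false); // keep the 301 response instead of following it
// The Location header holds the redirect target; it is null if followRedirects is true
System.out.println("Redirected Url : " + con.execute().header("Location"));
// With followRedirects(false) this is the body of the 301 response, not the target page
Document doc = con.get();
System.out.println(doc.html());
System.out.println("=================================================");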