How does web page redirect work in this page?

I'm trying to retrieve links from this page: http://www.seas.harvard.edu/academics/areas

There is a link named "Computer Science" in the middle of the page. Its underlying link is given as "/academics/areas/computer-science". I'm able to convert it to an absolute URL with the Java built-in URL class, obtaining "http://www.seas.harvard.edu/academics/areas/computer-science".

When I click the link in the Chrome browser, however, the absolute URL changes to "http://www.seas.harvard.edu/computer-science".

So my question is two-fold:

  1. How does the URL redirect work in this page?
  2. Is there any library or method in Java that would help me obtain the URL after the redirect?

I need to obtain the URL after the redirect because I want to read the source code of the page, but the URL before the redirect doesn't work for me. I'm using the JSoup library to read from the URL, so I suspect it might be a JavaScript-based redirect.

Running curl --dump-header [file] [URL] produced a file that looked like:

HTTP/1.1 301 Moved Permanently
Age: 0
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Content-Type: text/html
Date: Tue, 13 Aug 2013 13:00:12 GMT
ETag: "1376398812"
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Last-Modified: Tue, 13 Aug 2013 13:00:12 GMT
Location: http://www.seas.harvard.edu/computer-science
Server: nginx
Vary: Accept-Encoding
Via: 1.1 varnish
X-AH-Environment: prod
X-Cache: MISS
X-Drupal-Cache: MISS
X-Redirect-ID: 44
X-Varnish: 2704315535
transfer-encoding: chunked
Connection: keep-alive

As you can see, this is a 301 permanent redirect sent by the server.

To obtain the data:

You can use HttpURLConnection to connect, but before connecting, call myConn.setInstanceFollowRedirects(true). Redirects are then followed, and you can get the connection's input stream and read the page from it.
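As a rough sketch of the follow-redirects case: the class below fetches a page with HttpURLConnection, lets it follow the 301, and reports both the final URL (via getURL()) and the body. The names FollowRedirect, fetch and startStubServer are made up for illustration, and the stub server (built on the JDK's com.sun.net.httpserver) is only a local stand-in for the real site so the example runs without network access.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class FollowRedirect {

    /** Fetches a page, following redirects; returns { final URL, response body }. */
    static String[] fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setInstanceFollowRedirects(true); // normally already the default
        try (InputStream in = conn.getInputStream()) {
            String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            // After the redirect has been followed, getURL() reflects the final address.
            return new String[] { conn.getURL().toString(), body };
        } finally {
            conn.disconnect();
        }
    }

    /** Hypothetical local stand-in for the site: one path redirects, the other serves a page. */
    static HttpServer startStubServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/academics/areas/computer-science", exchange -> {
            exchange.getResponseHeaders().add("Location", "/computer-science");
            exchange.sendResponseHeaders(301, -1); // -1 means no response body
            exchange.close();
        });
        server.createContext("/computer-science", exchange -> {
            byte[] page = "<html>Computer Science</html>".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, page.length);
            exchange.getResponseBody().write(page);
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startStubServer();
        try {
            String start = "http://127.0.0.1:" + server.getAddress().getPort()
                    + "/academics/areas/computer-science";
            String[] result = fetch(start);
            System.out.println("Final URL: " + result[0]);
            System.out.println("Body: " + result[1]);
        } finally {
            server.stop(0);
        }
    }
}
```

Against the real site you would pass the original link to fetch instead of the stub address. Note that HttpURLConnection does not follow redirects that switch protocol (for example http to https).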

To obtain the URL itself:

You can use HttpURLConnection to connect, but before connecting, call myConn.setInstanceFollowRedirects(false) so that redirects are not followed. The connection then hands you the 301 response itself, and the target URL sits in its Location header.

Reading that header is simpler than it may look: after making the connection, myConn.getHeaderField("Location") returns the header by name, with no date parsing involved (getHeaderFieldDate is only for headers that really are dates). If you prefer, you can also iterate an integer index, calling getHeaderFieldKey with it and checking whether the key equals Location, and if it does, calling getHeaderField with the same integer to get the location. Either way, check the response code first so you only read Location from a 301 or 302.
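Here is a minimal sketch of reading the redirect target without following it. It relies on URLConnection.getHeaderField(String), which looks a header up by name. The names RedirectLocation, redirectTarget and startStubServer are hypothetical, and the stub server is just a local stand-in that answers with the same kind of 301 the curl dump shows, so the example does not need the real site.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

class RedirectLocation {

    /** Returns the Location header of a 3xx response, or null if the URL does not redirect. */
    static String redirectTarget(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // keep the 301/302 instead of following it
        try {
            int status = conn.getResponseCode();
            if (status >= 300 && status < 400) {
                return conn.getHeaderField("Location"); // header lookup by name
            }
            return null;
        } finally {
            conn.disconnect();
        }
    }

    /** Hypothetical local stand-in for the site: answers every request with a 301. */
    static HttpServer startStubServer(String target) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            exchange.getResponseHeaders().add("Location", target);
            exchange.sendResponseHeaders(301, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startStubServer("http://www.seas.harvard.edu/computer-science");
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort()
                    + "/academics/areas/computer-science";
            System.out.println("Redirects to: " + redirectTarget(url));
        } finally {
            server.stop(0);
        }
    }
}
```

A server may also send a relative Location value such as "/computer-science"; in that case resolve it against the request URL before using it.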

I used Fiddler to investigate, and for the link http://www.seas.harvard.edu/academics/areas/computer-science the site returns an HTTP 301 response code, which performs the redirect.

If you want to get the real URL, you should send a real request to the harvard.edu web server and parse the response. (The redirect URL is in the Location field of the HTTP header.)

Sorry, I can't help with your second question. I'm not skilled in Java.

This SO question may help (httpclient-4-how-to-capture-last-redirect-url).

  1. There is probably, e.g., a .htaccess and mod_rewrite redirect. Using Firefox's Console I could see the requests. The server is sending back a 301 Moved Permanently message. This tells the browser to redirect to the address returned in the Location header of the response. [image: network requests]
  2. The way you obtain the changed URL depends on the way you load the page:
    • If you use a ready-made library to load the page into, e.g., a DOM object, its HTTP layer will probably follow the redirect automatically, and you can read the final URL from the URL of the loaded page. If it does not do that, you must check for status code 301 or 302; when one of those is received, the changed URL is in the Location header of the response.
    • If you have written your own code to load the response via TCP sockets, load the response as normal, but again check for the 301 and 302 status codes and proceed as described in the previous item.

I can only attempt to address Q1 since I'm not a Java programmer. The source code says they're using Drupal, so I speculate that they're using Drupal's global redirect module (SO discussion about Drupal redirect module here). Looking at the module's documentation might shed some light on how to obtain the correct URL with Java.

There are also numerous ways in JavaScript to have URL requests automatically redirect to some base page (e.g., the CS homepage), while manually navigating the site allows the user to advance to new pages. This is standard practice in many single-page web apps. If this is the case, then @hexafraction's suggestion might be able to help you retrieve the desired URL, though I'm unfamiliar with the Java methods (s)he is suggesting.

You can get the redirect URL with the code below by setting followRedirects to false.

If you set it to true, which is the default behavior of Jsoup, you will get the source code of the redirected page instead.

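    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String url = "http://www.seas.harvard.edu/academics/areas/computer-science";
    String userAgent = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36";

    // followRedirects(false): the 301 response itself comes back, so Location is readable.
    Connection.Response response = Jsoup.connect(url)
            .userAgent(userAgent)
            .followRedirects(false)
            .execute();
    System.out.println("Redirected Url : " + response.header("Location")); // null if followRedirects is true

    // Default followRedirects(true): get() returns the page at the redirect target.
    Document doc = Jsoup.connect(url).userAgent(userAgent).get();
    System.out.println(doc.html());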
