Extracting relative links from a web page in the proper format using Jsoup

Question

I have parsed the outlinks of a web page, and I want to parse each of them in turn using Jsoup. The problem is that the links are of the form ../../../pincode/india/andaman-and-nicobar-islands/, and in this form I cannot parse them. Following another Stack Overflow post, I converted them to absolute URLs using link.attr("abs:href").

The URL of the first web page that I parsed is http://www.mapsofindia.com/pincode/india/. The absolute URLs that I get after parsing are of the form http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/, but I cannot parse them further using Jsoup. When I execute the following statement:

Jsoup.parse("http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/");

it gives an HTTP 400 error, i.e. Bad Request. So I think there is some problem with the URLs. Can anyone please help me get the URLs in the proper form so that I can parse them further? Thank you.

Answer 1

Please test these two things:

Try using link.absUrl("href") instead of link.attr("abs:href").
Check the base URI (call baseUri() on your element or document).

By the way, you are better off using the connect() method to fetch the page, since Jsoup.parse(String) treats its argument as HTML markup rather than as a URL to load:

Document doc = Jsoup.connect("http://<your url here>").get();
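
If the resolved URL still starts with a /.. segment, as in the question, one option is to clean it up yourself before passing it to connect(). The following is only a minimal sketch using java.net.URI from the standard library, not a Jsoup API. The class name RelativeLinks and the method resolveClean are hypothetical, and it assumes plain http/https URLs with a host part:

```java
import java.net.URI;

class RelativeLinks {
    // Resolve href against base, then drop any ".." segments that remain
    // at the start of the path (they would point above the site root).
    // Assumes base is an absolute http/https URL with an authority part.
    static String resolveClean(String base, String href) {
        URI resolved = URI.create(base).resolve(href).normalize();
        String path = resolved.getRawPath();
        while (path.startsWith("/../")) {
            path = path.substring(3); // strip "/..", keep the following "/"
        }
        StringBuilder url = new StringBuilder()
                .append(resolved.getScheme()).append("://")
                .append(resolved.getRawAuthority())
                .append(path);
        if (resolved.getRawQuery() != null) {
            url.append('?').append(resolved.getRawQuery());
        }
        if (resolved.getRawFragment() != null) {
            url.append('#').append(resolved.getRawFragment());
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Base page and relative link taken from the question.
        System.out.println(resolveClean(
                "http://www.mapsofindia.com/pincode/india/",
                "../../../pincode/india/andaman-and-nicobar-islands/"));
    }
}
```

The cleaned string can then be handed to Jsoup.connect(...).get() as shown above. Whether this step is needed at all may depend on your Jsoup version, so it is worth checking what absUrl("href") returns first.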

Answer 1 timestamp: 2013-04-13 17:22:05
