
Extracting relative links from a web page in proper format using Jsoup

I have extracted the outgoing links of a web page and want to parse each of them in turn using Jsoup. The problem is that the links are relative, of the form ../../../pincode/india/andaman-and-nicobar-islands/, and I cannot parse them in this form. Following another Stack Overflow post, I converted them to absolute URLs using link.attr("abs:href").

The URL of the first web page that I parsed is http://www.mapsofindia.com/pincode/india/. The absolute URLs that I get after the conversion are of the form http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/, but I cannot parse them any further using Jsoup. For example, when I execute the following statement:

Jsoup.parse("http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/");

it gives an HTTP 400 error, i.e. bad request. So I think there is some problem with the URLs. Can anyone please help me get the URLs in a proper form so that I can parse them further? Thank you.

Please test these two things:

  1. Try using link.absUrl("href") instead of link.attr("abs:href").
  2. Check the base URI (by calling baseUri() on your element or document).

By the way, you had better use the connect() method to fetch a page, since Jsoup.parse(String) treats its argument as HTML markup rather than as a URL to download:

Document doc = Jsoup.connect("http://<your url here>").get();
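To see how the base URI affects the result, the resolution step can be reproduced without Jsoup. The following is only a minimal sketch using java.net.URI from the standard library, not what Jsoup does internally. The class name RelativeLinkDemo and the helper stripLeadingParentSegments are hypothetical names introduced for illustration. The sketch assumes that a link with more ../ segments than the base path has directories can leave a /.. at the start of the resolved path, as in the URL from the question, and that dropping those leftover segments gives the address the site intended.

```java
import java.net.URI;

public class RelativeLinkDemo {

    // Resolve a relative href against the URL of the page it was found on.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    // Hypothetical helper: remove ".." segments left over at the start of an
    // absolute path. Note: this simplified version drops any query string
    // and fragment, and assumes the URL has a scheme and a host.
    static String stripLeadingParentSegments(String url) {
        URI uri = URI.create(url).normalize();
        String path = uri.getPath();
        while (path.startsWith("/../")) {
            path = path.substring(3); // cut "/.." and keep the following "/"
        }
        return uri.getScheme() + "://" + uri.getAuthority() + path;
    }

    public static void main(String[] args) {
        String base = "http://www.mapsofindia.com/pincode/india/";
        String href = "../../../pincode/india/andaman-and-nicobar-islands/";

        String resolved = resolve(base, href);
        System.out.println("resolved: " + resolved);
        System.out.println("cleaned:  " + stripLeadingParentSegments(resolved));
    }
}
```

The cleaned string could then be passed to Jsoup.connect(...).get() as shown above. Whether the server actually serves the cleaned address is something to verify against the site itself.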
