Extracting relative links from a web page in proper format using Jsoup

I have extracted the outgoing links of a web page, and I want to fetch and parse each of them with Jsoup. The problem is that the links are relative, of the form ../../../pincode/india/andaman-and-nicobar-islands/, and I cannot parse them in that form. Following another Stack Overflow post, I converted them to absolute URLs using link.attr("abs:href").

The URL of the first page I parsed is http://www.mapsofindia.com/pincode/india/, and the absolute URLs I get back are of the form http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/. I still cannot parse these with Jsoup. When I execute the following statement:

Jsoup.parse("http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/");

it fails with an HTTP 400 (Bad Request) error, so I think there is a problem with the URLs. Can anyone help me get the URLs in a proper form so that I can parse them further? Thank you.

Please try these two things:

  1. Try using link.absUrl("href") instead of link.attr("abs:href").
  2. Check the base URI (call baseUri() on your element or document).

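By the way, Jsoup.parse(String) treats its argument as HTML markup and does not fetch anything, so you are better off using the connect() method to load the page: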

Document doc = Jsoup.connect("http://<your url here>").get();
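If the resolved URL still contains a leading /../ (the relative link climbs one level higher than the base path allows, and some resolvers keep the surplus ".." segment), one possible workaround is to resolve the link yourself and strip those segments before fetching. This is only a sketch using java.net.URI from the standard library, not Jsoup's own mechanism; the class name LinkNormalizer and the method resolve are hypothetical names chosen for illustration, and it assumes ordinary hierarchical http/https URLs.

```java
import java.net.URI;

class LinkNormalizer {

    // Hypothetical helper (not part of Jsoup): resolves href against base
    // and drops any ".." segments left over at the start of the path.
    static String resolve(String base, String href) {
        // URI.resolve merges and normalizes the path, but ".." segments that
        // would climb above the root can survive as a leading "/../".
        URI resolved = URI.create(base).resolve(href);

        String path = resolved.getRawPath();
        while (path.startsWith("/../")) {
            path = path.substring(3); // "/../x" becomes "/x"
        }
        if (path.equals("/..")) {
            path = "/";
        }

        // Rebuild the URL from its raw (still encoded) components.
        StringBuilder url = new StringBuilder();
        url.append(resolved.getScheme()).append("://")
           .append(resolved.getRawAuthority())
           .append(path);
        if (resolved.getRawQuery() != null) {
            url.append('?').append(resolved.getRawQuery());
        }
        if (resolved.getRawFragment() != null) {
            url.append('#').append(resolved.getRawFragment());
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve(
                "http://www.mapsofindia.com/pincode/india/",
                "../../../pincode/india/andaman-and-nicobar-islands/"));
    }
}
```

You would call it with doc.baseUri() (or the URL you passed to connect()) as the base and link.attr("href") as the relative link, then pass the result to Jsoup.connect(...).get().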
