Extracting relative links from a web page in proper format using Jsoup
I have extracted the outlinks of a web page, and I am going to parse each of them again using Jsoup. The problem is that the links are relative, of the form:
../../../pincode/india/andaman-and-nicobar-islands/
In this form I cannot parse them. So, with the help of another Stack Overflow post, I converted them to absolute URLs using link.attr("abs:href").
The URL of the first web page that I parsed is:
http://www.mapsofindia.com/pincode/india/
The absolute URLs that I get after the conversion are of the form:
http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/
But I cannot parse them further using Jsoup. When I execute the following statement:
Jsoup.parse("http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/");
It gives an HTTP 400 error, i.e. Bad Request, so I think there is some problem with the URLs. Can anyone please help me get the URLs in a proper form so that I can parse them further? Thank you.
Please test these two things:
1. Use link.absUrl("href") instead of link.attr("abs:href").
2. Check baseUri() on your element or document.
By the way, you had better use the connect() method for this:
Document doc = Jsoup.connect("http://<your url here>").get();
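If the absolute URL still comes back with a leading /../ as in the question, one possible workaround is to resolve the link yourself and drop the ".." segments that climb above the site root. The sketch below uses only java.net.URI from the standard library. The class name LinkResolver and the method resolve are hypothetical names chosen for illustration and are not part of Jsoup.

```java
import java.net.URI;

final class LinkResolver {

    // Resolves href against baseUrl, then removes any ".." segments that are
    // left at the start of the path because they point above the site root.
    static String resolve(String baseUrl, String href) {
        URI resolved = URI.create(baseUrl).resolve(href).normalize();
        String path = resolved.getRawPath();
        if (path == null) {
            // Opaque URIs such as mailto: have no path to clean up.
            return resolved.toString();
        }
        while (path.startsWith("/../")) {
            path = path.substring(3); // drop "/.." and keep the following "/"
        }
        if (path.equals("/..")) {
            path = "/";
        }
        // Rebuild the URL from its raw parts so existing escapes stay untouched.
        StringBuilder sb = new StringBuilder();
        if (resolved.getScheme() != null) {
            sb.append(resolved.getScheme()).append(':');
        }
        if (resolved.getRawAuthority() != null) {
            sb.append("//").append(resolved.getRawAuthority());
        }
        sb.append(path);
        if (resolved.getRawQuery() != null) {
            sb.append('?').append(resolved.getRawQuery());
        }
        if (resolved.getRawFragment() != null) {
            sb.append('#').append(resolved.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String base = "http://www.mapsofindia.com/pincode/india/";
        String href = "../../../pincode/india/andaman-and-nicobar-islands/";
        // prints http://www.mapsofindia.com/pincode/india/andaman-and-nicobar-islands/
        System.out.println(resolve(base, href));
    }
}
```

The cleaned string can then be passed to Jsoup.connect(...).get() as shown above. Note that href values containing spaces or other illegal characters would need to be escaped first, since URI.create rejects them.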