
Convert relative to absolute links using jsoup

I'm using jsoup to clean an HTML page. The problem is that when I save the HTML locally, the images do not show because they all use relative links.

Here's some example code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;


public class so2 {

    public static void main(String[] args) {

        String html = "<html><head><title>The Title</title></head>"
                  + "<body><p><a href=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" target=\"_blank\"><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></a></p></body></html>";
        Document doc = Jsoup.parse(html,"https://whatever.com"); // baseUri seems to be ignored??

        System.out.println(doc);        
    }
}

Output:

<html>
 <head>
  <title>The Title</title>
 </head>
 <body>
  <p><a href="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" target="_blank"><img width="437" src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" height="418" class="documentimage"></a></p>
 </body>
</html>

The output still shows the links as a href="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".

I would like them converted to a href="https://whatever.com/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".

Can anyone show me how to get jsoup to convert all the links to absolute links?

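You can select all the links and rewrite their href values as absolute URLs using Element.absUrl(). The base URI is not ignored: jsoup leaves attribute values as written when it parses the document, and only uses the base URI when you ask for an absolute URL.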

Example based on your code:

EDIT: added processing of images.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class so2 {

    public static void main(String[] args) {

        String html = "<html><head><title>The Title</title></head>"
                  + "<body><p><a href=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" target=\"_blank\"><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></a></p></body></html>";
        Document doc = Jsoup.parse(html, "https://whatever.com");

        // Select only anchors that have an href, so none gains an empty one
        Elements select = doc.select("a[href]");
        for (Element e : select) {
            // absUrl resolves the attribute value against the document's baseUri
            String absUrl = e.absUrl("href");
            e.attr("href", absUrl);
        }

        // Now do the same for the src attribute of the images
        select = doc.select("img[src]");
        for (Element e : select) {
            e.attr("src", e.absUrl("src"));
        }

        System.out.println(doc);
    }
}
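
To make it clearer what "resolving against the base URI" means, here is a minimal sketch that does the same kind of resolution with the JDK's java.net.URI and no jsoup at all. The class name UrlResolveSketch and the helper toAbsolute are made up for this illustration, and jsoup's own resolution may differ in edge cases (for example, it is more tolerant of malformed links), so treat this as an approximation of the idea rather than what absUrl() does internally.

```java
import java.net.URI;

class UrlResolveSketch {

    // Hypothetical helper: resolves a possibly relative link against a base URI.
    // URI.create throws IllegalArgumentException for strings that are not valid
    // URIs (for example, links containing spaces).
    static String toAbsolute(String baseUri, String link) {
        return URI.create(baseUri).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Root-relative link, as in the question: scheme and host come from the base
        System.out.println(toAbsolute("https://whatever.com",
                "/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif"));

        // Path-relative link: resolved against the directory of the base document
        System.out.println(toAbsolute("https://whatever.com/papers/index.html",
                "img/figure.gif"));

        // Already absolute link: returned unchanged
        System.out.println(toAbsolute("https://whatever.com/",
                "http://example.org/logo.png"));
    }
}
```

In jsoup itself the resolved value should also be available through the abs: attribute prefix, for example e.attr("abs:href"), which is an alternative to calling e.absUrl("href").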
