
Jsoup: how to extract img with space in filename?

I am trying to extract img elements using Jsoup. It works fine for images without any space in the filename, but if the filename contains whitespace it extracts only the first part.

I tried the following:

String result = Jsoup.clean(content, "https://rally1.rallydev.com/", Whitelist.relaxed().preserveRelativeLinks(true), new Document.OutputSettings().prettyPrint(false));
Document doc = Jsoup.parse(result);
Elements images = doc.select("img");

Example HTML content:

Description:<div>some text content<br /></div> 
<div><img src=/slm/attachment/43647556403/My file with space.png /></div>
<div><img src=/slm/attachment/43648152373/my_file_without_space.png/></div>

The content of result is:

Description:Some text content<br> <img src="/slm/attachment/43647556403/My"><img src="/slm/attachment/43648152373/my_file_without_space.png/">

In result, the src of the image with spaces in its file name contains only the first part, "My". Everything after the first whitespace was ignored.

How can I extract the filename if it contains spaces?

The problem can't be easily solved in Jsoup, since the src attribute value of the example with spaces is actually correctly identified as only My. In this example the file, with and space.png parts are also attributes, just without values. Of course, you can use Jsoup to concatenate the attribute keys that follow the src attribute onto its value. For example, like this:

String test =""
        + "<div><img src=/slm/attachment/43647556403/My file with space.png /></div>"
        + "<div><img src=/slm/attachment/43647556403/My file with space.png name=whatever/></div>"
        + "<div><img src=/slm/attachment/43647556403/This  breaks  it.png name=whatever/></div>"
        + "<div><img src=\"/slm/attachment/43647556403/This  works.png\" name=whatever/></div>"
        + "<div><img src=/slm/attachment/43648152373/my_file_without_space.png/></div>";
Document doc = Jsoup.parse(test);
Elements imgs = doc.select("img");
for (Element img : imgs){
    Attribute src = null;
    StringBuilder newSrcVal = new StringBuilder();
    List<String> toRemove = new ArrayList<>();
    for (Attribute a : img.attributes()){
        if (a.getKey().equals("src")){
            newSrcVal.append(a.getValue());
            src = a;
        }
        else if (newSrcVal.length()>0){
            //we already found the src attribute
            if (a.getValue().isEmpty()){
                newSrcVal.append(" ").append(a.getKey());
                toRemove.add(a.getKey());
            }
            else{
                //the empty attributes, i.e. file name parts are over
                break;
            }
        }               
    }
    for (String toRemAttr : toRemove){
        img.removeAttr(toRemAttr);
    }
    //skip img elements that have no src attribute at all
    if (src != null){
        src.setValue(newSrcVal.toString());
    }
}
System.out.println(doc);

This algorithm loops over all img elements and, within each img, over its attributes. When it finds the src attribute, it keeps it for reference and starts to fill the newSrcVal buffer. All following value-less attributes are appended to newSrcVal until either another attribute with a value is found or there are no more attributes. Finally, the src attribute value is reset to the contents of newSrcVal and the formerly empty attributes are removed from the DOM.

Note that this will not work when your filename contains two or more consecutive spaces. Jsoup discards the whitespace between attributes, so you can't get it back after parsing. If you need that, you have to manipulate the input HTML before parsing.
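
As a rough illustration of that kind of preprocessing, here is a minimal sketch that wraps unquoted src values of img tags in double quotes before the HTML is handed to Jsoup. The class name QuoteImgSrc and the method quoteUnquotedSrc are made up for this example, and the regex is only a heuristic: it assumes the value ends at the next name=value attribute or at the end of the tag, and that the filename itself contains neither "=" nor ">". It uses only java.util.regex and is not a general HTML repair tool.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: quotes unquoted src values in <img> tags.
class QuoteImgSrc {

    // group 1: "<img ... src="           (only when the value is not already quoted)
    // group 2: the raw value, spaces included
    // lookahead: the value stops before the next "name=" attribute or before "/>" or ">"
    private static final Pattern UNQUOTED_SRC = Pattern.compile(
            "(<img\\b[^>]*?\\ssrc=)(?![\"'])([^>]*?)(?=\\s+[\\w-]+=|\\s*/?>)",
            Pattern.CASE_INSENSITIVE);

    static String quoteUnquotedSrc(String html) {
        // Re-emit the tag prefix and put the captured value in double quotes.
        return UNQUOTED_SRC.matcher(html).replaceAll("$1\"$2\"");
    }

    public static void main(String[] args) {
        String html = ""
                + "<div><img src=/slm/attachment/43647556403/My file with space.png /></div>"
                + "<div><img src=/slm/attachment/43647556403/This  breaks  it.png name=whatever/></div>"
                + "<div><img src=\"/slm/attachment/43647556403/This  works.png\" name=whatever/></div>";
        // The fixed string can then be passed to Jsoup.parse(...) or Jsoup.clean(...).
        System.out.println(quoteUnquotedSrc(html));
    }
}
```

Because the quoting happens on the raw string, consecutive spaces inside the filename survive, and src values that are already quoted are left untouched.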

You can do something like this:

Elements images = doc.select("img");

for (Element image : images) {
    String imgSrc = image.attr("src");
    // the method is substring (lowercase s); +1 skips the "/" itself
    imgSrc = imgSrc.substring(imgSrc.lastIndexOf("/") + 1); // this will give you name.png
}
