
Extract images using jsoup that have no img tag

I need to extract images that are referenced inside a div, where the src is not given in an img tag. I cannot use getElementById() either, since the id varies from page to page. Is there a regex I can use to extract the images from doc in such cases? Any help is appreciated.

HTML snippet:

<div 
    class="rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center" 
    data-src="/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
    data-mobile-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
    data-tablet-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
    data-desktop-rendition="/content/dam/Image.jpg.transform/default-desktop/image.jpg" 
    style="background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);">
</div>

This is far from an elegant or easy solution, but hopefully it gives you a starting point:

    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Attribute;
    import org.jsoup.nodes.Element;

    String snippet =
      "<div class=\"rendition-bg rendition-bg--alignment desktop-center-center " +
        "mobile-center-center\" " +
        "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" " +
        "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" " +
        "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" " +
        "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" " +
        "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-" +
        "mobile/image.jpg&quot;);\"></div>";

    List<String> imgAttrs =
      Jsoup.parse(snippet)
        .getElementsByTag("div")
        .stream()
        // get lists of attributes
        .map(Element::attributes)
        // flatten all attrs to single list
        .flatMap(attrs -> attrs.asList().stream())
        // keep only attributes whose value mentions a .jpg file
        .filter(attribute -> attribute.getValue() != null && attribute.getValue().contains(".jpg"))
        // map to values
        .map(Attribute::getValue)
        // replace all ".transform" with a whitespace
        .map(attrValue -> attrValue.replace(".transform", " "))
        // get url value of a "background-image" (other values pass through unchanged)
        .map(attrValue -> getUrlFromBackgroundImage(attrValue))
        // split attributes by whitespaces
        .flatMap(attrValue -> Stream.of(attrValue.split(" ")))
        .collect(Collectors.toList());

     private static String getUrlFromBackgroundImage(final String backgroundImage) {
        // the optional opening quote sits outside the capturing group, so it is not part of the result
        Pattern pattern = Pattern.compile("background-image:[ ]?url\\(['\"]?((?:.*?\\.(?:png|jpg|jpeg|gif)\\s?)*)");
        Matcher matcher = pattern.matcher(backgroundImage);
        return matcher.find() ? matcher.group(1) : backgroundImage;
     }

The contents of imgAttrs should be:

/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-desktop/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg

Not sure if that's what you need, though.
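Since the question asks about a regex, here is a minimal sketch of pulling the target out of a CSS `url(...)` value using only the standard library. It assumes the style text has already been entity-decoded (so `&quot;` has become `"`, which is what jsoup's `attr("style")` should give you). The class name `BackgroundImageUrls` and the method `extract` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackgroundImageUrls {
    // url( ... ) with an optional single or double quote around the value;
    // \1 requires the closing quote to match the opening one (or both to be absent).
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)");

    /** Returns every url(...) target found in a CSS declaration string. */
    static List<String> extract(String style) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = CSS_URL.matcher(style);
        while (matcher.find()) {
            urls.add(matcher.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String style =
                "background-image: url(\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\");";
        System.out.println(extract(style));
    }
}
```

With jsoup you would feed it something like `element.attr("style")` for each element selected with `[style*=background-image]`.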

Explanation in comments:

    Document doc = Jsoup.parse(
        "<div class=\"rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center \" "
        + "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
        + "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
        + "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
        + "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" "
        + "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);\"></div>");

    // select all elements with "data-src" attribute, but here we use only the first of them
    Map<String, String> dataAttributes = doc.select("[data-src]").first().dataset();

    // here we have all data attributes of this element:
    System.out.println(dataAttributes);

    // you can access them like this:
    System.out.println(dataAttributes.get("mobile-rendition"));
    System.out.println(dataAttributes.get("tablet-rendition"));
    System.out.println(dataAttributes.get("desktop-rendition"));

    // split on the literal ".transform" and collect the parts into a list (contains duplicates)
    List<String> urls = dataAttributes.values().stream()
            .flatMap(value -> Stream.of(value.split("\\.transform")))
            .collect(Collectors.toList());

    // if you need only unique urls use this one instead:
    //  Set<String> urls = dataAttributes.values().stream().flatMap(value -> Stream.of(value.split("\\.transform"))).collect(Collectors.toSet());
    System.out.println(urls);
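One detail worth spelling out: `String.split` takes a regular expression, so an unescaped `.` in `".transform"` matches any character. A small sketch of the same split-and-deduplicate step using only the standard library, with `Pattern.quote` so the separator is treated literally. The class name `TransformSplit` and the method `uniqueParts` are hypothetical, and the input list stands in for the values you would read from `dataset()`.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class TransformSplit {
    // Pattern.quote turns ".transform" into a literal, so "." cannot match an arbitrary character.
    private static final Pattern TRANSFORM = Pattern.compile(Pattern.quote(".transform"));

    /** Splits each value on the literal ".transform" and keeps unique parts in first-seen order. */
    static Set<String> uniqueParts(List<String> values) {
        Set<String> parts = new LinkedHashSet<>();
        for (String value : values) {
            parts.addAll(Arrays.asList(TRANSFORM.split(value)));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "/content/dam/Image.jpg.transform/default-mobile/image.jpg",
                "/content/dam/Image.jpg.transform/default-mobile/image.jpg",
                "/content/dam/Image.jpg.transform/default-desktop/image.jpg");
        System.out.println(uniqueParts(values));
    }
}
```

A `LinkedHashSet` is used here instead of `Collectors.toSet()` only so that the order of the parts is predictable.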

Looking closely at the div, we can see that it references two distinct images. The references are:

data-src=                  "/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
data-mobile-rendition=     "/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
data-tablet-rendition=     "/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
data-desktop-rendition=    "/content/dam/Image.jpg.transform/default-desktop/image.jpg" 
style="background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);"

Of these five references, four point to the same mobile image and one points to the desktop image. So if we need to extract the URLs for these two images:

data-src=                  "/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
data-desktop-rendition=    "/content/dam/Image.jpg.transform/default-desktop/image.jpg"

We can use the following code:

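        Elements els = doc.select("div.rendition-bg");
        for (Element ele : els) {
            // absUrl() resolves against the document's base URI; it returns an empty string if none is set
            System.out.println(ele.absUrl("data-src"));
            System.out.println(ele.absUrl("data-desktop-rendition"));
        }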
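The attribute values in the snippet are site-relative paths, so an absolute URL only exists once they are resolved against the address of the page. As a rough sketch of that step outside jsoup, `java.net.URI.resolve` can do it. The page address `https://example.com/products/page.html` is a placeholder, and the class name `RelativeUrlResolver` is made up for illustration.

```java
import java.net.URI;

class RelativeUrlResolver {
    /** Resolves a relative or site-relative path against the URI of the page it came from. */
    static String resolve(String pageUri, String path) {
        return URI.create(pageUri).resolve(path).toString();
    }

    public static void main(String[] args) {
        // Placeholder page address; substitute the URL the document was fetched from.
        String page = "https://example.com/products/page.html";
        System.out.println(
                resolve(page, "/content/dam/Image.jpg.transform/default-desktop/image.jpg"));
    }
}
```

In jsoup the equivalent is to give the document a base URI, for example via `Jsoup.parse(html, baseUri)` or by fetching it with `Jsoup.connect(url).get()`, so that `absUrl` has something to resolve against.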

Let me know if I have understood your requirement correctly.
