Extract images using jsoup that have no img tag
I need to extract images that are referenced inside a div but whose URLs are not given as the src of an img tag. I can't use getElementById() either, since the id varies from page to page. Is there a regex I can use to extract the images from doc in such cases? Any help is appreciated.
HTML snippet:
<div
  class="rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center"
  data-src="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
  data-mobile-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
  data-tablet-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
  data-desktop-rendition="/content/dam/Image.jpg.transform/default-desktop/image.jpg"
  style="background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);">
</div>
This is far from an elegant or easy solution, but hopefully it gives you a starting point:
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

public class Main {
    public static void main(String[] args) {
        // inner quotes of url(...) are written as &quot; so the style attribute stays well-formed
        String snippet =
            "<div class=\"rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center\" "
            + "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
            + "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
            + "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
            + "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" "
            + "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);\"></div>";
        List<String> imgAttrs =
            Jsoup.parse(snippet)
                .getElementsByTag("div")
                .stream()
                // get the attributes of each div
                .map(Element::attributes)
                // flatten all attributes into a single stream
                .flatMap(attrs -> attrs.asList().stream())
                // keep only attributes whose value mentions a .jpg file
                .filter(attribute -> attribute.getValue() != null && attribute.getValue().contains(".jpg"))
                // map to values
                .map(Attribute::getValue)
                // replace every ".transform" with a space
                .map(attrValue -> attrValue.replace(".transform", " "))
                // for a "background-image" style, keep only the url(...) part
                .map(Main::getUrlFromBackgroundImage)
                // split values on spaces
                .flatMap(attrValue -> Stream.of(attrValue.split(" ")))
                .collect(Collectors.toList());
        imgAttrs.forEach(System.out::println);
    }

    // returns the url(...) content (including a leading quote, if any), or the input unchanged if there is no match
    private static String getUrlFromBackgroundImage(final String backgroundImage) {
        Pattern pattern = Pattern.compile("background-image:[ ]?url\\((['\"]?(.*?\\.(?:png|jpg|jpeg|gif)(\\s)?)*)");
        Matcher matcher = pattern.matcher(backgroundImage);
        return matcher.find() ? matcher.group(1) : backgroundImage;
    }
}
With the snippet above, the contents of imgAttrs should be (note the leading quote kept from the style value):
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-desktop/image.jpg
"/content/dam/Image.jpg
/default-mobile/image.jpg
Not sure if that's what you need, though.
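If only the URL inside the style attribute is needed, a narrower regex may be easier to reason about. The following is a minimal sketch using only the JDK; the class name BackgroundImageUrl and the method extract are hypothetical names chosen for illustration, and the input string is assumed to be what jsoup returns from attr("style"). It does not handle URLs that themselves contain quotes or parentheses.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
final class BackgroundImageUrl {
    // Captures the content of url(...) after "background-image:",
    // with or without surrounding single or double quotes.
    private static final Pattern URL_IN_STYLE = Pattern.compile(
        "background-image\\s*:\\s*url\\(\\s*['\"]?([^'\")]+)['\"]?\\s*\\)",
        Pattern.CASE_INSENSITIVE);

    static Optional<String> extract(String style) {
        Matcher m = URL_IN_STYLE.matcher(style);
        // trim() drops whitespace left over from an unquoted "url( /a.png )"
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String style =
            "background-image: url(\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\");";
        // prints /content/dam/Image.jpg.transform/default-mobile/image.jpg
        System.out.println(extract(style).orElse("(no background image)"));
    }
}
```

Returning an Optional makes the "no background image" case explicit instead of passing the input through unchanged.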
Explanation in comments:
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// inner quotes of url(...) are written as &quot; so the style attribute stays well-formed
Document doc = Jsoup.parse(
    "<div class=\"rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center \" "
    + "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
    + "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
    + "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
    + "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" "
    + "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);\"></div>");
// select all elements with a "data-src" attribute, but here we use only the first of them
Map<String, String> dataAttributes = doc.select("[data-src]").first().dataset();
// here we have all data attributes of this element (keys have the "data-" prefix removed):
System.out.println(dataAttributes);
// you can access them like this:
System.out.println(dataAttributes.get("mobile-rendition"));
System.out.println(dataAttributes.get("tablet-rendition"));
System.out.println(dataAttributes.get("desktop-rendition"));
// split at ".transform" and create a list of urls (contains duplicates)
List<String> urls = dataAttributes.entrySet().stream().flatMap(e -> Stream.of(e.getValue().split("\\.transform")))
    .collect(Collectors.toList());
// if you need only unique urls use this one instead:
// Set<String> urls = dataAttributes.entrySet().stream().flatMap(e -> Stream.of(e.getValue().split("\\.transform"))).collect(Collectors.toSet());
System.out.println(urls);
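Collectors.toSet() makes no guarantee about iteration order. If the order in which the URLs first appear matters, a LinkedHashSet keeps it. Here is a small sketch of that variation using only the JDK; RenditionUrls and uniqueParts are hypothetical names, and the map is filled by hand to stand in for what dataset() would return for the div above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
final class RenditionUrls {
    // Splits every value at the literal ".transform" and keeps each
    // distinct part once, in order of first appearance.
    static Set<String> uniqueParts(Map<String, String> dataAttributes) {
        Set<String> parts = new LinkedHashSet<>();
        for (String value : dataAttributes.values()) {
            for (String part : value.split(Pattern.quote(".transform"))) {
                if (!part.isEmpty()) {
                    parts.add(part);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Stand-in for element.dataset(); keys have no "data-" prefix.
        Map<String, String> data = new LinkedHashMap<>();
        data.put("src", "/content/dam/Image.jpg.transform/default-mobile/image.jpg");
        data.put("mobile-rendition", "/content/dam/Image.jpg.transform/default-mobile/image.jpg");
        data.put("tablet-rendition", "/content/dam/Image.jpg.transform/default-mobile/image.jpg");
        data.put("desktop-rendition", "/content/dam/Image.jpg.transform/default-desktop/image.jpg");
        // prints [/content/dam/Image.jpg, /default-mobile/image.jpg, /default-desktop/image.jpg]
        System.out.println(uniqueParts(data));
    }
}
```

Pattern.quote avoids having to escape the dot by hand, since String.split treats its argument as a regex.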
Looking at the div closely, we can see that two distinct images are referenced, across five attributes:
data-src="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-mobile-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-tablet-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-desktop-rendition="/content/dam/Image.jpg.transform/default-desktop/image.jpg"
style="background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);"
Of these five references, four point to the same (mobile) image, while the remaining one points to the desktop image. So if we need to extract the URLs for these two images:
data-src="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-desktop-rendition="/content/dam/Image.jpg.transform/default-desktop/image.jpg"
We can use the following code:
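Elements els = doc.select("div.rendition-bg");
for (Element ele : els) {
    // absUrl() resolves the value against the document's base URI and returns
    // an empty string if there is none; use ele.attr(...) for the raw value
    System.out.println(ele.absUrl("data-src"));
    System.out.println(ele.absUrl("data-desktop-rendition"));
}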
Let me know if I have understood your requirement correctly.
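Because the attribute values in the snippet are site-relative paths, turning them into full URLs requires a base address. jsoup's absUrl() does this when the document has a base URI; the sketch below shows the same kind of resolution with the JDK's java.net.URI, purely for illustration. The class name ResolveUrl and the host https://example.com are assumptions, not taken from the question.

```java
import java.net.URI;

// Hypothetical helper illustrating relative-URL resolution.
final class ResolveUrl {
    // Resolves a (possibly relative) reference against a base URI.
    static String resolve(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // example.com is a placeholder for the page the div came from
        String base = "https://example.com/products/page.html";
        String path = "/content/dam/Image.jpg.transform/default-desktop/image.jpg";
        // prints https://example.com/content/dam/Image.jpg.transform/default-desktop/image.jpg
        System.out.println(resolve(base, path));
    }
}
```

With jsoup itself, passing the page address as the second argument of Jsoup.parse(html, baseUri), or fetching the page through Jsoup.connect(url).get(), gives absUrl() the base it needs.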