
Extract All Images From HTML Using JAVA

I want to get a list of all image URLs from the HTML source of a webpage (both absolute and relative URLs). I used Jsoup to parse the HTML, but it's not giving me all the images. For example, when I parse the google.com HTML source it shows zero images. In the google.com HTML source, image links appear in this form:

"background:url(/intl/en_com/images/srpr/logo1w.png)

And in rediff.com the image links appear in this form:

videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bappi-da-the-first-indian-in-grammy-jury/2684982","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/v3np2zgbla4vdccf.D.0.bappi.jpg","Bappi Da - the first Indian In Grammy jury","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:33)");
j = 1
videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bebo-shahid-jab-they-met-again-/2681664","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/ra8p9eeig8zy5qvd.D.0.They-Met-Again.jpg","Bebo-Shahid : Jab they met again!","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:17)");

Not all of the images are within "img" tags. I also want to extract images that are not inside "img" tags, as shown in the HTML source above.

How can I do this? Please help me with this. Thanks.

This is going to be a bit difficult, I think. You basically need a library that will download a web page, construct the page's DOM, and execute any JavaScript that may alter the DOM. After all that is done, you have to extract all the possible images from the DOM. Another possible option is to intercept all the calls the library makes to download resources, examine each URL, and if the URL points to an image, record that URL.

My suggestion would be to start by playing with HtmlUnit (http://htmlunit.sourceforge.net/gettingStarted.html). It does a good job of building the DOM. I'm not sure what types of hooks it has for intercepting the methods that download resources. Of course, if it doesn't provide the hooks, you can always use AspectJ or simply modify the HtmlUnit source code. Good luck; this sounds like a reasonably interesting problem. You should post your solution when you figure it out.
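To make the interception idea concrete: HtmlUnit provides a WebConnectionWrapper class that can sit in front of its resource-loading layer. Since HtmlUnit is a third-party dependency, the sketch below keeps that part in comments and shows only the plain-Java URL classifier; the class name, method name, and extension list are all illustrative assumptions, not something from the answer above.

```java
import java.util.Locale;

public class ImageRecorder {
    // Decides whether a requested URL looks like an image, by file extension.
    // The query string and fragment are stripped first, so "logo.png?v=2"
    // is still recognized. The extension list here is an assumption; widen it as needed.
    public static boolean isImageUrl(String url) {
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        return path.endsWith(".jpg") || path.endsWith(".jpeg")
            || path.endsWith(".png") || path.endsWith(".gif");
    }

    // With HtmlUnit on the classpath (not assumed here), the hook would look
    // roughly like this -- WebConnectionWrapper installs itself on the client
    // and sees every resource request, including ones triggered by JavaScript:
    //
    //   WebClient client = new WebClient();
    //   new WebConnectionWrapper(client) {
    //       @Override
    //       public WebResponse getResponse(WebRequest request) throws IOException {
    //           String url = request.getUrl().toString();
    //           if (isImageUrl(url)) {
    //               recordedImages.add(url);  // your own collection
    //           }
    //           return super.getResponse(request);
    //       }
    //   };
    //   client.getPage("http://www.google.com");  // JS runs, image requests get recorded
}
```

This catches CSS background images and JS-inserted images too, because the browser engine (not a parser) decides what gets fetched.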

If you just want every image referred to in the page, can't you just scan the HTML and any linked JavaScript or CSS with a simple regex? How likely is it that a match for [-:_./%a-zA-Z0-9]*(.jpg|.png|.gif) in the HTML/JS/CSS is not an image? I'd guess not very likely. And you should be allowing for broken links anyway.
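A rough sketch of this regex approach, using the character class suggested above (the class name and the non-capturing extension group are my own additions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageUrlScanner {
    // Same character class as suggested above; dots are literal, extensions grouped.
    private static final Pattern IMAGE_URL =
        Pattern.compile("[-:_./%a-zA-Z0-9]*\\.(?:jpg|png|gif)");

    // Scans raw HTML/JS/CSS text and returns every substring that looks like
    // an image URL, whether it came from an img tag, a CSS url(...) value,
    // or a JavaScript string literal.
    public static List<String> scan(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMAGE_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Run against the two snippets quoted in the question, this picks up both the CSS background URL (/intl/en_com/images/srpr/logo1w.png) and the thumbnail URLs inside the rediff.com JavaScript arrays, since quotes and parentheses fall outside the character class and delimit the matches naturally.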

Karthik's suggestion would be more correct, but I imagine it's more important to you to just get absolutely everything and filter out uninteresting images.
