Get all images from a web page program | Java

Question

Currently I need a program that, given a URL, returns a list of all the images on the webpage.

i.e.:

logo.png gallery1.jpg test.gif

Is there any open source software available before I try to code something?

The language should be Java. Thanks, Philip

Answer 1

Just use a simple HTML parser, like jTidy, get all elements by tag name img, and then collect the src attribute of each in a List<String> or maybe a List<URI>.

You can obtain an InputStream for a URL using URL#openStream() and then feed it to any HTML parser you like. Here's a kickoff example:

InputStream input = new URL("http://www.stackoverflow.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");
List<String> srcs = new ArrayList<String>();

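for (int i = 0; i < imgs.getLength(); i++) {
    // An img element without a src attribute would otherwise cause a NullPointerException.
    org.w3c.dom.Node src = imgs.item(i).getAttributes().getNamedItem("src");
    if (src != null) {
        srcs.add(src.getNodeValue());
    }
}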

for (String src: srcs) {
    System.out.println(src);
}
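
The src values collected above are often relative (logo.png, /img/gallery1.jpg), so if you want the List<URI> variant you probably need to resolve them against the page URL first. A minimal sketch using only java.net.URI; the class name ImageUrlResolver, the method resolveAll and the example.com URLs are made up for illustration and are not part of jTidy:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of jTidy or any other library.
class ImageUrlResolver {

    // Resolves each raw src value against the URL of the page it was found on.
    static List<URI> resolveAll(String pageUrl, List<String> srcs) {
        URI base = URI.create(pageUrl);
        List<URI> resolved = new ArrayList<>();
        for (String src : srcs) {
            if (src == null || src.trim().isEmpty()) {
                continue; // skip img elements without a usable src
            }
            try {
                resolved.add(base.resolve(src.trim()));
            } catch (IllegalArgumentException e) {
                // src is not a syntactically valid URI (for example it contains spaces); skip it
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> srcs = Arrays.asList(
                "logo.png", "/img/gallery1.jpg", "http://cdn.example.com/test.gif");
        for (URI uri : resolveAll("http://www.example.com/photos/index.html", srcs)) {
            System.out.println(uri);
        }
    }
}
```

Relative values are resolved against the directory of the page, root-relative values against the host, and absolute URLs are left as they are.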

I must admit, however, that HtmlUnit as suggested by Bozho indeed looks better.

Answer 2

HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you.

(Read the short Get started guide to see how to obtain the correct HtmlPage object.)

Answer 3

This is dead simple with HTML Parser (and any other decent HTML parser):

Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));

for ( SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}

Answer 4

You can use wget, which has a lot of options available.

Or google for java wget ...

Answer 5

You can parse the HTML and collect all SRC attributes of IMG elements in a Collection. Then download each resource from its URL and write it to a file. For parsing there are several HTML parsers available; Cobra is one of them.
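
The download step could look roughly like this with the standard library alone. This is only a sketch: ImageDownloader, fileNameFor, download and the "image" fallback name are hypothetical names chosen for illustration, and it assumes every image URL is already absolute.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; the names are not taken from the answer.
class ImageDownloader {

    // Derives a local file name from the last path segment of the image URL.
    static String fileNameFor(URI imageUri) {
        String path = imageUri.getPath();
        if (path == null || path.isEmpty() || path.endsWith("/")) {
            return "image"; // assumed fallback when the URL carries no file name
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Streams the resource at imageUri into targetDir and returns the written file.
    // imageUri must be absolute, otherwise toURL() throws IllegalArgumentException.
    static Path download(URI imageUri, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileNameFor(imageUri));
        try (InputStream in = imageUri.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ImageDownloader <absolute image URL> <output directory>
        if (args.length == 2) {
            System.out.println(download(URI.create(args[0]), Paths.get(args[1])));
        }
    }
}
```

Note that two images with the same file name overwrite each other here; a real program would probably want to make the names unique.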

Answer 6

With Open Graph tags and HTML Parser (the library used in Answer 3), you can extract your data really easily (PageMeta is a simple POJO holding the results):

    Parser parser = new Parser(url);

    PageMeta pageMeta = new PageMeta();
    pageMeta.setUrl(url);

    NodeList meta = parser.parse(new TagNameFilter("meta"));
    for (SimpleNodeIterator iterator = meta.elements(); iterator.hasMoreNodes(); ) {
        Tag tag = (Tag) iterator.nextNode();

        if ("og:image".equals(tag.getAttribute("property"))) {
            pageMeta.setImageUrl(tag.getAttribute("content"));
        }

        if ("og:title".equals(tag.getAttribute("property"))) {
            pageMeta.setTitle(tag.getAttribute("content"));
        }

        if ("og:description".equals(tag.getAttribute("property"))) {
            pageMeta.setDescription(tag.getAttribute("content"));
        }
    }
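
PageMeta itself is not shown in the answer. A minimal sketch of what it might look like, assuming it holds only the four values set above; the field names, the getters and toString are guesses added for illustration:

```java
// Hypothetical reconstruction: only the four setters appear in the answer.
class PageMeta {
    private String url;
    private String imageUrl;
    private String title;
    private String description;

    public void setUrl(String url) { this.url = url; }
    public void setImageUrl(String imageUrl) { this.imageUrl = imageUrl; }
    public void setTitle(String title) { this.title = title; }
    public void setDescription(String description) { this.description = description; }

    // Getters are an assumption; the answer never reads the values back.
    public String getUrl() { return url; }
    public String getImageUrl() { return imageUrl; }
    public String getTitle() { return title; }
    public String getDescription() { return description; }

    @Override
    public String toString() {
        return "PageMeta{url=" + url + ", imageUrl=" + imageUrl
                + ", title=" + title + ", description=" + description + "}";
    }
}
```

Fields that have no matching og: tag on the page simply stay null.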

Answer 7

You can simply use a regular expression in Java:

 <html> <body> <p> <img src="38220.png" alt="test" title="test" /> <img src="32222.png" alt="test" title="test" /> </p> </body> </html>

    String s = "html";  // replace with the HTML content above
    // Capture group 1 holds the value of the src attribute.
    Pattern p = Pattern.compile("<img [^>]*src=[\"']([^\"']*)");
    Matcher m = p.matcher(s);
    while (m.find()) {
        // group(1) is the src value, so no manual substring arithmetic is needed
        System.out.println(m.group(1));
    }

7 answers

Answer 1: score 12, posted 2010-01-31 18:21:56
Answer 2: score 10, accepted, posted 2010-01-31 18:23:24
Answer 3: score 4, posted 2010-01-31 18:52:45
Answer 4: score 0, posted 2010-01-31 18:21:08
Answer 5: score 0, posted 2010-01-31 18:24:08
Answer 6: score 0, posted 2016-05-09 03:52:12
Answer 7: score 0, posted 2018-02-08 08:04:40
