
Get all Images from WebPage Program | Java

Currently I need a program that, given a URL, returns a list of all the images on the webpage.

i.e.:

logo.png
gallery1.jpg
test.gif

Is there any open source software available before I try to code something?

The language should be Java. Thanks, Philip

Just use a simple HTML parser, like jTidy, then get all elements by tag name img and collect the src attribute of each in a List<String> or maybe a List<URI>.

You can obtain an InputStream for a URL using URL#openStream() and then feed it to any HTML parser you like. Here's a kickoff example:

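// Requires jTidy (org.w3c.tidy.Tidy) plus java.io.InputStream, java.net.URL,
// java.util.ArrayList, java.util.List and org.w3c.dom.Document, Node, NodeList.
List<String> srcs = new ArrayList<String>();

// try-with-resources closes the stream once the page has been parsed.
try (InputStream input = new URL("http://www.stackoverflow.com").openStream()) {
    Document document = new Tidy().parseDOM(input, null);
    NodeList imgs = document.getElementsByTagName("img");

    for (int i = 0; i < imgs.getLength(); i++) {
        // An <img> without a src attribute yields null here, so guard against it.
        Node src = imgs.item(i).getAttributes().getNamedItem("src");
        if (src != null) {
            srcs.add(src.getNodeValue());
        }
    }
}

for (String src : srcs) {
    System.out.println(src);
}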
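The src values collected this way are often relative to the page (for example logo.png or /img/gallery1.jpg). If a List<URI> is preferred, as suggested above, one possible sketch using only the JDK is shown below. The class name ImageUrlResolver, the method resolveAll, and the example.com addresses are illustrative assumptions, not part of jTidy or of the original answer.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns raw src attribute values into absolute URIs.
final class ImageUrlResolver {

    // Resolves each src value against the URL of the page it came from.
    // Blank values and values that are not valid URIs are skipped.
    static List<URI> resolveAll(URI pageUrl, List<String> srcs) {
        List<URI> resolved = new ArrayList<>();
        for (String src : srcs) {
            if (src == null || src.trim().isEmpty()) {
                continue;
            }
            try {
                resolved.add(pageUrl.resolve(src.trim()));
            } catch (IllegalArgumentException e) {
                // URI.resolve(String) throws this for malformed input; skip it.
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Example page address; an assumption for illustration only.
        URI page = URI.create("http://www.example.com/gallery/index.html");
        List<String> srcs = Arrays.asList(
                "logo.png", "/img/gallery1.jpg", "http://cdn.example.com/test.gif");
        for (URI uri : resolveAll(page, srcs)) {
            System.out.println(uri);
        }
    }
}
```

Relative values are resolved against the page URL, and values that are already absolute pass through unchanged.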

I must admit, however, that HtmlUnit, as suggested by Bozho, indeed looks better.

HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you.

(Read the short Get started guide to see how to obtain the correct HtmlPage object.)

This is dead simple with HTML Parser (and any other decent HTML parser):

Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));

for ( SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}

You can use wget, which has a lot of options available.

Or google for java wget ...

You can parse the HTML and collect all SRC attributes of IMG elements in a Collection. Then download each resource from each URL and write it to a file. Several HTML parsers are available for the parsing step; Cobra is one of them.
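The download step described above can be sketched with the JDK alone. This is a minimal illustration under stated assumptions: the class name ImageDownloader and its methods are hypothetical, the local file name is simply the last path segment of the URI, and an existing file with the same name is overwritten.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: saves one image URI into a local directory.
final class ImageDownloader {

    // Derives a local file name from the last path segment of the URI.
    // Falls back to "image" when the URI has no usable path.
    static String fileNameOf(URI image) {
        String path = image.getPath();
        if (path == null || path.isEmpty() || path.endsWith("/")) {
            return "image";
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Streams the resource to targetDir and returns the path that was written.
    static Path download(URI image, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileNameOf(image));
        try (InputStream in = image.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ImageDownloader <image-url> <target-dir>");
            return;
        }
        System.out.println(download(URI.create(args[0]), Paths.get(args[1])));
    }
}
```

Calling download once per collected URI covers the "write it to a file" part; two images that share a file name would need a different naming scheme than the one assumed here.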

With Open Graph tags and an HTML parser, you can extract your data really easily. The code below uses the same HTML Parser classes (Parser, TagNameFilter) as the earlier example, and PageMeta is a simple POJO holding the results:

    Parser parser = new Parser(url);

    PageMeta pageMeta = new PageMeta();
    pageMeta.setUrl(url);

    NodeList meta = parser.parse(new TagNameFilter("meta"));
    for (SimpleNodeIterator iterator = meta.elements(); iterator.hasMoreNodes(); ) {
        Tag tag = (Tag) iterator.nextNode();

        if ("og:image".equals(tag.getAttribute("property"))) {
            pageMeta.setImageUrl(tag.getAttribute("content"));
        }

        if ("og:title".equals(tag.getAttribute("property"))) {
            pageMeta.setTitle(tag.getAttribute("content"));
        }

        if ("og:description".equals(tag.getAttribute("property"))) {
            pageMeta.setDescription(tag.getAttribute("content"));
        }
    }
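The snippet above relies on a PageMeta class that the answer describes only as a simple POJO. A minimal sketch of what it might look like follows; the field names are assumptions inferred from the setters the snippet calls, and the getters and toString are additions for illustration.

```java
// Hypothetical POJO matching the setters used in the Open Graph snippet.
class PageMeta {
    private String url;
    private String imageUrl;
    private String title;
    private String description;

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public String getImageUrl() { return imageUrl; }
    public void setImageUrl(String imageUrl) { this.imageUrl = imageUrl; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getDescription() { return description; }
    public void setDescription(String description) { this.description = description; }

    @Override
    public String toString() {
        return "PageMeta{url=" + url + ", imageUrl=" + imageUrl
                + ", title=" + title + ", description=" + description + "}";
    }
}
```

Fields stay null when the page does not declare the corresponding og: tag, so callers should check for null before using them.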

You can simply use a regular expression in Java:

 <html> <body> <p> <img src="38220.png" alt="test" title="test" /> <img src="32222.png" alt="test" title="test" /> </p> </body> </html> 

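    // Requires java.util.regex.Matcher and java.util.regex.Pattern.
    String s = "html";  // placeholder: substitute the HTML content shown above
    // Group 1 captures the quoted value of the src attribute inside an <img> tag.
    Pattern p = Pattern.compile("<img\\s+[^>]*src=[\"']([^\"']*)", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(s);
    while (m.find()) {
        // Read the capture group directly instead of slicing the whole match.
        String src = m.group(1);
        System.out.println(src);
    }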
