Parsing HTML in a Web crawler

Question

Further to my earlier question, Extending a basic web crawler to filter status codes and HTML, I'm trying to extract information from HTML tags, in this case "title", with the following method:

public static void parsePage() throws IOException, BadLocationException 
{
    HTMLEditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    Reader HTMLReader = new InputStreamReader(testURL.openConnection()
            .getInputStream());
    kit.read(HTMLReader, doc, 0);

    // Create an iterator for all HTML tags.
    ElementIterator it = new ElementIterator(doc);
    Element elem;

    while ((elem = it.next()) != null) 
    {
        if (elem.getName().equals("title")) 
        {
            System.out.println("found title tag");
        }
    }
}

This works as far as telling me it has found the tags. What I'm struggling with is how to extract the information contained within or after them.

I found this question on the site: Help with Java Swing HTML parsing; however, it states that it will only work with well-formed HTML. I was hoping there is another way.

Any pointers appreciated.

Answer 1

Try using Jodd:

Jerry jerry = jerry().enableHtmlMode().parse(html);
...

Or HtmlParser:

Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("title");
NodeList nodes = parser.parse(cssFilter);

Answer 2

It turns out that changing the method to this produces the desired result:

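public static void parsePage() throws IOException, BadLocationException
{
    HTMLEditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    // Ignore any charset declared in the page so parsing does not abort.
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    Reader HTMLReader = new InputStreamReader(testURL.openConnection().getInputStream());
    kit.read(HTMLReader, doc, 0);
    // The parser stores the <title> text as a document property.
    String title = (String) doc.getProperty(Document.TitleProperty);
    System.out.println(title);
}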
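
For anyone who wants to try this without a live URL, here is a minimal self-contained sketch of the same idea. It assumes the HTML is already available as a string; the class name `TitleDemo`, the helper `readTitle`, and the sample markup are made up for illustration and are not part of the original crawler.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class TitleDemo {
    // Hypothetical helper: parses HTML from any Reader and returns the
    // document's title property, or null if the parser did not set one.
    public static String readTitle(Reader in) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Same setting as in the answer: do not abort on a charset directive.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(in, doc, 0);
        return (String) doc.getProperty(Document.TitleProperty);
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption); the <p> is deliberately left unclosed.
        String html = "<html><head><title>Example page</title></head>"
                + "<body><p>unclosed paragraph</body></html>";
        try (Reader in = new StringReader(html)) {
            System.out.println(readTitle(in));
        }
    }
}
```

Taking a `Reader` rather than a URL keeps the parsing step separate from the network step, so the crawler can pass in an `InputStreamReader` over the connection's stream exactly as the method above does.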

I think I was off on a wild goose chase with the iterator/element stuff.
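
If the text inside other tags is needed as well, one alternative that stays within the JDK is to skip the document model and listen to parser events directly. The following is a rough sketch using `HTMLEditorKit.ParserCallback` with `ParserDelegator`; the names `TagTextDemo`, `TitleCallback`, and `extractTitle` are hypothetical, and it only tracks `title`, so treat it as a starting point rather than a finished extractor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TagTextDemo {
    // Hypothetical callback: collects the text seen between <title> and </title>.
    static class TitleCallback extends HTMLEditorKit.ParserCallback {
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = true;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = false;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Only keep text reported while the title element is open.
            if (inTitle) {
                text.append(data);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    // Runs the Swing HTML parser over the input and returns the collected title text.
    public static String extractTitle(Reader in) throws IOException {
        TitleCallback callback = new TitleCallback();
        // The third argument tells the parser to ignore charset directives.
        new ParserDelegator().parse(in, callback, true);
        return callback.getText();
    }

    public static void main(String[] args) throws IOException {
        // Sample input (an assumption) standing in for a fetched page.
        String html = "<html><head><title>Example page</title></head>"
                + "<body><p>Some body text</p></body></html>";
        System.out.println(extractTitle(new StringReader(html)));
    }
}
```

To collect text from a different element, compare against another `HTML.Tag` constant in the two tag handlers.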

Answer 1: score 3, posted 2012-07-14 21:24:02
Answer 2: score 1, accepted, posted 2012-07-14 21:57:23
