Parsing HTML with "unclosed tags" in Java

Question

My question is quite simple: is there a way to parse HTML in Java into a DOM Document if the HTML content contains tags like this img tag?

<p><img src="..."></p>

This is the code snippet that gives me a SAXException while parsing these elements:

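DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();

// htmlcontent is the HTML string containing the unclosed img tag
InputStream is = new ByteArrayInputStream(htmlcontent.getBytes(StandardCharsets.UTF_8));
Document dom = db.parse(is); // the SAXException is thrown here
is.close();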

Answer 1

I don't think so, but jsoup can do that. It's not the DOM API, but it's quite similar.

Answer 2

You cannot use the DocumentBuilder because it is an XML parser.

Instead, you need an HTML parser such as:

Jericho HTML Parser
Neko HTML Parser
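
To see why DocumentBuilder is the wrong tool here, the following minimal sketch feeds the markup from the question to the standard JAXP parser and reports whether it is accepted as well-formed XML. The class name UnclosedTagCheck and the method isWellFormedXml are hypothetical names chosen for illustration; only JDK classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class UnclosedTagCheck {

    /** Returns true if the JAXP XML parser accepts the markup as well-formed XML. */
    static boolean isWellFormedXml(String markup) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Replace the default error handler so parse errors are not printed to stderr.
        db.setErrorHandler(new DefaultHandler());
        try (InputStream is = new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8))) {
            db.parse(is);
            return true;
        } catch (SAXException e) {
            // An unclosed element such as <img ...> is a fatal well-formedness error in XML.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // The markup from the question: the img element is never closed.
        System.out.println(isWellFormedXml("<p><img src=\"...\"></p>")); // prints false
    }
}
```

An XML parser has to stop at the first well-formedness error, so there is no setting on DocumentBuilder that would make it tolerate the missing end tag. A lenient HTML parser is needed instead.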

Answer 3

One of these may help:

Answer 4

HTML isn't XML.

Except when you're using XHTML.

So there is no reason an XML parser should parse your HTML.

Use an HTML parser like HtmlCleaner.
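
The XHTML exception mentioned above can be checked with the JDK alone. This is a minimal sketch, assuming a bare fragment without a DOCTYPE in which the img element is self-closed. The class name XhtmlDomDemo, the method firstImageSrc, and the file name logo.png are hypothetical and used only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class XhtmlDomDemo {

    /** Parses well-formed XHTML and returns the src attribute of the first img element. */
    static String firstImageSrc(String xhtml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = db.parse(new InputSource(new StringReader(xhtml)));
        // Assumes at least one img element is present in the input.
        Element img = (Element) dom.getElementsByTagName("img").item(0);
        return img.getAttribute("src");
    }

    public static void main(String[] args) throws Exception {
        // Same structure as the markup in the question, but with the img element closed.
        System.out.println(firstImageSrc("<p><img src=\"logo.png\" /></p>")); // prints logo.png
    }
}
```

This only works when the markup is guaranteed to be well-formed. For HTML from arbitrary sources, the HTML parsers named in the answers remain the safer choice.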

4 answers

Answer 1: score 3, accepted, 2012-07-12 14:47:10
Answer 2: score 1, 2012-07-12 14:46:53
Answer 3: score 1, 2012-07-12 15:06:21
Answer 4: score 0, 2012-07-12 14:47:12
