
Parsing HTML with "unclosed tags" in Java

My question is quite simple: is there a way to parse HTML in Java into a DOM Document if the HTML content contains tags like this img tag?

<p><img src="..."></p>

This is the code snippet that throws a SAXException while parsing these elements:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();

InputStream is = new ByteArrayInputStream( htmlcontent.getBytes());
Document dom = db.parse(is);
is.close();

I don't think so, but jsoup can do that. It's not the DOM API, but it's quite similar.
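As a rough illustration of the jsoup route, the sketch below parses the img snippet from the question and then converts the result to an org.w3c.dom.Document using jsoup's W3CDom helper. It assumes the jsoup library is on the classpath; the class name JsoupDomDemo and the method name toW3cDom are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

class JsoupDomDemo {

    /** Parses lenient HTML with jsoup, then converts it to a W3C DOM Document. */
    static org.w3c.dom.Document toW3cDom(String html) {
        // jsoup tolerates unclosed tags such as <img> and builds a full html/head/body tree.
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
        // W3CDom bridges jsoup's own node model to the standard DOM API.
        return new W3CDom().fromJsoup(jsoupDoc);
    }

    public static void main(String[] args) {
        org.w3c.dom.Document dom = toW3cDom("<p><img src=\"...\"></p>");
        // Print how many img elements ended up in the DOM.
        System.out.println(dom.getElementsByTagName("img").getLength());
    }
}
```

If a standard DOM Document is not strictly required, working directly with the jsoup Document (for example via its select method) avoids the conversion step.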

You cannot use the DocumentBuilder because it is an XML parser.

Instead, you need an HTML parser like:

HTML isn't XML.

Except when you're using XHTML.

So there is no reason an XML parser should parse your HTML.

Use an HTML parser like HtmlCleaner.
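To make the HTML versus XHTML point above concrete, here is a small sketch that uses only the JDK. It feeds the question's snippet to DocumentBuilder twice: once as written, and once with the img tag self-closed in XHTML style. The class name XmlVsHtmlDemo and the helper parsesAsXml are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XmlVsHtmlDemo {

    /** Returns true if the JDK's XML parser accepts the markup, false if it is not well-formed. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            db.setErrorHandler(new DefaultHandler());
            InputStream is = new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8));
            db.parse(is);
            return true;
        } catch (SAXException e) {
            // Thrown when the input is not well-formed XML, e.g. an unclosed <img>.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsXml("<p><img src=\"...\"></p>"));  // prints false
        System.out.println(parsesAsXml("<p><img src=\"...\"/></p>")); // prints true
    }
}
```

The only difference between the two inputs is the trailing slash in the img tag, which is what makes the second one well-formed XML.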
