Parsing HTML with “unclosed tags” in Java
My question is quite simple: is there a way to parse HTML in Java into a DOM Document when the HTML content contains tags like this img tag?
<p><img src="..."></p>
This is the code snippet that throws a SAXException while parsing these elements:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputStream is = new ByteArrayInputStream( htmlcontent.getBytes());
Document dom = db.parse(is);
is.close();
You cannot use DocumentBuilder because it is an XML parser.
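To make the failure concrete, here is a minimal sketch using only the JDK. The class name StrictXmlDemo and the helper isWellFormedXml are made up for illustration and are not part of the question's code. It feeds the same kind of markup to DocumentBuilder and reports whether the XML parser accepts it:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class, not taken from the original post.
class StrictXmlDemo {

    // Returns true only if the JDK XML parser accepts the markup as well-formed XML.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            db.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed XML, e.g. an <img> without an end tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory stream and a default factory.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("<p><img src=\"...\"></p>"));       // false
        System.out.println(isWellFormedXml("<p><img src=\"...\"></img></p>")); // true
    }
}
```

The default error handler also prints a "[Fatal Error]" line to stderr for the rejected input. The point is only that an XML parser treats every start tag as needing a matching end tag, which ordinary HTML does not guarantee.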
Instead, you need an HTML parser.
HTML isn't XML, except when you're using XHTML.
So there is no reason an XML parser should be able to parse your HTML.
Use an HTML parser like HtmlCleaner.
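The XHTML exception can be illustrated with the JDK alone. The sketch below assumes the markup is already XHTML-style (the img element is self-closed) and carries no DOCTYPE; the class XhtmlDomDemo and the helper firstImgSrc are hypothetical names chosen for this example. It does not help with ordinary HTML, for which an HTML parser is still needed:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical demo class, not taken from the original post.
class XhtmlDomDemo {

    // Parses well-formed XHTML into a DOM Document and returns the src of the first <img>,
    // or null if the markup contains no <img> element.
    static String firstImgSrc(String xhtml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Element img = (Element) dom.getElementsByTagName("img").item(0);
            return img == null ? null : img.getAttribute("src");
        } catch (SAXException | ParserConfigurationException | IOException e) {
            // Anything that is not well-formed XML ends up here.
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // The self-closing "/>" is what makes this fragment acceptable to an XML parser.
        System.out.println(firstImgSrc("<p><img src=\"logo.png\" /></p>")); // logo.png
    }
}
```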