Parsing HTML Element by Element in Java
I have an HTML file:
<div> DIV1 <div> DIV2 <div> DIV3 </div> </div> </div>
I want to parse that HTML. However, I don't want to get the whole parsed HTML as a single string:
DIV1 DIV2 DIV3
I would like to get the values element by element, with none of them duplicated. I mean I don't want this:
When you get the first div's value, it is:
DIV1 DIV2 DIV3
The second div's value:
DIV2 DIV3
The third div's value:
DIV3
The result that I don't want is:
DIV1 DIV2 DIV3
DIV2 DIV3
DIV3
I want this result:
DIV1
DIV2
DIV3
I will apply some processing to each value, so I don't want duplicated values. I want to use a Java parser to solve my problem. I've considered using Jsoup, but when I ask an element for its text I get the text of everything nested inside it as well.
It sounds like you want to do a pre-order depth-first traversal over all the text nodes in an HTML document. Luckily, most parsing libraries, including XML ones, will give you all the nodes in pre-order as an iterator.
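To make the pre-order idea concrete, here is a minimal sketch that uses only the XML DOM parser bundled with the JDK, not Jericho. It works here only because the sample markup happens to be well-formed XML; real-world HTML generally is not, which is why an HTML parser is the better tool. The class name TextNodeWalker and the method collectTextNodes are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; assumes the input is well-formed XML.
final class TextNodeWalker {

    // Returns the trimmed content of every non-blank text node, in document order.
    static List<String> collectTextNodes(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            List<String> texts = new ArrayList<>();
            walk(doc.getDocumentElement(), texts);
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup could not be parsed as XML", e);
        }
    }

    // Pre-order depth-first traversal: handle the node itself, then its children left to right.
    private static void walk(Node node, List<String> texts) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                texts.add(text); // each text node is visited exactly once, so nothing is duplicated
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, texts);
        }
    }

    public static void main(String[] args) {
        String html = "<div> DIV1 <div> DIV2 <div> DIV3 </div> </div> </div>";
        collectTextNodes(html).forEach(System.out::println);
    }
}
```

Because the walk collects text nodes rather than asking each element for its text, the text nested inside a child div is never repeated for its parent.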
I recommend you use Jericho and call getNodeIterator(), then check whether each node is a text node and, if it is, print it out. Note that the link has example code, but I will paste it here for your convenience:
for (Iterator<Segment> nodeIterator = segment.getNodeIterator(); nodeIterator.hasNext();) {
    Segment nodeSegment = nodeIterator.next();
    if (nodeSegment instanceof Tag) {
        Tag tag = (Tag) nodeSegment;
        // HANDLE TAG
        // Uncomment the following line to ensure each tag is valid XML:
        // writer.write(tag.tidy()); continue;
    } else if (nodeSegment instanceof CharacterReference) {
        CharacterReference characterReference = (CharacterReference) nodeSegment;
        // HANDLE CHARACTER REFERENCE
        // Uncomment the following line to decode all character references instead of copying them verbatim:
        // characterReference.appendCharTo(writer); continue;
    } else {
        // HANDLE PLAIN TEXT
    }
    // unless specific handling has prevented getting to here, simply output the segment as is:
    // writer.write(nodeSegment.toString());
}
The // HANDLE CHARACTER REFERENCE and // HANDLE PLAIN TEXT branches are where you want to add your string-appending code.