
Parsing HTML Element by Element in Java

I have an HTML file:

 <div> DIV1 <div> DIV2 <div> DIV3 </div> </div> </div> 

I want to parse that HTML. However, I don't want to get the whole parsed HTML back as a single string:

DIV1 DIV2 DIV3

I would like to get the values element by element, with none of them duplicated. In other words, I don't want the following:

When you get the first div's value, it is:

DIV1 DIV2 DIV3

The second div's value:

DIV2 DIV3

The third div's value:

DIV3

The result that I don't want is:

DIV1 DIV2 DIV3
DIV2 DIV3
DIV3

The result I want is:

DIV1
DIV2
DIV3

I will apply some processing to each value, and I don't want duplicated values there either. I want to use a Java parser to solve my problem. I've considered using Jsoup, but you get the entire HTML parsed at once when you use it.

It sounds like you want to do a pre-order depth-first search for all text nodes in an HTML document. Luckily, most parsing libraries, including XML ones, will give you all the nodes in pre-order as an iterator.
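As a rough illustration of that pre-order walk, here is a minimal sketch that uses only the JDK's built-in XML DOM parser. It assumes the markup is well-formed XML, which happens to be true of the sample in the question but not of HTML in general. The class name `OwnTextWalker` and the method `ownTexts` are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: collects each text node exactly once, in document order.
class OwnTextWalker {

    // Assumes the input is well-formed XML; real-world HTML needs an HTML parser instead.
    static List<String> ownTexts(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Node root = builder.parse(new InputSource(new StringReader(markup))).getDocumentElement();
            List<String> result = new ArrayList<>();
            collect(root, result);
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Pre-order depth-first traversal: handle the node itself, then recurse into its children.
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text); // skip whitespace-only text between closing tags
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        String html = "<div> DIV1 <div> DIV2 <div> DIV3 </div> </div> </div>";
        for (String text : ownTexts(html)) {
            System.out.println(text); // prints DIV1, DIV2, DIV3 on separate lines
        }
    }
}
```

Because each text node belongs to exactly one parent element, visiting text nodes rather than asking every element for its full text is what avoids the duplicated values.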

I recommend you use Jericho: call getNodeIterator(), check whether each node is a text node, and if it is, print it out. Notice the link has example code, but I will paste it here for your convenience:

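 // Needs java.util.Iterator and the Jericho classes in net.htmlparser.jericho.
 // 'segment' is the parsed Source (or any Segment); 'writer' is only used by the optional lines.
 for (Iterator<Segment> nodeIterator=segment.getNodeIterator(); nodeIterator.hasNext();) {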
   Segment nodeSegment=nodeIterator.next();
   if (nodeSegment instanceof Tag) {
     Tag tag=(Tag)nodeSegment;
     // HANDLE TAG
     // Uncomment the following line to ensure each tag is valid XML:
     // writer.write(tag.tidy()); continue;
   } else if (nodeSegment instanceof CharacterReference) {
     CharacterReference characterReference=(CharacterReference)nodeSegment;
     // HANDLE CHARACTER REFERENCE
     // Uncomment the following line to decode all character references instead of copying them verbatim:
     // characterReference.appendCharTo(writer); continue;
   } else {
     // HANDLE PLAIN TEXT
   }
   // unless specific handling has prevented getting to here, simply output the segment as is:
   //writer.write(nodeSegment.toString());
 }

The // HANDLE CHARACTER REFERENCE and // HANDLE PLAIN TEXT branches are where you want to add your string-appending code.

